Compare commits

1736 Commits

Author SHA1 Message Date
a932e2796d ⬆️ Bump dev version (#4293) 2025-10-15 18:11:52 -06:00
04fd1203af Release: v0.24 (#4292) 2025-10-15 18:10:10 -06:00
19d2f97932 Deprecate BestOfNSampler (#4291)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
2025-10-15 18:06:34 -06:00
31caf64778 Remove unused commands directory (#4258)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
2025-10-15 18:01:50 -06:00
8e2d5516ca Add accuracy reward (#4270)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-10-15 18:01:07 -06:00
94aac4a101 Remove how_to_train.md: outdated training FAQ (#4267)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
2025-10-15 23:49:04 +00:00
26b7c2507e Add support for token_type_ids in DPOTrainer (#4285)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-10-15 17:33:35 -06:00
aa25c2697c Remove using_llama_models.md: outdated Llama2-specific documentation (#4268)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
2025-10-15 14:13:27 -07:00
93c7d88563 Remove logging.md: trainer-specific metrics documentation (#4269)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
2025-10-15 14:12:32 -07:00
c7c041ecc8 Fix CI slow tests: ImportError: vLLM is not installed (#4287) 2025-10-15 18:15:36 +02:00
ef40c047aa Replace unittest skipTest with pytest.skip (#4263) 2025-10-15 18:15:28 +02:00
7e0adbc552 Fix CI dev test TypeError: unexpected keyword argument 'load_in_4bit' (#4262) 2025-10-15 18:14:49 +02:00
773afd9314 💰 RichProgressCallback enhancement (#4245) 2025-10-15 09:39:17 -06:00
966b397201 Fix CI slow test OSError: You are trying to access a gated repo (#4283) 2025-10-15 16:11:11 +02:00
927cf6ba46 Fix docstrings with Sphinx 'deprecated' directive (#4279) 2025-10-15 10:39:12 +02:00
56cb6ccf76 Fix typo in Colab link (#4276) 2025-10-14 18:51:17 +02:00
49c8f14b06 Add Qwen3-VL notebooks (SFT, GRPO) (#4275)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-14 18:45:01 +02:00
cefbacb30e Fix style with make precommit (#4265) 2025-10-14 12:13:15 +02:00
fae245a062 Use FutureWarning instead of DeprecationWarning (#4266) 2025-10-14 12:12:03 +02:00
2aa9506c69 Fix docstring interlinks (#4221) 2025-10-13 13:40:24 +02:00
d6eeb290d9 Raise deprecation warning for Python 3.9 (#4226) 2025-10-13 11:06:09 +02:00
1684ef279a Fix Python version check for skipping tests on Python 3.13.8 (#4246) 2025-10-10 17:41:24 +02:00
aab21eb5e7 Include chat_template_kwargs in apply_chat_template (#4233)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-10-10 10:39:29 -05:00
b997a31981 [Online-DPO] fix the completion_len == max_new_tokens crash (#4193)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-10-10 17:21:01 +02:00
86d1963cc1 Fix CI slow test AttributeError: 'TestSFTTrainerSlow' object has no attribute 'addCleanup' (#4255) 2025-10-10 17:19:53 +02:00
039d526d24 Deprecate unused dataset_formatting module (#4242)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-10-10 10:16:18 -05:00
bcd059a384 Remove obsolete research_projects directory (#4243)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-10-10 10:15:47 -05:00
0e57b4a9df 🧺 [3/N] Refactor _generate in GRPO/RLOO: Rely on generator for prompt truncation (#4153) 2025-10-10 10:02:11 -05:00
98488e0946 Fix CI slow test ValueError: Unknown loss type: dapo (#4254) 2025-10-10 16:37:02 +02:00
f45e86571b Fix CI ImportError for 'require_torch_gpu_if_bnb_not_multi_backend_enabled' (#4253) 2025-10-10 16:13:22 +02:00
f5827928a0 Install peft from main for CI tests with dev dependencies (#4250) 2025-10-10 16:12:15 +02:00
f853e091ea Fix CI CUDA out of memory errors by improving GPU memory management (#4238) 2025-10-10 09:49:45 +02:00
803ec0d856 Fix CI slow test ValueError: Backward pass should have cleared tracker of all tensors (#4236)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-10-10 09:28:34 +02:00
7a0a615d50 Warnings pointing to RFC (#4224) 2025-10-09 17:05:36 -06:00
c38cb69ec7 🧘 Enhance markdown style (#4235) 2025-10-09 13:49:44 -05:00
68ef15c686 Remove unused log_example_reports.py script (#4241)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
2025-10-09 09:18:48 -07:00
3dd7fc2850 Fix CI IndentationError for Python 3.13.8 (#4240) 2025-10-09 15:46:41 +02:00
51ced65153 Replace setup with pyproject in CI tests paths (#4230) 2025-10-09 09:35:08 +02:00
4bb883a6e6 Update CI Docker image to pytorch/pytorch:2.8.0 (#4232) 2025-10-09 08:09:15 +02:00
f7846321e7 Remove unused Path import in __init__.py (#4227) 2025-10-08 21:30:54 +02:00
a944890ff1 Fix callable annotations (#4216) 2025-10-08 21:21:21 +02:00
521db3520a Fix CI unittest asserts (#4234) 2025-10-08 21:18:41 +02:00
e2c97a805a Exclude vllm dependencies from dev extra (#4229) 2025-10-08 18:14:23 +02:00
d1d0407d3c 🏷️ Account for token_type_ids in DataCollatorForVisionLanguageModeling (#4190) 2025-10-08 09:34:48 -06:00
824ff8c73e Add Efficient Online Training with GRPO and vLLM in TRL to community tutorials (#4219) 2025-10-08 12:59:04 +02:00
f15399d3d3 Fix entropy and accuracy calculation for prompt_tuning techniques. (#4196) 2025-10-08 09:42:19 +01:00
cc578b6b14 🧺 [2/N] Refactor _generate in GRPO/RLOO: Use prompt_ids from generation (#4152) 2025-10-07 12:11:34 -06:00
30cf68a97b 🎨 Support mixing image+text and text-only examples (#4203)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-10-07 10:21:10 -06:00
452284b8dc Add trainers taxonomy to docs (#4195) 2025-10-07 16:06:30 +02:00
6be53e19bc [DOCS] fix prose in lora guide (#4217) 2025-10-07 10:40:37 +02:00
3080fc1bd7 Fix LoRA params in Python in LoRA without regret (#4215) 2025-10-07 09:56:04 +02:00
5d870955f8 Fix prompt-completion labeling with add_generation_prompt and warning (#4201)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-10-06 18:35:50 -06:00
8265800abf Fix trl-internal-testing/tiny-DbrxForCausalLM (#4213) 2025-10-06 15:11:16 -06:00
65eb45c32b Apply style and revert change in sft_video_llm example (#4214) 2025-10-06 13:07:18 -06:00
ae6837f8d4 Removed tokenizer/processor creation from example scripts (#4211) 2025-10-06 18:40:18 +02:00
56a8f1128b Replace setup with pyproject and fix packaging unintended modules (#4194) 2025-10-06 17:45:44 +02:00
529101537f Remove Optional from processing_class in PPOTrainer (#4212) 2025-10-06 16:04:06 +02:00
0588b1f01d Updated vLLM integration guide (#4162)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-10-06 15:57:17 +02:00
45ee98b05e Replace unittest with pytest (#4188) 2025-10-06 11:14:54 +02:00
3800a6ecc7 Hotfix: Exclude transformers 4.57.0 for Python 3.9 (#4209)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-10-06 11:13:21 +02:00
7ad9ce8acc Remove tokenizer creation from sft example script (#4197) 2025-10-06 11:04:20 +02:00
0c2dc14014 Remove custome_container for building the docs (#4198) 2025-10-06 08:31:58 +02:00
ced8b337ba [DOCS/FIX] lora without regrets - fix lr (#4207) 2025-10-06 08:23:11 +02:00
1eff7da9e0 [DOCS] Lora without regret (#4181)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-10-03 20:40:37 +02:00
1cbfb00b6a Replace remaining trainer.tokenizer with trainer.processing_class in GRPO test (#4192) 2025-10-03 09:08:53 +02:00
e086f073cf 🌡️ Have vLLM return processed (temperature scaled) log probs (#4163)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-10-01 11:58:13 -06:00
e5d437ed76 Hotfix wrong formatting of docstrings with blockquote tips (#4187) 2025-10-01 19:42:36 +02:00
d1b4691900 Fix CI ImportError: FlashAttention2 and decorator order for all parameterized tests (#4176) 2025-10-01 18:01:56 +02:00
39c603872f 🔣 Fix test: replace trainer.tokenizer by trainer.processing_class (#4185) 2025-10-01 09:16:42 -06:00
5a4021f23e Fix handling of f_divergence_type in DPO (#4171)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-10-01 09:44:14 +02:00
ea66a9e650 🧺 [1/N] Refactor _generate in GRPO/RLOO: list of ints instead of tensors (#4146)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
2025-09-30 16:22:30 -06:00
da209f89fc 🎁 RewardTrainer refactor (#4093)
Co-authored-by: juejuezi <juejuezi.git@foxmail.com>
Co-authored-by: Yi Shi <96773624+singing-cat@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-09-30 15:13:45 -06:00
ebb8899f5d Fix Flash Attention x Padding-Free loss (#4170) 2025-09-30 12:01:29 -06:00
70e2017dbc 🎞️ Support sequence classification models in clone_chat_template (#4097) 2025-09-30 11:42:56 -06:00
4368f54c97 👾 Use our own require_bitsandbytes (#4137) 2025-09-30 11:11:29 -06:00
22720d176b Add logging for training completion and model saving in training scripts (#4048) 2025-09-30 10:57:33 -06:00
c8a5add88a Fix PEFT interlinks in docstrings (#4178) 2025-09-30 18:32:23 +02:00
a7b54f988b Fix CI ValueError: Unknown loss type: dapo (#4173) 2025-09-30 18:27:21 +02:00
78bf77abbd 🅰️ Remove apex (#4139) 2025-09-30 09:52:52 -06:00
3b9ac65a05 🖨️ Print rich table for messages (#4160) 2025-09-30 09:07:57 -06:00
7a78320f58 Fix link in docstring of RLOOTrainer (#4180) 2025-09-30 16:54:55 +02:00
67e83aee90 Fix docstring interlink to parent class for NashMDTrainer and XPOTrainer (#4179) 2025-09-30 15:43:37 +02:00
a0df357591 Fix docstrings with 'deprecated' Sphinx directive (#4174) 2025-09-30 10:13:35 +02:00
864e593e9f Add missing FDivergenceType docstring (#4165) 2025-09-29 20:03:33 +02:00
6428647063 Remove unnecessary list comprehensions (#4164) 2025-09-29 20:02:46 +02:00
8a5bfecc3a 💡 Replace <Tip> with new markdown syntax (#4161)
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
2025-09-29 10:48:00 -06:00
910aeebe06 Pass required token_type_ids (#4148) 2025-09-29 17:40:11 +02:00
e208823b3e Add docstring for OnlineTrainerState (#4166) 2025-09-29 17:26:14 +02:00
f397a61e82 😷 Refactor GRPO/RLOO to isolate _generate for GRPO with replay buffer (#4158) 2025-09-26 19:31:06 -06:00
7fe9dd42ac 📽 Multi image support for GRPO replay buffer (#4157) 2025-09-26 19:11:53 -06:00
79c774af54 🟩 Drop image_split_sizes in favour of image_grid_thw (#4156) 2025-09-26 18:50:27 -06:00
9603b41d7e 😷 Refactor GRPO/RLOO to isolate _generate (#4114) 2025-09-25 20:48:52 -06:00
5ee56ed04f Fixed some <Tip> rendering issues (#4143) 2025-09-25 14:47:46 -06:00
e85e634bff Refactor trainers classes to use BaseTrainer with shared functionality (#4128) 2025-09-25 18:32:57 +02:00
d633c4337f Fix import statement and GRPO test case (#4141) 2025-09-24 16:23:32 -06:00
d1e24df031 [GRPO]: Sample from a Replay Buffer To Substitute Groups with 0 std. (#4060)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-24 21:12:16 +01:00
094e0760d4 🌵 Mark GKD trainer test as expected failure due to OOM issue (#4126) 2025-09-24 12:26:44 -06:00
01c9b4c414 🤸‍♀️ Fix DFT test (#4135) 2025-09-24 12:25:56 -06:00
18faf03c4e Fix CI: torch.AcceleratorError: CUDA error: device-side assert triggered (#4138) 2025-09-24 20:12:17 +02:00
d144e73e78 🪙 [Experimental] Support GSPO-token (#3820)
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-24 09:57:18 -06:00
be1ffe59d2 🌺 Fix GPT-OSS test (#4134) 2025-09-24 09:07:48 -06:00
fb6bdab33b Improve typing of SFT trainer (#4007) 2025-09-24 07:45:03 -06:00
526303edbd [SFTrainer]: Fix DFT Loss (#4112)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-24 11:46:12 +01:00
9e5e60c933 👩‍🦯 Fix usage of VLM using text only (#4080)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-23 12:07:25 -06:00
5c52f46f9a Remove Python version < 3.13 constraint from vllm extra dependencies (#4125) 2025-09-23 17:04:32 +02:00
deac14a39f 🧹 Remove max_batch_tokens, num_blocks and block_size from generation kwargs (#4065) 2025-09-23 08:50:52 -06:00
3d5a30bb77 👋 Remove backend parameter from GuidedDecodingParams (#4123) 2025-09-23 08:12:13 -06:00
251fdb228a 📌 Pin vLLM version (#4122) 2025-09-23 08:02:30 -06:00
37806e618b 📤 Fix a dataset loading bug in scripts 2025-09-23 05:21:40 +00:00
008c7ad9aa [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely (#4057)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-22 23:19:12 -06:00
e8ba9eaf27 📤 Fix a dataset loading bug in scripts (#4124)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-22 22:58:40 -06:00
abe07c9e32 🐯 fix: use_liger_kernel with IterableDataset (#4087)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-22 20:23:58 -06:00
fe02ea2b52 😴 Add vllm_enable_sleep_mode to RLOO Trainer (#4107)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-22 19:41:29 -06:00
68408d7219 📽 Multi image support for GRPO/RLOO (#4113) 2025-09-22 18:17:42 -06:00
94f8d00a62 🔭 Align param passing to VLM configs in generate_tiny_models (#4118)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-22 17:14:10 -06:00
b5ca3799ad 🟩 Drop image_split_sizes in favour of image_grid_thw (#4111) 2025-09-22 16:38:39 -06:00
a68b4af50f Fix code style with make precommit (#4119) 2025-09-22 13:19:54 -06:00
9f0ed8b130 CI hotfix: xfail test_training_with_transformers_paged for transformers<4.57.0 (#4120) 2025-09-22 13:19:30 -06:00
27f22ba5a1 docs: correct option name to enable vllm sleep mode (#4102)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-09-22 13:04:00 +02:00
86f74b486f Fix VLM configs in generate_tiny_models (#4101) 2025-09-20 09:49:16 +02:00
26b497ea63 Fix typos (#4109) 2025-09-19 09:44:07 -06:00
d22bdb8031 Fix typos (#4106)
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-09-19 16:58:43 +02:00
0e204482e6 Some nits GRPO and RLOO trainer docs (#4108)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-09-19 16:37:25 +02:00
3c8d7209f1 👁️ Add VLM support to RLOO trainer (#4067)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-18 21:54:06 -06:00
0450f05ad9 [GKD] Fix batchmean reduce op in GKDTrainer's loss (#4105) 2025-09-18 19:44:04 +02:00
7e2075347e Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers (#4081)
Co-authored-by: Hoesu <hoesu.chung@qraftec.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-09-18 10:31:03 +02:00
20cc58d777 ℹ️ Enable XPU for vLLM client (#4031) 2025-09-17 22:06:25 -06:00
a6c0c57f6b ℹ️ feat: Add NPU and XPU support for activation offloading (#4056) 2025-09-17 22:03:56 -06:00
10dc36d610 🌪️ [GFPO]: implement GFPO in GRPOTrainer (#3989) 2025-09-17 19:14:40 -06:00
d2d1912d96 ⚖️ Align SFT and DPO for model creation and deprecate DPOConfig.padding_value in favour or pad_token_id (#4006)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
2025-09-17 18:39:26 -06:00
08ea00289a 🧶 feat: Add WeaveCallback for W&B Weave integration (#4089)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-17 18:10:45 -06:00
4ff8b4e007 📜 Convert set to list of tags (#4092) 2025-09-17 14:05:41 -06:00
6356343fd2 Add deprecation warnings to docstrings (#4083)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-17 09:30:43 +02:00
45e59f77ea ⌨️ Pin num2words (#4094)
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
2025-09-16 08:48:09 -06:00
4bd4acf172 🏞️ Context Parallelism benchmark guide (#4075)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-16 08:46:12 -06:00
8380869d33 Community Tutorials design adaptation for videos (#4095) 2025-09-16 16:28:22 +02:00
5139af3712 Add support for testing experimental features (#4082) 2025-09-16 07:46:48 +02:00
2f46c18a66 Align slow tests with regular tests (#4085) 2025-09-16 07:22:30 +02:00
e2b18ec4e7 ▶️ Add video to community tutorials (#4090)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-09-15 10:51:23 -06:00
78f1a928ce 🗑️ Remove deprecated AlignPropTrainer, DDPOTrainer and IterativeSFTTrainer (#4068) 2025-09-15 09:56:41 -06:00
1d0b196f6b Reviewed HF jobs updated docs (#4088) 2025-09-15 08:41:08 -06:00
5a1c2f9b3b Aux loss is already included in the loss returned by Transformers (#4078) 2025-09-14 16:56:58 +01:00
9955ee7eaa 🐳 Docker update + Simplify Jobs doc (#3931)
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-09-13 18:35:55 -06:00
304eaf8053 🛠️ Fix CI (#4076) 2025-09-13 12:38:48 -06:00
69e288ebad ✂️ [GRPO VLM] Update split sizes to generalize (#4032)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-12 19:11:32 -06:00
d655ce48f8 🌾 [Experimental] BEMA for ref model (#3898)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-09-12 11:47:44 -06:00
91c4bba922 🧪 Add trl.experimental Submodule (#4073)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-12 11:02:23 -06:00
2845d024a4 Set Ruff src for first-party imports (#4074) 2025-09-12 15:43:04 +02:00
f4ff248407 ♨️ [GRPO] Fix potential hang in get_high_entropy_mask (#4041) 2025-09-11 19:33:39 -06:00
b8eb5c5d2d Improve docstring of AlignPropTrainer (#4059)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-11 11:42:31 -06:00
07f9ad982d 💡 Fix type hint to make_parser function in multiple scripts (#4050) 2025-09-11 11:36:05 -06:00
417915a3e4 Fix CI failure in slow GRPO test due to missing pillow dependency (#4064) 2025-09-11 17:35:57 +02:00
44ddc28bcd Hotfix: Add ParallelismConfig fallback for transformers with old accelerate (#4063) 2025-09-11 15:11:41 +02:00
e8b8499f1f Remove redundant 'None' from docstrings (#4058) 2025-09-11 08:16:34 +02:00
7eb7f42372 ⬆️ Bump dev version (#4054) 2025-09-09 22:17:35 -06:00
6adfd138d8 Release: 0.23 (#4053) 2025-09-09 22:16:17 -06:00
a647e5a78a 🗜 Hotfix: avoid passing quantization_config=None (#4019) 2025-09-09 14:50:15 -06:00
816ac610c0 🪪 Update SFTTrainer to handle labels correctly and add configuration example in paper index (#4051) 2025-09-09 14:49:36 -06:00
373a64a7ce 💬 Remove setting chat template in sft script (#4037) 2025-09-09 13:24:08 -06:00
09e19244c0 Improve SFT doc (#4005)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
2025-09-09 13:22:00 -06:00
a228cb51d1 Add autodoc for BestOfNSampler and improve docstrings (#4034) 2025-09-09 20:28:02 +02:00
6c6f13b5f3 🏂 Fix label shifting logic in SFTTrainer for compatibility with CP (#4038) 2025-09-09 12:08:38 -06:00
b3f9f613f9 Update VLM arch check to AutoModelForImageTextToText for DPO and Online DPO (#4049) 2025-09-09 11:10:27 -06:00
659d2c1284 🧨 DFT (#4042) 2025-09-09 08:23:30 -06:00
82b34e5723 Update transformers minimum version to 4.56.1 (#4047) 2025-09-09 16:05:04 +02:00
27e30f86ef CI hotfix: xfail test_training_with_transformers_paged (#4046) 2025-09-09 15:47:25 +02:00
af82b38482 ⚖️ Remove average_tokens_across_devices default replacement (#4039) 2025-09-09 07:39:12 -06:00
1b799a23c1 🥓 [docs] add CP docs (#3994)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-08 21:46:22 -06:00
e4ebf3ba11 Add autodoc for AlignPropTrainer and AlignPropConfig (#4033) 2025-09-08 20:13:23 +02:00
e458df650a Add missing trainer docstrings (#4030) 2025-09-08 20:12:58 +02:00
a1ee7d2182 [doc] Group paper index by trainer (#4027) 2025-09-08 18:03:48 +02:00
1d06757e57 [doc] Paper index for Truncated Importance Sampling (#4026) 2025-09-08 08:11:08 +02:00
4f9009b0f2 Fix formatting errors in docstrings (#4025) 2025-09-08 07:22:00 +02:00
c9484b161f Align docstring parameters with function definitions (#4017) 2025-09-07 10:40:09 +02:00
f5c2fec4a9 Fix typo in GRPO quickstart (#4020) 2025-09-06 10:31:09 +02:00
d1bf56020d ⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer (#3783)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-05 16:58:49 -06:00
19f9b9ee69 Add missing doc strings in SFTrainer (#4003)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-04 23:20:07 +01:00
1eb38018b7 [SFTTrainer]: Add Aux Loss for MoE models. (#4012)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-04 22:49:39 +01:00
deae7e00b8 🌵 Refactor entropy_from_logits for memory efficiency (#4013) 2025-09-04 13:59:48 -06:00
0c69fd2867 👷 Added Kernels on the Hub x TRL guide (#3969)
Co-authored-by: vb <vaibhavs10@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-09-04 15:37:02 +02:00
b5fd290b2c [SFT] fix: collator docstring (#4011) 2025-09-04 14:35:09 +02:00
67991605c0 Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations (#3990)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-03 19:59:16 -06:00
208e9f7df7 📏 torch_dype to dtype everywhere (#4000)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-03 15:45:37 -06:00
3bfa981bd2 [GRPO]: Fix Multi-GPU training for Entropy based masking of tokens. (#3964)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-03 22:10:16 +01:00
6a5dfffe56 💾 [bugfix] fix PPO save_checkpoint (#3998) 2025-09-03 14:51:34 -06:00
18633dbb06 ✖️ Support pad-to-multiple-of and padding-free (#3996) 2025-09-03 08:37:44 -06:00
e4dbf57bf2 Fixed tags shown problem in memory usage docs (#3999) 2025-09-03 08:35:51 -06:00
12fc85fd13 [GRPO] Truncated Importance Sampling to address rollout-training mismatch (#3867)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-09-03 09:50:54 +02:00
fdd6bda111 Add pre-commit and hf-doc-builder as dev dependencies (#3993) 2025-09-03 08:25:14 +02:00
cb84da0ece fix: add return to shift_tokens_right (#3987) 2025-09-02 19:18:17 -06:00
35702ce378 ⚖️ Fix scale_rewards issue in GRPO (#3992)
Co-authored-by: Leon <leon.ericsson@foi.se>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-09-02 18:02:06 -06:00
705306d78b 🎯 Add Trackio integration documentation and update TOC (#3971)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-09-02 13:06:06 -06:00
edbe8234bc [GRPO] Adds an option to sleep vllm when running in colocated mode (#3968)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-09-01 09:59:52 +02:00
4c47b32811 🪃 args.gradient_checkpointing = False instead of args = dataclasses.replace(args, gradient_checkpointing=False) (#3981) 2025-08-30 16:01:33 -07:00
92046bb972 👮 Fix GRPO CLI by setting parameters for get_soft_overlong_punishment (#3972) 2025-08-30 16:00:26 -07:00
39faf36a91 Refactor version retrieval to use importlib.metadata for improved reliability 2025-08-29 20:44:05 +00:00
1cb4150dfb ⬆️ Bump dev version (#3978) 2025-08-29 13:21:55 -07:00
3a6b365c0d Release: v0.22 (#3977) 2025-08-29 13:19:34 -07:00
7ae16d3234 🧱 PyPI publishing workflow (#3976) 2025-08-29 12:52:25 -07:00
ab984fabac Style 2025-08-29 19:50:23 +00:00
419d716a6b Fix CI (#3975) 2025-08-29 12:23:20 -07:00
f538bd3085 📜 GSPO docs - Sequence importance ratio and differences in relation to GRPO (#3816)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-29 12:08:40 -07:00
8aa0eed816 ℹ️ Validate examples on xpu (#3897)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
2025-08-29 10:56:57 -07:00
e7b37d4e8d 🔥 [Refactor] RLOOTrainer (#3801)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
2025-08-29 09:27:28 -06:00
b7676d1701 Fixed some typos and added small details about trackio to docs (#3965) 2025-08-27 17:57:19 +02:00
515e9eb255 [CI] Modify tests to handle device allocation for models (#3962) 2025-08-27 17:23:37 +02:00
26442abff2 Add HF jobs tag when creating model card via jobs (#3956) 2025-08-27 12:18:05 +02:00
0c91515b58 🧭 HF jobs x TRL guide (#3890)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-26 21:44:29 -07:00
4b3517facc 📸 Return position_ids for flash_attention_3 (#3942) 2025-08-26 20:32:17 -07:00
6f5865131b 🦥 Unsloth Docs update (#3955)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-26 20:17:21 -07:00
0c7ab76a01 LitePPO: Fix Docs for paper index (#3954) 2025-08-26 20:16:43 -07:00
ffc061b5e5 ✂️ fix: handle list tensors in split_tensor_dict function (#3951) 2025-08-25 09:56:16 -07:00
38fc1f6ecf 🤸 [SFT] Drop entropy calculation when using liger (#3947) 2025-08-25 09:14:39 +02:00
39cc9a826a [GKD] add liger loss (#3946)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-24 19:25:25 +02:00
1f15f187c3 [DPO] Adding support for different losses which are now supported by Liger (#3815)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-08-24 18:53:35 +02:00
181a841877 🗂 Update paper_index section (#3937)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-22 12:13:22 -07:00
da167d88b2 🎆 Add entropy logging in SFT (#3940) 2025-08-22 10:40:23 -07:00
2324245cad 🏌️ DAPO loss type (#3938) 2025-08-22 10:38:28 -07:00
fe44806b68 🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch (#3935)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-21 12:47:07 -07:00
251c0488c8 📦 Wrapping the main execution code to avoid multi-processing issues from vLLM (#3932)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-08-21 12:45:13 -07:00
e2eaa2334d 🗞 bugfix 'TrainerState' object is not subscriptable (#3936)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-21 12:33:23 -07:00
48d7ecc67b 🗑️ Deprecate setup_chat_format (#3929) 2025-08-20 14:06:23 -07:00
215294872e prepare_multimodal_messages fix 2025-08-20 17:25:51 +00:00
85ead751f5 ♻️ Reuse multimodal message preparation from SFTTrainer in GRPOTrainer (#3919)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-20 10:04:54 -07:00
8793a46760 🧾 Use logger.warning instead of warnings.warn (#3923)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-08-20 09:20:09 -07:00
730e19d939 🤹‍♂️ Multi-image testing dataset (#3916) 2025-08-20 08:27:14 -07:00
7233b981ce 🧹 Clean SFT tests (#3922) 2025-08-20 07:36:03 -07:00
18836f078e ✏️ Fix typos (#3921)
Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-19 10:07:34 -07:00
e575ea3815 📚 Update BEMACallback documentation to ignore docstyle and fix lag parameter description (#3917) 2025-08-18 17:57:45 -07:00
52eaa552aa ➡️ SFTTrainer for VLM: support completion-only loss (#3908) 2025-08-18 17:23:41 -07:00
0227d68e50 🌓 SFTTrainer for VLM: Support for prompt-completion data (#3907) 2025-08-18 16:46:17 -07:00
b08bc7f33e ♻️ use_cache should be set in the forward pass (#3891) 2025-08-18 14:47:33 -07:00
152235a8e5 🗑 Deprecate IterativeSFTTrainer (#3905) 2025-08-18 14:28:04 -07:00
4fcef6c32d 🐯 Support assistant-only training and Liger (#3914) 2025-08-18 14:23:46 -07:00
d15049bf71 🗳️ Extend BCO Trainer dataset format support (#3134) 2025-08-17 00:35:23 -07:00
b9718449a8 🗿 [CPO] Add AlphaPO method via CPOTrainer (#3824)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-16 23:26:02 -07:00
0e7c99ab07 Optimize completion_ids list conversion in GRPO trainer (#3874)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-16 21:47:13 -07:00
c99cd2361e 🌳 Enhance segment tree implementation for non-power-of-2 values (#3888)
Co-authored-by: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-16 21:39:57 -07:00
68937969b4 Add tests for get_position_ids_from_packed_seq_lengths (#3883) 2025-08-16 21:36:53 -07:00
a6f802f41d ⚔️ Optimize truncate_with_protected_tokens to use vectorized operations (#3875)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-16 21:17:54 -07:00
dfb96af810 ☑️ Check eval batch size in grpo (#3889) 2025-08-15 21:41:04 -07:00
485e7d1c74 ✏️ Fix SFTTrainer token accuracy computation with PromptEncoder (#3821) 2025-08-14 20:22:05 -07:00
7ee8f796ff 👔 HF Doc Builder style (#3498) 2025-08-14 18:58:12 -07:00
64b7028fe9 🪄 Improve quickstart documentation with updated API examples (#3873)
Co-authored-by: behroozazarkhalili <ermiaazarkhalili>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-14 17:17:16 -07:00
1324448c6f 👁️ VLM blog (#3899) 2025-08-14 17:09:16 -07:00
206964ce16 🎢 [Callbacks] BEMA (#3855)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-14 13:54:52 -07:00
39efa8affb 🧩 Fix reward_processing_classes validation in GRPOTrainer (#3876)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-13 15:47:45 -07:00
499d9fb32c Minor optimizations in SFT. (#3884) 2025-08-13 14:27:31 -07:00
44e6c153a5 🔮 Native VLM support for SFTTrainer (#3862)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-08-12 20:43:00 -07:00
f5b1ed24a0 Replaced unittest.TestCase with TrlTestCase that handles tmp dir (#3863) 2025-08-12 12:37:19 -07:00
7f53ac08f2 🕹️ [GRPO] Fix vllm mode validation in distributed setting (#3886)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-08-12 11:15:31 -07:00
b4c418110c 💇 Add soft overlong punishment reward function and update documentation (#3804) 2025-08-12 10:58:41 -07:00
80b660de76 ⌨️ Add py.typed (#3841)
Signed-off-by: cyy <cyyever@outlook.com>
2025-08-12 10:06:53 -07:00
65d7894b6a Integrate PEFT model preparation across trainers and utilities (#3882) 2025-08-12 10:02:27 -07:00
72d4d82b8c 🎚️ Add dataset mixer (#3791)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-08-11 20:14:50 -07:00
de27d612b0 🦦 Validate vllm_mode param in GRPO (#3866) 2025-08-08 21:00:18 -07:00
a222aeb462 🎀 New defaults: gradient_checkpointing=True (#3510) 2025-08-08 20:59:37 -07:00
cb95323429 👋 Remove --bf16 value in scripts (#3869) 2025-08-07 12:25:36 -07:00
2fb7090231 👁️ From AutoModelForVision2Seq to AutoModelForImageTextToText (#3836) 2025-08-07 08:00:16 -07:00
f23543fc96 [GRPO] 👁️ Fix vLLM server mode for VLM GRPO training incompatibility for certain AutoProcessors (#3832)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-08-07 11:04:02 +02:00
d3f63ca292 Small style fix in README (#3861) 2025-08-07 09:51:30 +02:00
ad0b9dae1e Typo fix in new model description (#3854) 2025-08-06 11:23:01 +02:00
f3289be384 🔗 Fix collection link in doc (#3852) 2025-08-05 15:51:31 -07:00
f9b0947155 ⬆️ Bump dev version (#3850) 2025-08-05 09:52:43 -07:00
46d09bd240 Release: v0.21 (#3849) 2025-08-05 09:50:17 -07:00
17393b8c82 🌺 OpenAI GPT OSS & Harmony support (#3848)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
2025-08-05 09:44:59 -07:00
21060b25a5 🪦 Remove deprecated (#3817)
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
2025-08-05 09:14:59 -07:00
5d914a4125 [GRPO]: Fix Entropy Mask Threshold Calculation when using Multi-GPU training (#3833)
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
2025-08-05 12:27:59 +02:00
67763762bc Add 'Post training a VLM for reasoning with GRPO using TRL' recipe to Community tutorials (#3843) 2025-08-04 18:46:53 +02:00
072d7dd5a6 Improve trainer doc (#3818) 2025-08-01 11:14:16 +02:00
ead5aaf934 Performance optimization: Replace list comprehensions with tensor operations in BCO and KTO trainers (#3813)
Co-authored-by: chiliu <chiliu@paypal.com>
2025-08-01 11:11:20 +02:00
dbbc770f45 fix CI docs and grpo slow test (#3814)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-07-31 14:10:00 +02:00
294e8cb093 Fix citation 2025-07-31 03:10:19 +00:00
79c5797d92 GSPO parameters update from v2 (#3798)
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-07-30 20:11:00 -06:00
ab2400029a add xpu support for mergekit (#3800)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
2025-07-30 20:07:55 -06:00
3ae60cd1b4 Add GSPO script examples (VLM/LLM) (#3810) 2025-07-30 20:07:23 -06:00
9a1e6a4508 Correction parameter description (#3803)
Co-authored-by: lunzhongwang <lunzhongwang@soulapp.cn>
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
2025-07-30 21:41:15 +02:00
90c7876da5 Add vLLM transformers backend to online methods (#3773)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
2025-07-30 18:24:50 +02:00
72bbc6dd0d Examples list updated in docs (#3806) 2025-07-30 04:09:29 -06:00
25ce0f31ae 🐙 Add MPO VLM example script (#3799) 2025-07-29 20:52:32 -06:00
9269f9f151 Fix broken PEFT+TRL docs link in using_llama_models.md (#3794)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-29 20:24:11 +02:00
eb5d0fe484 ⬆️ Bump dev version (#3793) 2025-07-28 22:11:46 -06:00
30576d2ddc Release: v0.20 (#3792) 2025-07-28 22:08:54 -06:00
5522cc0a3f 👐 FSDP2+GRPO (#3687)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-07-28 22:01:08 -06:00
303d3b1d63 📘 SFT doc rewrite (#3619)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-07-28 17:06:45 -06:00
3d765b0702 🔍 Add guidance on choosing max_length value and include visualization tool (#3630) 2025-07-28 16:29:35 -06:00
fcd3e0fd15 🌋 [GRPO] add support for pixel_attention_mask (SmolVLM2) and image_sizes (LLaVa-Next) (#3760)
Co-authored-by: sergiopaniego <sergiopaniego@users.noreply.huggingface.co>
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-07-28 16:28:29 -06:00
8a23c866f8 💬 Fix clone_chat_template vocab size and support PEFT instruction tuning (#3763)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-07-28 11:47:17 -06:00
5bb3ca4b21 📍 Support training peft model with gradient checkpointing (#3785)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-28 11:27:57 -06:00
fd70021cd7 📐 Add epsilon hyperparameter recommendation to GSPO (#3790) 2025-07-28 09:34:45 -06:00
a902450e85 🤏 [SFT] Improve doc on training on assistant only messages (#3784) 2025-07-27 22:00:53 -06:00
03034317d0 🎞️ GSPO (#3775) 2025-07-27 06:14:29 -06:00
23ea671c5e 🍿 [SFT] Fix dataset indexing which crashed with a IterableDataset (#3771)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-07-26 16:42:07 -06:00
fc08f55518 🩹 [Hotfix] Fix pynccl communicator assertion error with VLLMClient (#3774)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-07-26 16:33:18 -06:00
2f4cb38f28 📐 Fix CI and GeometricMixtureWrapper (#3779) 2025-07-26 16:15:08 -06:00
eee9ec94ef Update missing uv dep (#3772) 2025-07-25 08:00:03 -07:00
a043fd74a3 Add uv scripts headers (#3767) 2025-07-25 07:48:40 -07:00
d16b960dfa 🤓 [GRPO] Documentation for entropy metric (#3770) 2025-07-25 07:26:10 -06:00
daad892730 🌌 [GRPO] Log generation entropy (#3700)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-07-24 23:55:23 -06:00
097d6153a2 🔠 Support model str in OnlineDPO (#3765)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-07-24 23:29:54 -06:00
bc3eebb73e 🔔 Add deprecation warnings for AlignPropTrainer and DDPOTrainer (#3755)
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
2025-07-24 23:27:41 -06:00
1fb115daff Prevent NCCL Device Conflicts Between vLLM Server and Trainers (#3762)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-07-24 23:16:15 -06:00
3a40f18192 Add MPO recipe to Community tutorials (#3766) 2025-07-24 09:16:35 -07:00
56f4201db6 👁️ [GRPO] Add VLM training capabilities to the trainer (#3072) 2025-07-22 20:31:08 -07:00
a50bdc6388 👨‍💼 [SFT] Packing with completion_only and assistant_only training (#3749)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-07-21 21:49:10 -07:00
e102ac8df1 ⚰️ Remove deprecated (#3704) 2025-07-21 18:16:29 -07:00
d870230218 🐙 MPO (#2544)
Co-authored-by: ariG23498 <aritra.born2fly@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
2025-07-21 11:13:05 -07:00
68ce3a3f07 Add Object detection grounding recipe to Community tutorials (#3752) 2025-07-21 11:02:48 +02:00
5787f3bf63 [GRPO] Fix: Processing ref logprobs in batches (#3740)
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
2025-07-20 16:17:02 +02:00
116ec493fa 🏗️ Refactor top-entropy in GRPO (#3727) 2025-07-19 13:48:57 -07:00
1b17fa78ae uses steps_per_generation in vllm max_num_seqs (#3747) 2025-07-19 09:58:14 -07:00
c389599057 Add comment for average_tokens_across_devices (#3746) 2025-07-19 07:35:32 -07:00
e333da8cf0 Updated missing processing_class docs for rest of trainers (#3745) 2025-07-18 19:51:07 +02:00
c8347b4287 Updated processing_class docs for trainers (#3737) 2025-07-16 07:26:32 -07:00
8684cb4666 🕸 Use wandb.run.url instead of wandb.run.get_url() (deprecated) (#3726) 2025-07-15 18:44:18 -07:00
508d551db1 🔧 Fix GRPO sampling logic (#3725) 2025-07-15 13:39:09 -07:00
569d60e999 [GRPO] remove common activation offloading substring in all cases (#3738) 2025-07-15 13:33:48 -07:00
640a9f3916 📥 Set environment variables for vLLM distributed training in GRPOTrainer (#3723) 2025-07-11 20:15:22 -07:00
5a2b04a699 ↔️ Fix CB in GRPO (#3722)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-07-11 18:21:24 -07:00
dffd1acb94 👋 Remove --bf16 flag from training scripts (#3724)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-07-11 18:20:15 -07:00
43e6b24e70 Remove deprecated processor.tokenizer (#3720) 2025-07-11 15:46:34 -06:00
2ae43f80d9 [Online DPO] Safeguard logit slice against empty prompt (#3719) 2025-07-11 12:40:17 +02:00
c949b66f01 Fix ORPOTrainer loss scaling with gradient accumulation (#3716) 2025-07-11 00:37:00 +02:00
97085539a3 BUG: Disregard pad token entropies for entropy threshold calculation (#3715) 2025-07-10 16:06:26 +02:00
68ed863eed ⚗️ Tiny MoE for test (#3712) 2025-07-09 08:25:47 -07:00
0462dd7f12 [SFT] Add seq_lengths to signature columns (#3699)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-08 19:20:13 +02:00
68db24e010 🔭 Fix package discovery configuration in setup.cfg (#3703) 2025-07-07 19:50:56 -07:00
2d086f26a5 📣 Use explicit version for checking datasets version (#3702) 2025-07-07 11:35:57 -07:00
b674989f15 ✂️ [BUG when vllm and prompt_truncation are used]: Strip out pad tokens in truncated prompt text (#3698)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-07-07 11:29:34 -07:00
0353d67661 Fix mislabeling: "First-fit decreasing" is actually "Best-fit-decreasing" (#3696)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-07-07 19:47:18 +02:00
d98d53983b Add type hints to dpo_trainer.py (#3631)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
2025-07-06 10:33:36 +02:00
c30344e9ee Restore the effect of liger_kernel's monkey_patch on global modules in UT. (#3680)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-06 09:40:44 +02:00
db19d79e30 [CI] Fix slow grpo CI (#3693) 2025-07-04 19:46:21 +02:00
e8abe03a06 [fix] type error of quantile (#3667) 2025-07-04 17:30:26 +02:00
7eb52c1b4e fix: support dict access in SFT Trainer (#3677)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-04 11:27:46 +02:00
686cd35a72 Fix non-serializable torch.dtype bug in VLLM weight sync (#3690)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-03 21:25:29 +02:00
601a25693e Update steps_per_generation default description grpo_config.py (#3685) 2025-07-03 20:47:05 +02:00
d42188b17f Support datasets 4 (#3688)
Co-authored-by: Quentin Lhoest <quentinlhoest@Quentin-Ls-MacBook-Pro.local>
2025-07-03 11:45:37 -06:00
4ccc5ca7bd Faster position_ids computation for FFD packing (#3649)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-07-03 13:43:22 +02:00
d1e116c67d [SFT] drop attention_mask if we have position ids for fa2 (#3673)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-07-03 09:18:41 +02:00
90cdf96418 🖼️ Add mlflow support for generate_during_eval DPOTrainer (#3660)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-07-02 14:42:11 -06:00
b520378b97 Enable completion-only loss in SFTTrainer when using Liger Kernel (#3674)
Co-authored-by: kwhitecross <kwhitecross@cs.umass.edu>
Co-authored-by: shirinyamani <75791599+shirinyamani@users.noreply.github.com>
2025-07-02 12:12:14 -06:00
e04f7eb3b9 feat: Pass trainer state to reward functions (#3669)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-01 14:16:26 +02:00
02cce41d06 Add support for CB with native transformers (#3471)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-07-01 12:26:09 +02:00
6a6d4345c9 Add paranthesis to correct the check. (#3658) 2025-06-28 07:19:01 +02:00
79ec242aef [GRPO] Make sure special tokens aren't lost when truncating prompt. (#3651) 2025-06-26 09:29:20 +02:00
7e8ef867ae Add entropy based filtering inside the GRPOTrainer. (#3563)
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-25 22:38:41 +02:00
32df09358e 🤝 validate gradient_accumulation_steps vs steps_per_generation for on-policy GRPO (#3493)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-06-25 18:03:22 +02:00
0336e4bcbb GRPO script reward_funcs error (#3639)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-06-25 16:47:08 +02:00
ab331bfd56 Update dpo_vlm.py (#3629)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-06-24 13:56:34 +02:00
84d7b5bbfa env var for vllm colocate exp added (#3638) 2025-06-24 13:44:19 +02:00
b40c959c00 fixing num_processes (#3637) 2025-06-24 13:42:58 +02:00
34fa6b9af2 🐛 fix grpo generation_kwargs (#3634)
Signed-off-by: ahatamizadeh <ahatamizadeh@nvidia.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-06-24 11:43:45 +02:00
eef7a43427 Revert "🔍 Add guidance on choosing max_length value and include visualization tool"
This reverts commit 89c699f59839bb1e2917c2da770015320d087a88.
2025-06-22 23:08:26 +02:00
89c699f598 🔍 Add guidance on choosing max_length value and include visualization tool 2025-06-22 23:06:36 +02:00
559a99f053 ⬆️ Bump dev version (#3626) 2025-06-20 19:02:19 +02:00
5b3ea9dd43 Release: v0.19 (#3625) 2025-06-20 18:43:31 +02:00
c262674ea7 🧰 [SFT] Tool support (#3597)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-06-20 17:39:24 +02:00
5c3dd3ab24 🔍 Add test to verify chat template consistency (#3624) 2025-06-20 17:16:52 +02:00
4c92de0000 ⚔️ Fix bf16 fp16 config conflict issue (#3598)
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-20 15:00:39 +02:00
67f17f7ea4 📜 Add chat_template_path parameter to SFTConfig (#3599) 2025-06-20 14:15:03 +02:00
37a71e82bf 🧬 Add generation_kwargs as a property of GRPOConfig to support additional generation arguments. (#3617)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-20 14:14:48 +02:00
b0958c6f8f [GRPO] Fix prompt truncation (max_prompt_length) with vLLM. (#3601)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-20 12:12:33 +02:00
8bad863ffa Add vllm_gpu_memory_utilization recommendation script (#3554)
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-19 23:17:47 +02:00
d00441505d 🎁 Put the reward computation in a separate function (#3620)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-19 22:59:44 +02:00
9554c2f319 🤵‍♂️ SFT on assistant messages only (#3586)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-06-19 22:59:26 +02:00
712afd5dd1 🦘 Skip no-op ChatML conversion for datasets already in ChatML format (#3594) 2025-06-19 22:37:58 +02:00
086e9d56e3 📚 SFTTrainer support chat template kwargs (#3609) 2025-06-19 22:12:30 +02:00
5206c927f6 🔖 Fix: ensure user-provided labels are retained in self._signature_columns (#3589)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-19 16:03:58 +02:00
e4b586a389 👔 Apply doc-builder style (#3615)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-06-19 12:02:51 +02:00
0576346758 🏛️ Fix CI and Iterative SFT (#3614) 2025-06-19 11:33:20 +02:00
e63588a56a 🏁 Refactor reference model initialization in GRPOTrainer (#3575)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-06-18 16:20:36 +02:00
d9d25a71b2 [SFT] Clarify default collator docs (#3606) 2025-06-18 14:43:09 +02:00
58ea227d4c Change enforce_eager default value in vLLM server. (#3607) 2025-06-18 14:42:53 +02:00
a768484d47 Fix Typos in Comments and Improve Clarity in Trainer Modules (#3596) 2025-06-18 14:42:42 +02:00
d17ec7ad72 Fix: list-typed tags handling in Trainer::create_model_card (#3613) 2025-06-18 14:32:36 +02:00
ed9b78a5f7 🗳️ Remove logging_steps parameter from for simpler setup (#3612) 2025-06-18 13:52:21 +02:00
d6a969ff7d ♻️ Avoids redundant calculation of ref logps in the new policy update loop (#3600)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-18 11:56:45 +02:00
8a235a9b71 Fix Typo in Documentation and Notebook; Improve Library Installation Comment (#3593) 2025-06-15 16:46:41 +02:00
afa06c3b56 Fix typos and improve metric descriptions in documentation (#3585) 2025-06-15 16:00:38 +02:00
77ec43ce31 🛡️ Adding trust_remote_code to vllm-serve (#3588)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-15 16:00:07 +02:00
4126803875 💬 Fix setup_chat_format and add clone_chat_template (#3404)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-06-15 15:59:42 +02:00
91b3f5ee9a 💡 Fix wrong type hint for formatting_func argument in SFTTrainer (#3584)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-15 15:38:12 +02:00
b6e255a9d3 💡 Fix type hints in trainer/utils.py (#3591) 2025-06-15 12:43:54 +02:00
0d54f05fa3 Adjust max_num_batched_tokens (#3565)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-13 16:08:07 +02:00
72c91e77f5 📨 [SFT] Tokenize directly when applying the chat template (#3572) 2025-06-13 16:03:55 +02:00
32ffa1170e 🎀 New defaults: bf16=True (#3515) 2025-06-13 13:40:12 +02:00
fd4c9e3b72 Add Community Tutorial: GRPO text summarization example with Unsloth optimizations (#3576) 2025-06-13 13:08:10 +02:00
c5e64b479b 🫸 Push model card with checkpoint (#3550) 2025-06-13 11:18:02 +02:00
15ff54790b 🏗️ Add test for training with multiple dataloader workers and update worker initialization for compatibility with transformers 4.52.0 (#3568) 2025-06-12 19:13:19 +02:00
3d077fd3de Add support for IterableDataset in DPO Trainer (#3559)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-06-12 13:06:34 +02:00
53c4a7c2b8 [Liger] liger DPO support (#2568)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Vaibhav Jindal <32337828+vaibhavjindal@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-12 12:25:12 +02:00
aff16a5b2f Fix dev version (#3570) 2025-06-12 10:06:20 +02:00
1314aac502 ℹ️ Unify autocast behavior to torch.autocast and make it cover XPU (#3541)
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-10 09:13:00 +02:00
e99a8aec4b Update tests_latest.yml (#3558) 2025-06-09 21:15:17 -07:00
b9572737b4 🆙 Bump transformers to 4.51 and use _VALID_DICT_FIELDS (#3553) 2025-06-09 21:50:57 +02:00
4cafb2744a 🧮 Rearrange DPOTrainer (#3501)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-09 19:44:24 +02:00
c49c7b7d4e 🛋️ Fix CI and bump accelerate (#3551) 2025-06-09 14:56:20 +02:00
b773a4c191 💽 [TRLParser] Fail when unknown args are provided in the config file. (#3543)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-05 21:43:21 -07:00
7c8355d038 📦 Packing with flash attn kwargs to avoid cross-contamination (#3526)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-05 21:18:46 -07:00
50a2fa8ec8 Faster FFD packing (#3537)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-06-04 14:37:28 -07:00
0333108854 🎀 [SFT][Bugfix] sets average_tokens_across_devices to true in SFTConfig (#3538)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-04 14:20:57 -07:00
6ffde23a45 💭 [Data] Fix DeepSeek-R1 case (#3522)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-06-04 11:48:16 -07:00
6f288c2d9d 🐳 Add DeepseekV3 model configurations and update tests for new models (#3536) 2025-06-04 09:34:28 -07:00
8cf6220cef 🧭 Remove useless transformers version checks (#3534) 2025-06-04 09:03:38 -07:00
da7b3fe745 🎯 Don't use getattr to get gradient_checkpointing (#3535) 2025-06-04 09:03:24 -07:00
24ef9eb8e7 📰 Add blog "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" (#3527) 2025-06-03 13:22:50 -07:00
b0eff324aa 🎀 New defaults: logging_steps=10 (#3514) 2025-06-03 11:45:08 -07:00
026fc9439c 🪦 RIP trl chat (#3531) 2025-06-03 12:19:03 -06:00
a912ad1bcf 🎀 New defaults: preparing the new structure (#3530) 2025-06-03 10:48:26 -07:00
fef915e36f 📉 FFD packing (#3521) 2025-06-02 13:15:22 -07:00
0db63f0f50 Add "🐯 Liger GRPO meets TRL" (#3525) 2025-06-02 11:32:31 -07:00
7359ddcc6f 🎀 New default: beta=0.0 for GRPO (#3516) 2025-05-30 09:51:07 -07:00
0844936930 🧭 Patch release guide (#3512) 2025-05-30 09:50:31 -07:00
897c87fa91 📚 Fix doc building by removing vLLM from dev dependencies in setup.cfg (#3511) 2025-05-29 11:39:40 -07:00
c13de6f9c0 📎 Fix clip ratio logging (#3506) 2025-05-28 08:46:35 -07:00
722847abbc ⬆️ Bump dev version (#3505) 2025-05-27 19:03:59 -07:00
ef4b0b225c Release: v0.18 (#3504) 2025-05-27 18:43:58 -07:00
8e8e62b380 ✂️ [DPO] Fix truncation keep_end leading to zero'd out samples (#3398)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-05-27 16:36:01 -07:00
824100ce25 🏰 [vllm] Support base_url parameter for vLLM client initialization (#3324)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-27 16:05:40 -07:00
4e7f0a5eb9 🤧 LD-DPO support (#3458)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-27 16:05:30 -07:00
17a9069710 📏 Completion length logging fix + remainder logging fix (#3482)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-27 14:31:03 -07:00
cb07c44920 Forgotten commit from #3502 2025-05-27 20:02:22 +00:00
0b6a1874f1 🔭 [GRPO] Log advantages and fraction of samples with an std of zero (#3502)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-27 12:58:41 -07:00
ac18c9d532 🐌 Clean two-sided clipping (#3499) 2025-05-27 09:39:37 -07:00
d1174adc5b 🛠️ Initialize reward_kwargs to prevent UnboundLocalError in GRPOTrainer (#3459)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-26 18:28:27 -07:00
cd838417e4 👇 Update grpo.py to fix bugs for cli grpo --reward_funcs my_lib.my_reward (#3454)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-26 17:59:57 -07:00
c7e3f096a5 [GKD] fix the gkd script (#3497) 2025-05-26 20:22:15 +02:00
5c08897570 [GRPO] disabling top_k sampling default (#3494)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-05-26 11:32:07 +02:00
3ef9faf257 [Docs] sync logging doc to current metrics (#3478) 2025-05-25 17:46:28 +02:00
9ac614fb08 Fix mis-aligned prompts and completions in colocate mode (#3491)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-05-24 16:50:45 -06:00
29401e790e [Doc][SFT] Update sft_trainer.md. link prompt-completion dataset example (#3486)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-05-24 19:13:00 +02:00
31bf3f9244 Fix typo (#3489) 2025-05-24 13:24:15 +02:00
7f32792c07 [CI] fix sampler api to make the CI green (#3488) 2025-05-23 17:32:23 +02:00
3d8727918a [SFT] update minimal liger version (#3483) 2025-05-23 13:44:20 +02:00
65245f6be8 Update .pre-commit-config.yaml (#3479) 2025-05-22 16:08:23 +02:00
a528b9c465 [NashMD] fix the edge case where the model is a peft model (#3473)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-05-20 17:02:04 +02:00
e0dd525021 🙅 PPO value_model can't be None, so it shouldn't be Optional (#3300) 2025-05-19 17:01:08 -07:00
64aa06499b enable activation offloading on XPU (#3444)
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-05-19 11:56:14 +02:00
be93a0c30c enable vllm c-s tests on XPU (#3445)
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-05-19 11:55:57 +02:00
f9fbd91ea9 [CI] fix CI failure of transformer dev (#3457) 2025-05-19 10:08:42 +02:00
54d4f6b13a 🎁 Reward submodule (#3430) 2025-05-15 19:10:22 -07:00
05bc43e960 feat: Implement Two-Sided Clipping for GRPO Trainer (#3434)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-05-13 20:36:39 +02:00
d3dc8ff654 use device agnostic empty_cache in ppo & rloo (#3439)
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-05-13 20:10:14 +02:00
21738c3732 enable trl env on xpu (#3438)
Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-05-13 11:36:01 +02:00
eab175d434 🏹 Support kv_cache_dtype to quantize kv-cache in vllm (#3422)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-05-08 17:11:16 -07:00
4da4dc9117 Update README.md 2025-05-07 20:49:35 -07:00
6b3a02385d Update README.md (#3420) 2025-05-07 20:48:22 -07:00
abbbb93d6a 🧪 Testing support for Qwen3 tiny (#3415) 2025-05-07 19:32:42 -07:00
cafa663c84 [Models] Activation checkpointing from TorchTune (#2954)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: DanFosing <danfoss12340@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Robert <robert.veres00@gmail.com>
Co-authored-by: Robert Veres <robert.veres@languagetool.org>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Mathew Shen <datahonor@gmail.com>
Co-authored-by: Ishan Kumar <ishankumar216@gmail.com>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: tpoisonooo <khj.application@aliyun.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-05-07 12:36:11 +02:00
fd04a5461a 🐍 Support Python 3.13 (#2593) 2025-05-06 21:38:23 -07:00
56e5766205 🎁 Reward takes completion ids (#3272)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-05-06 10:34:50 -07:00
89d44caece 📝 vLLM-integration documentation (#3376)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-06 09:37:02 -06:00
adfa7fd59a 🎲 [GRPO] Shuffle mini batches (#3391)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-06 11:09:00 +02:00
cf5183db7f 💔 [GRPO] Decouple gradient accumulation from the number of minibatches generated (#3388)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-05-06 09:59:32 +02:00
1954c02d86 🤝 Compatibility of the TRL CLI with accelerate arguments (#3409)
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
2025-05-06 00:09:23 -07:00
45f4c58832 ✌️ Add support for FSDP2 (#3317)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-05-06 08:29:11 +02:00
cc044e35b2 🕊️ Un-restrict diffusers (#3407) 2025-05-02 15:06:53 -07:00
999acd53ec 🕺 Migrate setup configuration from setup.py to setup.cfg and make rich an optional dep (#3403)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-05-02 11:03:57 -07:00
8606b1ad09 🪪 Remove license classifier (#3402) 2025-05-02 10:03:39 -07:00
a673da5773 👉 [DPO] Model forward pass padding side fix (#3307)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-05-01 20:37:55 -07:00
00b8e311aa 🦁 Fix liger initialization (#3401)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-05-01 20:36:46 -07:00
c163cf5081 💔 [SFT] Raise error when formatting_func is used with completion_only_loss (#3385)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-01 16:23:27 -07:00
bc9c019c43 [IterativeSFT] Small refresher (#3378) 2025-05-01 16:18:41 -07:00
18596cf232 🧑‍🤝‍🧑 Co-Locating vLLM w/ training for higher throughput and GPU utilization (#3394)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-01 16:17:26 -07:00
280d35301b 🌊 Add MLflow metrics in profiling context (#3400)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-05-01 16:15:38 -07:00
13fa8402a3 [GRPO] Reference model initialization bug fix (#3397) 2025-05-01 17:31:21 +02:00
09b669fbf7 [🐯+GRPO] Support FSDP + Fix bug when using LigerGRPO with DDP (#3260)
Co-authored-by: Ubuntu <azureuser@liger-ci-h100-vm.kvghai4yzzmufguwws3040dwlf.dx.internal.cloudapp.net>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-04-30 22:49:45 +02:00
01d0be15cb Deprecate TextEnvironment and tools (#3389) 2025-04-29 20:25:36 +02:00
3a42af1c78 DPO fixes for evaluations (#3377) 2025-04-29 17:16:30 +02:00
aaf39604ba PEFT support for Liger GRPO (#3355)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-04-29 17:05:35 +02:00
2bf48478e8 📋 Allow calling trl cli in sft mode with config file (#3380)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-04-28 14:23:42 -07:00
a8cfca6d01 ⚰️ Remove deprecated (#3364) 2025-04-26 11:11:35 -07:00
1bca49515e Better guards for DeepSpeed imports (#3351) 2025-04-26 10:18:11 +02:00
39e96394a9 🎭 Fix train and eval mode checking in GRPOTrainer and SFTTrainer (#3337)
Co-authored-by: Jiaming Ma <jiaming.ma@connect.polyu.hk>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-04-25 17:42:43 -07:00
8e6ed93dfd 🥸🔢 Adding pad_multiple to SFT trainer (#3365) 2025-04-25 18:12:35 -06:00
29c5e05e3a 🔢 Pad to multiple of (#3362) 2025-04-25 09:53:20 -07:00
a9b27f82d6 ⬆️ Bump dev version (#3357) 2025-04-24 16:22:12 -07:00
cd6b3de356 Release: v0.17 (#3356) 2025-04-24 16:15:45 -07:00
36685c8bba Up to 4x faster: Data Parallel for vLLM server (#3310)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Shirin Yamani <75791599+shirinyamani@users.noreply.github.com>
2025-04-24 15:14:16 -07:00
89556c8cbf 🍡 Fix using reward model and DeepSpeed ZeRO 3 (#3326) 2025-04-23 15:09:33 -07:00
f3e8c23044 Define default chat template for SFT (#3309)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-23 15:49:42 +02:00
9ee6c3aa56 🏁 Fix adding special tokens in SFT (#3328) 2025-04-22 17:51:51 -07:00
ef05331752 [CPO] Check that max_prompt_length < max_length (#3341) 2025-04-22 15:45:15 -07:00
05e2ba6e01 🦄 Add optional uvicorn log level for vLLM serve (#3338)
Co-authored-by: Jiaming Ma <jiaming.ma@connect.polyu.hk>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-22 11:45:13 -07:00
1b4f189e09 💡 Fix type hint in _generate_and_score_completions (#3336) 2025-04-22 08:57:29 -07:00
1faa7f9b36 🧸 Fix unset tokenizer pad_token (#3290)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-04-21 17:20:09 -07:00
66e6eab9bb [doc] Update sft_trainer.md in table x->✓ (#3313)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-21 17:05:20 -07:00
27af0aaf4a Fix typo in text_environments.md (#3305) 2025-04-21 16:39:55 -07:00
b4ffda769e 🙋 Add Optional Eager Execution Mode for vLLM Serving (#3335)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-04-21 15:33:59 -07:00
0dad4eb7ca 🎲 [GRPO] Make training dataset shuffle optional (#3334)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-04-21 14:34:31 -07:00
c82f626f94 Empty commit to test new protection rules 2025-04-20 23:07:28 +00:00
33add19161 Empty commit to trigger CI 2025-04-20 23:00:31 +00:00
294f35bf3c ☝️ [GRPO] Generate once per effective batch (#3283)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-04-17 16:35:58 -07:00
9874b3aa04 [GRPO] Add metrics for low and high clipped token probabilities (#3289)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-16 14:43:34 +02:00
1e61f6cc5a 🅾️ Fixes typo in SFTTrainer (#3282) 2025-04-15 15:23:40 -07:00
27adc30162 🧗 Add Ascend NPU support for vLLM server (#3286)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-15 15:22:46 -07:00
df737f99c1 🏷️ Fixed naming error in output_dir for Gemma 3 VLM script (#3297) 2025-04-15 14:51:26 -07:00
c04e84c454 Expose EOS token in SFTConfig (#3299)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-15 21:53:28 +02:00
d625c5533a ⏱️ Fix vLLM server to support V1 Engine (#3276) 2025-04-10 18:29:50 -07:00
6cdd24a360 🦾 Test vLLM client-server (#3277) 2025-04-10 18:29:04 -07:00
8b38570258 🕊️ Un-restrict diffusers (#3274) 2025-04-10 07:24:11 -07:00
95b1a9f612 Add Fine-tuning a Multimodal Model Using SFT (Single or Multi-Image Dataset) guide to docs (#3235)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-04-10 09:33:41 +02:00
5c1511423b 🔗 Fix Dr. GRPO paper link (#3275) 2025-04-09 19:31:15 -07:00
5e2e9cb442 🩺 Dr. GRPO loss (#3256)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-04-09 11:13:22 -07:00
227df8271e ♾️ [CI] Remove test_raise_error_not_causallm (#3265) 2025-04-09 10:39:36 -07:00
ae1581474e 🚧 Temporarily restrict diffusers to <0.33.0 due to ftfy optional dep issue breaking doc builds (#3273) 2025-04-09 10:20:43 -07:00
47b9515fb1 👎 [GRPO] Adds option to disable dropout (#3234)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-04-09 09:59:06 -07:00
c4891dcfee 🕷 Fix online DPO crash when model is a DataParallel object (#3225)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-09 09:29:13 -07:00
055cee255a Revert "reward takes completion ids"
This reverts commit 73a2fb05545db3c2e92f9311473738278b0d9cd0.
2025-04-09 14:41:55 +00:00
73a2fb0554 reward takes completion ids 2025-04-09 14:40:42 +00:00
982ba08092 🐯 is_liger_kernel_available with min version (#3266) 2025-04-09 06:43:59 -07:00
e03e7acc5c ⛏️ Add cli dict parsing for grpo_config (#3082) 2025-04-08 15:55:33 -07:00
9df19e8a75 📜 Fix license and copyrights (#3264) 2025-04-08 15:22:58 -07:00
1d7b8c4f70 Overlong-filtering for GRPO (#3248)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-08 12:52:52 -06:00
7e170612a4 💠 Fix multi-gpu padding free (#3245) 2025-04-08 11:43:56 -07:00
559724ee2c 📦 [SFT] Deprecate batched formatting_func (#3147)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-04-08 09:42:17 -07:00
a5a46725c8 🗑️ Deprecate ConstantLengthDataset (#3242) 2025-04-08 08:03:57 -07:00
b6bcafb8bb 🏃 Fix and make CI faster (#3160) 2025-04-08 06:12:08 -07:00
4bfb8eb0d1 🔭 Add support for better KL estimator (k3) in PPOTrainer (#3240)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-05 22:33:28 -07:00
4d66bad208 ☑ Update PULL_REQUEST_TEMPLATE.md (#3241) 2025-04-05 16:28:19 -07:00
e90117b3e1 PPOTrainer: fix progress bar for num_mini_batches > 1 (#2531)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-05 15:47:28 -07:00
31b54a6237 🌊 Add error for iterable datasets in GRPOTrainer (#3216) 2025-04-05 15:41:53 -07:00
17e33cdaa0 🎀 Simplify logging text (#3219)
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
2025-04-05 15:38:32 -07:00
5a0cebc786 📢 Improve GRPO trainer error message for invalid num_generations (#3199)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-04 21:52:00 -07:00
65308cfd84 ⏯️ Fix logging when resuming from checkpoint GRPO (#3185) 2025-04-04 21:51:36 -07:00
1755e03f6f Update ruff to 0.11.3 and base Python version to 3.9 (#3230)
Signed-off-by: cyy <cyyever@outlook.com>
2025-04-04 13:50:14 +02:00
793735a698 🐯 Integrate Liger GRPO Loss to GRPO Trainer (#3184)
Co-authored-by: Ubuntu <azureuser@liger-ci-h100-vm.kvghai4yzzmufguwws3040dwlf.dx.internal.cloudapp.net>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-03 19:17:00 +02:00
e70a0efeca Group completion metrics by common prefix (#3212)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-04-03 08:11:35 +02:00
7eaca76ed1 📚 Accumulate completions for logging (#3217) 2025-04-02 17:00:43 -07:00
657f9ce6ee 🗝️ Fix type hint in vLLM client (#3205) 2025-04-02 09:40:21 -07:00
485852c942 😷 Fix SFT masking EOS when equal to PAD (#3200) 2025-04-02 08:56:05 -07:00
9f3702f6be [GRPO] Improve completion length logging (#3188) 2025-04-01 10:00:40 +02:00
e751a16df5 🐗 [CI] Fix trufflehog false positives (#3192) 2025-03-31 11:01:55 -07:00
582bc5684b Show unique prompts in GRPO WandB tables (#3191) 2025-03-31 18:50:21 +02:00
c5ba70d4fc Fix breaking typo for flash_attention reducing_memory_usage.md (#3190) 2025-03-31 12:17:10 +02:00
5b586da3cc 📎 Fix is_clipped to compute the effective clip_ratio (#3175)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-30 22:24:14 -07:00
488025cd87 ⏯️ Fix: handle None inputs when resuming GRPO Trainer from checkpoint (#3148)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-30 21:25:53 -07:00
2594cb39de ❤️‍🩹 [CI] fix transformers dev CI failure (#3176)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-29 18:39:40 -07:00
2fe2337067 🏃 Migrate CI to self-hosted runners (#3174) 2025-03-29 11:56:44 -07:00
f6b4d6e569 [Liger] Liger KTO support (#2812)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-28 20:56:59 +01:00
26d86757a7 💎 Gemma 3 VLM SFT example script for single-image and multi-image (#3131)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-26 08:16:02 -07:00
9771f259ed 💰 Richer rich table - log all the rewards (#3156) 2025-03-26 07:45:51 -07:00
7bdedd4075 👨‍🍳 vLLM serve: destroy process group on exit and pass worker_cls as string (#3159) 2025-03-26 07:00:57 -07:00
a069a2f19c 🔫 Disable triggering CI when PR is draft (#3154) 2025-03-25 10:59:01 -07:00
ea45f513f3 ⚰️ Remove deprecated (#3153) 2025-03-25 09:57:50 -07:00
a91023990a 🩹 Fix CI (#3155) 2025-03-25 09:16:23 -07:00
1a9387b922 Enable number of printed completions to be set (#3149)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-25 08:47:13 +01:00
1884ff1bb8 🤝 Align GRPO equation doc with the implementation (#3151) 2025-03-24 11:37:06 -07:00
bfe2075608 🐇 [Research] Layer Skip SFT (#3111)
Co-authored-by: Mostafa Elhoushi <m.elhoushi@ieee.org>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-24 11:02:00 -07:00
6067e2a669 BCOTrainer version upgrade fixes (#2867)
Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
2025-03-24 10:55:00 +01:00
dee37342a8 📊 Fix clip_ratio logging and better document logged values (#3145) 2025-03-23 16:05:42 -07:00
8037f18cdf Fix: Multi gpu hang for ORPO and CPO Trainer (#3069) 2025-03-23 16:25:15 +01:00
a0a53171cc ⬆️ Bump dev version 2025-03-22 21:14:59 +00:00
23a635ed61 Release: v0.16 (#3137) 2025-03-22 14:03:54 -07:00
9b38b0b5ee ⚖️ Add option not to scale rewards (Dr. GRPO) (#3135) 2025-03-22 13:47:52 -07:00
0f26049ea2 ☎️ Documentation for disable gathering of model weights for generation in DeepSpeed ZeRO-3 (#3136) 2025-03-22 13:29:47 -07:00
7511aa4e36 Pack 300 times faster, truncate 100 times faster (#3009)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-22 12:33:31 -07:00
f713f614e9 🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication (#3094)
* 🚀allow GRPO to connect to VLLM in remote/local node with NCCL communication

* Update trl/extras/remote_vllm_helper.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* use argparse for options

* add imports for remote vllm helper

* formatting

* fix arguments

* use cli options

* vllm serve

* clean server

* better naming

* client

* style

* new params in generate

* this method is the new default

* update config

* do not use asserts

* update config

* separate host and port

* proper deprecation

* deprecated arg in the vllm server

* simplify moving

* document host and port

* style

* update trainer

* new generate args

* update doc

* Fix for zero3

* Better naming

* Remove remote_vllm_helper

* remove grpo_with_remote_vllm

* remove cloudpickle from deps

* Some consistency

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update setup.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add revision argument to vllm server

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Reset the prefix cache after updating weights

* Update vllm_client.py

* Update vllm_client.py

* Update vllm_serve.py

* Add health check endpoint to vLLM server

* connection timeout

* style

* fix doc language hint

* move reset_prefix_cache to its own endpoint

* async

* merge peft adaptor to send to vllm

* Looks simple. Wasn't.

* Peft compatibility

* Update docs/source/speeding_up_training.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/speeding_up_training.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/extras/vllm_client.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* GatheredParameters can be disabled

* gather and ungather peft weights within the same deepspeed context

* use is_vllm_available

* minor consistency fixes

* fix error when deepspeed is not installed

* fix deepspeed import when not peft

* simpler

* multinode doc

* minor code and comments changes

* style

* optional deps

* vllm_server_timeout as arg

* small refinement in doc

* update deps

* Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution

* Revert "Fix VLLMClient argument in grpo_trainer; Add zero3+peft vllm transfer solution"

This reverts commit d759c9c4d12ff4531482c465c6257a59987ba748.

* log num_tokens

* disable vllm test (in the future we'll add a mock for vllm server for them)

* style

* fix ds3_gather_for_generation

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-03-21 12:12:08 -07:00
a34987956c 🎬 Clip higher (#3118)
* epsilon range added

* epsilon doc str updated

* test removed

* pre-commit run

* Update trl/trainer/grpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* upper epsilon updated

* precommit updates added

* minor format and dtype fixes

* moving upper bound computation in init

* hf.co for paper link

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-19 19:28:19 -06:00
0f88c179e3 Merge pull request #3079 from huggingface/flexible_reward
Flexible_reward
2025-03-18 11:32:16 -06:00
beda4328cc Use main process for dataset.map (#3106) 2025-03-18 17:36:12 +01:00
07cfe1677e add "_prepare_fsdp" for DPOTrainer (#2539)
* enable prepare fsdp

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* remove activation_checkpointing

* move to utils.py

* fix style

* Update utils.py

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-03-17 14:37:15 +01:00
9f7755d8ed 🕊️ Padding-free for SFT (#3076) 2025-03-15 12:52:24 -07:00
4e3f569eb8 Update grpo_trainer.md [ci skip] 2025-03-14 18:48:50 -07:00
979fda1548 title multi-task added for example4 2025-03-15 01:19:31 +00:00
f6fb6a88a9 precommit fixed applied 2025-03-15 01:10:32 +00:00
6cbf8fbc9f Merge branch 'flexible_reward' of github.com:huggingface/trl into flexible_reward 2025-03-15 01:08:08 +00:00
5cb390cd30 Add EOS token to processed input in SFT (#3091)
* Add EOS token to processed input

* Update sft_trainer.py

* fix test
2025-03-14 18:06:15 -07:00
b3c391e628 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 19:03:31 -06:00
1b85ca6147 grpo doc updated 2025-03-15 01:03:04 +00:00
e7a1290b0a Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:57:13 -06:00
3822edd67b Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:56:54 -06:00
230455cab0 Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:56:33 -06:00
08f014d559 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:50:56 -06:00
10740333bd Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:49:07 -06:00
058a733c30 Update tests/test_grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:48:59 -06:00
3f193972d8 Update tests/test_grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:48:39 -06:00
b575596b89 Update tests/test_grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:45:55 -06:00
118c43f0e0 Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:44:05 -06:00
40b1c33edf Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:38:08 -06:00
1a2e74cc5a Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:35:38 -06:00
80f7dcb16d Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:35:04 -06:00
4404ccd24a Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:34:50 -06:00
39f77ca2d8 Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:34:36 -06:00
52085dd96b final version 2025-03-15 00:19:34 +00:00
c7a1c95017 Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:07:38 -06:00
3003058418 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:07:31 -06:00
a759cee2e0 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:07:24 -06:00
0a3bad44f0 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:07:13 -06:00
bb5b96a823 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:07:06 -06:00
8466c7273e Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:06:59 -06:00
a871ec8e91 Update tests/test_grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:06:36 -06:00
f7572221db Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 18:06:29 -06:00
8ec2e42833 Online fixes 2025-03-14 23:58:33 +00:00
218d493d11 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 17:15:54 -06:00
1a9f78eb3a Update docs/source/grpo_trainer.md
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 16:57:18 -06:00
a10978ebdf reviews reflected 2025-03-14 22:27:46 +00:00
87fbb831d3 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 14:04:39 -06:00
52f39d6a24 Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-14 13:57:48 -06:00
931f7a14d2 conflict 2 pushes fixed 2025-03-14 19:47:05 +00:00
9951105a90 Merge remote-tracking branch 'origin/flexible_reward' into flexible_reward 2025-03-14 19:36:32 +00:00
5a6e23aac9 review comments reflected + unit test and doc added 2025-03-14 19:28:59 +00:00
d9104c8b0d Update trl/trainer/grpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-13 16:27:55 -06:00
d5a5840307 Remove simple_test.py from version control 2025-03-13 22:23:09 +00:00
f3cbd41e2c interactive reward_func added 2025-03-13 22:09:12 +00:00
d41a32f619 restriction removed from util 2025-03-13 18:58:07 +00:00
fc4dae256d 🫣 [GRPO] add cache_implementation option in GRPO (#3075)
* add cache_implementation option in GRPO

* add cache_implementation to config

* Update trl/trainer/grpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-13 19:21:36 +01:00
e4e5671e80 💎 Gemma 3 SFT example on Codeforces dataset (#3070)
* Gemma 3 and padding free

* remove padding free changes

* style

* update sft cli

* update script

* revert

* style
2025-03-13 10:50:52 -07:00
7c76f103da irrelevant reward ignorance added 2025-03-13 17:39:49 +00:00
aad18ef52a 🎭 Minor spelling fix in documentation (caracteres -> characters) (#3074)
Signed-off-by: Ed Snible <snible@us.ibm.com>
2025-03-13 08:59:24 -07:00
b55d9f0412 Fixing JSD loss computation as per definition (#3043)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-03-13 11:52:50 +01:00
4871c82b0c 🏊 [SFT] Compatibility with padding free and iterable dataset (#3053)
* Compatibility with padding free and iterable dataset

* Fix collator test

* add a test for streaming

* some cleaning

* improve and fix tests

* tiny revert

* bump datasets to 3.0.0
2025-03-12 11:44:25 -07:00
fd9e5a7cab 🦥 Fixed SFTTrainer.compute_loss hang by re-summing before the gather (#3056) 2025-03-12 05:43:33 -07:00
5463e49a55 use argument names with processing_class (#3062) 2025-03-12 13:03:45 +01:00
22759c8208 👯 [GRPO] Relax the assumption that prompts are unique within a batch (#3052)
* Relax the assumption that prompts are unique within a batch

* style
2025-03-11 15:24:06 -07:00
2ee6fd369f 💠 Fixing SFTTrainer.compute_loss crash with accelerate (#3048)
* Fixed crash in SFTTrainer due to accelerator.gather_for_metrics during training

* Moved sum outside of accelerator.gather_for_metrics

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-11 11:08:51 -07:00
844a9c665f 🏁 Passing custom BOS/EOS token to GRPOTrainer.generation_config (#3046)
* Passing custom BOS/EOS token to fallback GRPOTrainer.generation_config

* Reordered kwargs per PR comment

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-11 11:08:33 -07:00
04f6597377 🌡️ Fix temperature inconsistency in GRPO trainer (#3029)
* fix temperature inconsistency in GRPO trainer

* adding 1e-7 isn't necessary

* comment

---------

Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-11 10:36:42 -07:00
e3244d2d09 🚀 Supporting deepspeed>=0.16.4's rename (#2963)
* Added else clause to avoid NameError on optimizer_offload

* Accounted for deepspeed's renaming in 0.16.4

* Switched to packaging.version.parse over the (broken) tuple split

* Moved from NotImplementedError to RuntimeError in else clause
2025-03-05 15:49:21 +01:00
6a02c69789 🎲 Add support for additional generation kwargs in GRPO Trainer (#2989)
* Add support for additional generation kwargs in GRPO Trainer

- Extend GRPOConfig to support additional generation kwargs
- Update GRPOTrainer to incorporate additional generation parameters
- Add tests for training with additional generation kwargs for both standard and vLLM modes

* Add missing vllm_gpu_memory_utilization=0.5

* 🔧 Refactor GRPO generation parameters and configuration

- Restructure GRPOConfig to separate generation parameters
- Add support for top_p, top_k, min_p, repetition_penalty, and length_penalty
- Remove additional_generation_kwargs in favor of explicit parameters
- Update GRPOTrainer to use new generation parameter configuration

* Update tests

* Remove length_penalty and fix tests

* Update defaults and docs

- Change temperature type from Optional[float] to float
- Set default top_p to 1.0 instead of None
- Simplify parameter descriptions by removing redundant "if set to None" text
- Maintain consistent type hints and default values for generation parameters

* GRPO remove optional type hint for temperature parameter

* Remove length_penalty from sampling_kwargs dict in GRPOTrainer

* some refactoring

* top k None support

* change value of in test to make them work

---------

Co-authored-by: Robert Veres <robert.veres@languagetool.org>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-03-05 09:58:00 +01:00
a1c58aa42a 🗜️ Loosened tokenizer type hint on apply_chat_template (#3005) 2025-03-04 17:41:42 +01:00
3f0695a4ca 🌍 Use global normalization for KL logging (to match normalization for loss) (#3004)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-04 17:14:22 +01:00
a72b50b772 📚 Update customization and distributing training documentation (#2991) 2025-03-04 16:37:54 +01:00
ea1d9be2a7 ✌️ Remove double compute of sum in SFTTrainer (#3001)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-04 16:35:30 +01:00
402187baab Improve ci (#3007)
* Create codeQL.yml

* Create custom-queries.qls

* Update custom-queries.qls
2025-03-04 15:53:51 +01:00
5858ceab7e 🪙 [SFT] Log num_tokens and some logging fixes (#3006) 2025-03-04 15:45:11 +01:00
7442d42c21 Update pr_style_bot.yml (#3003) 2025-03-03 19:23:16 +01:00
98de0e7c62 🚀 DeepSpeed integration documentation (#2993)
* ds doc

* Update docs/source/deepspeed_integration.md

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-03-03 14:51:45 +01:00
491921c1a4 🛣️ inference_mode to no_grad when computing old_per_token_logps (#2987) 2025-02-28 22:58:05 +01:00
ad6a35bdd5 🫔 [GRPO] Pass wrapped model to unwrap_model_for_generation for DeepSpeed Stage-3 compatibility (#2871)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-28 18:17:04 +01:00
7bc9858a8f 🔍 Update GRPO config documentation for beta parameter stability (#2992) 2025-02-28 17:39:12 +01:00
b882f57d93 ⚰️ Deprecate liger-kernel (#2949)
* Deprecate liger

* remove import

* oops, shouldn't be here

* Fix other deprecations

* remove liger from gkd for now

* remove liger for teacher

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-02-28 14:58:47 +01:00
ac7bde5832 📑 Fix logged metrics for KTO (#2982) 2025-02-28 14:58:31 +01:00
3d94e4e25c 📜 Update README and doc index (#2986)
* Update readme and doc index

* bold

* consistent uppercase
2025-02-28 13:51:58 +01:00
1a303cca8e 🧬 Fix typo in grpo_trainer.py (#2988) 2025-02-28 13:49:47 +01:00
ac327d5e84 🪪 Adds a more fine-grained profiling context (#2975)
* adds a more fine grained profiling context

* precommit

* fix reward func name

* add reward to RM name

* Update trl/extras/profiling.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* some doc and fixes

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-02-27 21:58:39 +01:00
c0854c32c9 🌌 Fix logits computation in trainer prediction step (#2969)
* Fix logits computation in DPO trainer prediction step

* fix compute_metrics for bco and test

* same for cpo

* same for dpo

* for kto

* and finally orpo

* Apply style fixes

---------

Co-authored-by: kyungdae-jo <kyungdae.jo@navercorp.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-02-27 17:09:11 +01:00
aa18ecfde7 👂 Update learning rate doc in KTOConfig (#2912)
* Update kto_config.py

Fix the mismatch between documentation (and suggested) kto learning rate

* fix doc

---------

Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-27 14:40:54 +01:00
6849c050b9 🕸 Add distributing training guide (#2956) 2025-02-27 14:31:52 +01:00
27a6f2201b 🧗 Add GRPO Trainer support for third-party accelerators (#2836)
* Add GRPO Trainer support for Ascend NPU

* Update grpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* code format

* Update grpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* patch mem_get_info

* style

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-02-27 13:25:24 +01:00
f074dcdc86 👧🏽 Adding DoRA support to model config (#2974) 2025-02-27 12:37:22 +01:00
0caff61600 Update grpo_trainer.py (#2973) 2025-02-27 09:38:32 +01:00
019fc6dbaa 🔢 Fix GRPO doc about num_iterations (#2966) 2025-02-26 12:46:08 +01:00
69ad852e56 Parameterize enable_prefix_caching (#2900)
* parameterize enable_prefix_caching

* apply review suggestion

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-25 00:40:09 +01:00
45ccdefac4 📌 Pin liger-kernel and vLLM (#2952)
* pin liger-kernel

* style
2025-02-25 00:34:16 +01:00
703484a8c2 🗿 Updated DPO default values for alpha and tau (#2918)
* updated DPO default values for alpha and tau

* same for grpo

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-02-25 00:19:48 +01:00
9b76d5f2e9 ↩️ Fix typo in TextEnvironment init param, should be max_tool_response (#2921)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-25 00:08:06 +01:00
cbe0681ba1 📇 GRPO: print completions to console and update docs (#2951)
*  Enhance GRPO logging with configurable completions sampling

- Update `GRPOConfig` to replace `log_completions` with `log_completions_steps`
- Add `print_prompt_completions_sample()` utility function for rich console logging
- Modify `GRPOTrainer` to additionally print 5 random prompt-completion pairs every log_completions_steps steps

* GRPO trainer completions logging, move wandb checks together

* Add rich availability check and use fallback in print_prompt_completions_sample when rich is not available

* Update docstrings on print_prompt_completions_sample

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Revert back to simple log_completions bool

* GRPO log completions fully

* Remove print fallback from print_prompt_completions_sample

* Move accelerator main process check up for grpo log completions

* Explicit variable names in print_prompt_completions_sample

* Make GRPOConfig docstring match field description

* Update log_completions docs again

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update GRPOConfig docs to match field

* improve readability when prompts or completions span multiple lines

* log reward

* prevent hanging, don't print without rich, print reward

* style

---------

Co-authored-by: Robert Veres <robert.veres@languagetool.org>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
2025-02-24 23:53:13 +01:00
4e0cf01aef Prevent applying the chat template to tokenized datasets (#2939)
* Update sft_config.py

* Update sft_trainer.py

* Update sft_config.py

* Update sft_trainer.py

* Apply style fixes

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-02-24 23:14:49 +01:00
5c05913196 🐯 Fix LigerKernel for SFTTrainer (#2940)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-24 17:29:48 +01:00
caba04da42 ☠️ Update max_seq_length to max_length in SFTConfig (#2947) 2025-02-24 16:26:20 +01:00
be5a088337 📋 Add vLLM version to environment printout (#2946) 2025-02-24 14:22:43 +01:00
38861475e6 ♻️ Fix caching in SFT (#2945) 2025-02-24 10:54:39 +01:00
f69707dab4 🐈 Bye bye chat (#2934)
* Bye chat

* better warning

* style error

* Apply style fixes

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-02-23 19:18:28 +01:00
76f00fc394 Ensure precommit exits 0 status 2025-02-23 16:34:54 +00:00
8453017622 🧼 Upgrade ruff (#2938) 2025-02-23 17:33:50 +01:00
3608709529 Update pr_style_bot.yml 2025-02-23 14:32:36 +01:00
21f0055893 🤖 Style bot (#2935) 2025-02-23 14:29:22 +01:00
013d360b8f 🔹 Fix: Miscalculated mask shape in comments (#2925) 2025-02-21 17:01:53 +01:00
e5ae703d35 🐦🔥 6x faster GRPO with multi-step optimization (#2899)
* Add num_updates and epsilon parameters to GRPOConfig and GRPOTrainer

* test sampler

* update the loss computation

* fix eval sampler

* should work now

* buffer inputs with grad accum

* optimize when num_iterations == 1

* test

* minor comment removal and fix log metric

* beta position

* clarify comment [ci skip]

* clarify sampler doc [ci skip]

* fix collision with eval logging

* clarify
2025-02-20 19:51:45 +01:00
a92e00e810 🪪 Adds profiling decorators for GRPOTrainer (#2889)
* adds profiling decorator

* naming + precommit

* style

* revert inclusion of slider table

* revert 2

* revert3

* revert4

* revert 5 fml

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-20 09:57:42 +01:00
9b3c5bf64f 📍 [GRPO] add gradient_checkpointing (#2848)
* add gradient_checkpointing

* added a helper

* Update trl/trainer/grpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/grpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* minor refactor for better readability

* use accelerate util

* enable_input_require_grads is in base class

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-18 18:09:16 +01:00
15fec312d5 🍃 GRPO - Do not load reference model when beta == 0 (#2806)
* 🔧 Optimize GRPO training by conditionally loading reference model based on beta value

*  Add test for GRPOTrainer with beta=0 to ensure no reference model and KL divergence

* 🔧 Refactor GRPOTrainer code for improved readability and maintainability

* 🔧 Simplify per_token_loss calculation in GRPOTrainer for clarity

* fix test, style, and some struct for clarity

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-18 17:57:15 +01:00
be1e34003c 🩳 max_seq_length to max_length (#2895)
* `max_seq_length` to `max_length`

* remove in 0.20
2025-02-18 16:53:37 +01:00
6aaf379a82 ⚰️ Remove deprecated (#2894) 2025-02-18 16:53:21 +01:00
49adf74833 Add vLLM guided decoding support to GRPO Trainer (#2811)
*  Add vLLM guided decoding support to GRPO Trainer

* 🔧 Update vLLM guided decoding in GRPO to use regex parameter

* style and docstring

* test

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-18 16:53:05 +01:00
6c54f023ae 🪂 Don't gather logits in SFT to avoid hanging (#2890)
* Don't gather logits

* Remove unused function and test
2025-02-18 15:31:08 +01:00
963243a7d1 Optimize vllm num_generations (#2855)
* small optimization of vllm batching

* style

* adds comment

* style
2025-02-18 11:44:15 +01:00
aafd8cbea5 🍟 [SFT] Handles the dataset if it has been preprocessed (#2863)
* return dataset if it's preprocessed

* add is_processed flag variable

* add test

* move test_sft_trainer_directly_with_pretokenized_data to Tester2

* Update sft_trainer.py

* no need for padding and truncation

* minor reorganization

* Update trl/trainer/sft_trainer.py

* let the collator pad

* style

* fix tests

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-18 09:56:47 +01:00
822653824b 🧶 [GRPO][vLLM + LoRA] Move unmerge of PEFT model after weight loading (#2873) 2025-02-17 20:34:07 +01:00
ba036576d4 💬 Add maybe_convert_to_chatml map for conversational datasets in SFT (#2862)
* add back get_formatting_func_from_dataset

* maybe_convert_to_chatml

* maybe_convert_to_chatml before maybe_apply_chat_template map

* remove comment

* test

* desc

* style

* Update trl/data_utils.py

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-17 16:47:06 +01:00
293b620950 [GRPO] Fix loss normalization (#2881)
* fix GRPO loss normalization

* fix sum dim

* fix loss= repeated
2025-02-17 13:26:21 +01:00
ae3bd0d07a 🆙 Bump vLLM min version to 0.7.2 (#2860)
Bumps vllm as there were a number of throughput improvements in vllm==0.7.2

It may also resolve issues such as https://github.com/huggingface/trl/issues/2851
2025-02-17 10:54:07 +01:00
6d9fc11fd6 [SFT] fix check for AutoLigerKernelForCausalLM (#2874)
* fix check for AutoLigerKernelForCausalLM

* fix case where AutoLigerKernelForCausalLM is not defined

* update min liger version

* formatting

* fix win CI
2025-02-17 07:50:55 +01:00
ffcb9f4aee ⬆️ Bump dev version 2025-02-13 14:33:44 +00:00
00e5889380 Release: v0.15 2025-02-13 14:28:36 +00:00
5c9cf2003d 👨‍👩‍👧 GRPO + PEFT + vLLM (#2818)
* peft + grpo + vllm

* test change

* support models that are already PEFT

* Update tests/test_grpo_trainer.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-13 15:23:36 +01:00
8830786a23 🪆 Fix for Incorrect ValueError Handling in reward_weights in grpo_trainer.py (#2843)
- Fixed a bug where an extra `len` call inside the error message caused a `TypeError` instead of the expected `ValueError`.
- Replaced `len(len(args.reward_weights))` with the correct `len(args.reward_weights)` to properly calculate the number of reward weights.
- Ensured that a `ValueError` is now raised with an accurate and clear message when the number of reward weights does not match the number of reward functions.

This fix prevents confusion during debugging and ensures proper error handling during validation.

Tested with cases where:
- `args.reward_weights` is None (default case).
- `args.reward_weights` has mismatched lengths with `reward_funcs`.
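
The check described above can be sketched as follows. This is a minimal illustration, not the actual code in `grpo_trainer.py`: the helper name `validate_reward_weights` and the equal-weight default of 1.0 are assumptions made for the example.

```python
from typing import Callable, Optional, Sequence


def validate_reward_weights(
    reward_weights: Optional[Sequence[float]],
    reward_funcs: Sequence[Callable],
) -> list[float]:
    """Hypothetical helper illustrating the validation; not the TRL implementation."""
    if reward_weights is None:
        # Default case (assumed here): weight every reward function equally.
        return [1.0] * len(reward_funcs)
    if len(reward_weights) != len(reward_funcs):
        # The buggy version built this message with len(len(reward_weights)),
        # which raises TypeError because an int has no len().
        raise ValueError(
            f"Number of reward weights ({len(reward_weights)}) must match "
            f"number of reward functions ({len(reward_funcs)})"
        )
    return [float(w) for w in reward_weights]
```

With two reward functions, passing `None` returns equal weights, while passing a single weight raises `ValueError` rather than `TypeError`.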
2025-02-13 13:46:18 +01:00
b0f513c13d Fix PeftModel check when moving weights to vllm (#2850)
This check meant that peft became a required dependency when running GRPO with vLLM.

This PR should resolve this.
2025-02-13 12:23:10 +01:00
81221661c6 Fix GRPO PEFT (#2725) 2025-02-12 18:36:01 +01:00
7347c292c3 🥾 Allow bootstrap GRPO (#2829)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-02-11 18:56:22 +01:00
2106b31298 👴 Update tokenizer parameter to processing_class in tests (#2828) 2025-02-11 11:46:26 +01:00
9b67eea473 🙌 Share vLLM device with training when only 1 available (#2827)
* Fix GPU device selection in GRPOTrainer when training with only one

* update doc

* style

* update warning
2025-02-11 11:30:37 +01:00
e752fc6c2e ⚖️ Add reward weight in multi-reward settings for GRPO (#2676)
* added reward weights for multi-reward runs in GRPO

* reward_weights are float, moved from GRPOTrainer to GRPOConfig

* minor comment fix

* minor

* fix test

* missing link

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-11 11:15:41 +01:00
674bb75f59 🫘 Add set_seed() call in GRPO to ensure unique seed for each process (#2824)
* Add set_seed() function to ensure unique seed for each process

* share seed sampler

* style
2025-02-11 10:30:27 +01:00
b9df81045b 📤 GRPO refactor loading the model weights to vllm (#2817)
* GRPO refactor loading the model weights to vllm

* style

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-10 15:20:38 +01:00
55e680e142 fix: typos in documentation files (#2804) 2025-02-08 20:46:47 +01:00
09eefa73ab ⛰️ Reduce peak vram consumption with efficient selective log_softmax (#2799)
* Reduce mem consumption across many trainers with efficient selective log-softmax approach

* rename

* typo fix

* precommit

* Update tests/test_core.py

* relocate

* precommit

* style

* smaller values for test, and run on cpu

* nit doc improvements

* style

* fix test

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-08 00:59:46 +01:00
7fdb69aa7d Fix GRPO example in README (#2800) 2025-02-08 00:29:26 +01:00
5b9236d1e8 🔬 SFT simplification (#2405)
* initial commit

* update

* Refactor SFTTrainer and SFTConfig

* Update SFTConfig class in sft_config.py

* Fix SFTConfig torch_dtype validation and dataset preprocessing flag

* Refactor dataset mapping and conversion

* Refactor dataset mapping in SFTTrainer

* Fix SFTTrainerTester unit test by removing unnecessary code

* Remove unused variables and update tokenization logic

* Remove pack_dataset function

* Add deprecation warning for tokenizer in SFTTrainer constructor

* add docstring back

* Update model parameter type annotation

* Update SFTTrainer class definition

* style

* preprocess_dataset -> _prepare_dataset

* Retro compat

* Update formatting_func type hint in SFTTrainer constructor

* typo

* better comment

* simplify tokenize row

* Fix type hint for peft_config

* fix doc

* Add pack_examples function to `test_data_utils.py`

* promote pack_examples and document

* improve doc

* Add new SFTTrainerTester2 class for testing

* test was reversed

* ©️ Copyrights update (#2454)

* First changes

* Other files

* Finally

* rm comment

* fix nashmd

* Fix example

* Fix example

* 💬 Fix chat for windows (#2443)

* fix chat for windows

* add some tests back

* Revert "add some tests back"

This reverts commit 350aef52f53f8cf34fccd7ad0f78a3dd63867e06.

* 🆔 Add `dataset_config` to `ScriptArguments` (#2440)

* dataset_config_name

* Update trl/utils.py

* sort import

* typo

* Trigger CI

* Rename `dataset_config_name` to `dataset_config`

* 🏎 Fix deepspeed preparation of `ref_model` in `OnlineDPOTrainer` (#2417)

* Remove unused deepspeed code

* add model prep back

* add deepspeed even if it doesn't work

* rm old code

* 👯 Standardize `model_args` (#2442)

* `model_config` -> `model_args`

* sort

* refactor config

* drop skip prepare dataset

* add sep to packing

* drop prompt-completion for now

* Revert "drop prompt-completion for now"

This reverts commit 16ef195031ac9c860f8f2ac383ff34133fcbe70f.

* Revert "add sep to packing"

This reverts commit dc84d08da7a4b7804c064be1a15605f1770549e2.

* Revert "drop skip prepare dataset"

This reverts commit d2ee070d994a4b29ad33128a8ef99f101994a6c7.

* Revert "refactor config"

This reverts commit f732aa8728e42623ee5817b514263912cab337e4.

* Format

* Update doc-builder workflow to use specific commit sha

* add peft edge cases

* no logits when using liger

* remove unused columns

* proper handle of prompt-completion

* trick to keep messages

* fix messages missing

* for Liger kernel, ensure only input_ids is present

* packing and liger are compatible

* shiny doc and final nits

* another nit

* refactor config and doc

* re add truncation

* fix ci

* drop deprecated params in tests

* fix link

* fix config docstring

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-02-08 00:21:36 +01:00
82d12eb751 📠 Log completions for GRPO (#2772)
* log completions

* typo

* wandb

* Fix completions

* Fix style?

* Remove double import

* Revert

* group logging

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-02-07 12:41:58 +01:00
84d73fd00b 🎯 [SFT] add token accuracy metric (#2597)
* add token accuracy metric

* fix return type

* shift tokens

* use compute_loss so that the model is called only once

* add to logs

* log from main process

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-07 11:09:46 +01:00
2241f17914 🆚 Distinguish padding and eos when they differ (#2793) 2025-02-07 11:08:49 +01:00
cf97133d51 📉 Optimize GRPO memory usage by redefining per_device_batch_size as generations per device (#2776)
* Distribute

* fix some logic errors

* fix and document RepeatRandomSampler

* comment

* doc clarification

* fix type hint

* more readable

* fix eval

* fix tests

* roll back to distribute generation

* improve comment [ci skip]

* fix slice

* catch for eval batch size as well; fix completion_ids in vllm

* log completions

* Revert "log completions"

This reverts commit 1e4af8ffb8dda15d7596e707ac784208db88135a.

* Before the first training step, the model has no optimizer: fix ds3
2025-02-06 20:20:44 +01:00
724acb9716 💡 Add 'Post training an LLM for reasoning with GRPO in TRL' tutorial (#2785) 2025-02-06 18:28:05 +01:00
7134a1e73f Revert "Before the first training step, the model has no optimizer: fix ds3"
This reverts commit bf6e7edea54f2e34b2f6802468ee3224c4aa8030.
2025-02-06 17:19:57 +00:00
bf6e7edea5 Before the first training step, the model has no optimizer: fix ds3 2025-02-06 17:19:05 +00:00
e95f9fb74a 🙃 Fix reward function in GRPO example (#2777) 2025-02-06 09:51:44 +01:00
a85768f120 💡 GRPO vram-efficiency improvement; only compute relevant logprobs (#2773) 2025-02-06 08:52:21 +01:00
78c5ce23fd ↔️ GRPO: Set max_model_len when initializing vLLM instance (#2728)
* Set max_model_len when initializing vLLM instance

* Introduce vllm_max_model_len arg

* Replace vllm args with vllm_init_kwargs

* Update docstring

* Add missing import

* Remove default values from newly deprecated args

* Docs update

* Reverted to adding single arg for max_model_len

* Remove spurious import

* Remove spurious line

* style

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-06 00:12:31 +01:00
af4ad47035 🚧 Add Optional ZeRO-3 Weight Gathering for GRPO in Sequence Generation (#2667)
* Add (grpo) unwrap_model_generation zero3 gathering

* proper placement

* Disabling this option is not compatible with vLLM generation.

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-04 23:24:35 +01:00
b2ae99925d 🔁 🦈 Support iterative GRPO (#2700)
* support for synchronization ref-model added

* support for synchronization ref-model added

* tests for sync_ref_model added

* Update tests/test_grpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* split and fix test

* style

* doc

* move after init to ensure accelerator exists

* Update tests/test_grpo_trainer.py

* style

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-04 23:10:13 +01:00
bd946f93c1 🤖 Properly unwrap torch.compile-ed models in GRPO (#2750)
* properly unwrap torch.compile-ed models with GRPO

* add test and compat with reward models

* ignore test windows

* properly unwrap torch.compile-ed models with GRPO

* add test and compat with reward models

* ignore test windows

* chore: lint

* style

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-04 22:22:10 +01:00
f42e34e613 🔎 Add missing script argument in PPO documentation (#2720) 2025-02-04 21:53:10 +01:00
338fbd546b 📖 Clarification max len in Reward documentation (#2740)
* Nit fix about max_length argument.

* copy to docstring

* typo

* consistency

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-04 21:16:29 +01:00
32f8fa8aad 📐 Add vLLM dtype configuration for GRPO trainer (#2738)
* feat: Add vLLM dtype configuration for GRPO trainer

* added vllm dtype info in docstring

* send to vLLM doc

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-04 21:10:56 +01:00
1a2276402f 📌 vLLM >= 0.7.1 for device fix (#2766)
see https://github.com/huggingface/trl/issues/2745
2025-02-04 20:12:22 +01:00
1f344c9377 💔 Decouple loss computing and generation in GRPO (#2762) 2025-02-04 13:21:51 +01:00
85121fc300 🔂 Use vLLM prefix caching for speedup (#2757)
* use vllm prefix caching for speedup

* comment

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-02-04 11:20:50 +01:00
bbdd6db17c ⚠️ Fix attention masking in GRPO (#2708)
* Update grpo_trainer.py

* Update grpo_trainer.py

* Update grpo_trainer.py

* Slight name change

* Fix typo

* Improve readability + move attn mask to args

* revert adding "completion_"

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-02 20:44:54 +01:00
6e088d165c docs: Fix typos in alias descriptions (#2729) 2025-02-02 11:59:46 +01:00
a325a0eec5 fix: Fix typo in filename in ultrafeedback-prompt.py (#2716) 2025-02-01 14:53:47 +01:00
0ec1ccd990 💰 Fix incorrect calculation in Olivia's baguette spending logic (#2727) 2025-02-01 14:52:08 +01:00
1c35a48b50 🏰 num_logits_to_keep to logits_to_keep (#2721) 2025-01-31 20:19:39 +01:00
2ce36ae889 📖 Nit fix in SFT Documentation (#2722) 2025-01-31 16:46:23 +01:00
bf6919117e Improve GRPO example (#2717) 2025-01-31 12:04:44 +01:00
265663af6a 📖 Add GRPOTrainer to README.md (#2713)
* [DOCS] add GRPOTrainer to README.md

I replaced RLOOTrainer with GRPOTrainer because I thought you might want to keep the list limited, but let me know if you want both.

* Update README.md

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-01-31 10:30:44 +01:00
5ab15d3fef fix: Fix typo in filename Update ultrafeedback.py (#2699) 2025-01-31 10:01:32 +01:00
fecaa991de 📋 Add eval loss logging during prediction in GRPO (#2694)
* add eval loss logging during prediction

* make sure the train and eval logs aren't mixed

* test grpo in eval

* fix tests

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-30 18:37:45 +01:00
ab30a01baf 💡 Add "Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial" (#2697)
* more readable

* add tuto
2025-01-30 17:12:04 +01:00
6dc278a042 ☠️ Remove deprecated (#2692)
* remove deprecated

* remove from test

* remove from test 2
2025-01-30 16:30:40 +01:00
67441bb432 🧠 Fix typo in "understand" in ppo_trainer.md (#2695) 2025-01-30 16:30:24 +01:00
62685fbf20 docs: Fix broken "Good First Issue" link in CONTRIBUTING.md (#2693)
* docs: Fix broken "Good First Issue" link in CONTRIBUTING.md

* Update CONTRIBUTING.md

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-01-30 13:15:37 +01:00
4197956395 🙈 Fixed typo in the GRPO documentation (#2691) 2025-01-30 11:17:02 +01:00
9ac8d9773b 📄 Add GRPO batch size note in docs (#2672)
* add note for OOM error

* update note

* Apply suggestions from code review

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-01-30 09:57:43 +01:00
094d51b599 📖 Docs fix spelling issues (#2682)
* Update alignprop_trainer.md

* Update best_of_n.md

* Update clis.md

* Update community_tutorials.md

* Update cpo_trainer.md

* Update dataset_formats.md

* Update detoxifying_a_lm.md

* Update dpo_trainer.md

* Update rloo_trainer.md

* Update clis.md

* Update rloo_trainer.md
2025-01-30 09:42:14 +01:00
df8f619ec5 📦 trl.templates in excluded packages (#2690) 2025-01-30 09:31:08 +01:00
56880ba73d ⬆️ Bump dev version (#2689) 2025-01-30 09:23:31 +01:00
801582ec24 📉 Use num_logits_to_keep to reduce memory usage in GRPO (#2683)
* use num_logits to keep

* add comment back

* Update trl/trainer/grpo_trainer.py
2025-01-29 17:12:18 +01:00
ed14ed9043 vLLM for fast generation in GRPO (#2600)
* doc

* fsdp

* use vllm config

* vllm

* Update trl/trainer/grpo_config.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/grpo_config.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* typo

* top_k, top_p

* Link to vllm pr

* fix missing device

* fix tests

* fix citation

* fix title and paper_id

* formatting

* output the correct number of generations

* initial async vllm

* fix missing args

* fix prompts

* Pass prompt_token_ids directly

* Repeat each prompt num_generations times

* get the slice of results per processor

* undo citation

* OMG

* nothing can resist me!!!!

* working

* vllm_device to "auto"

* add vllm test

* add initial vllm docs

* add vllm link and pip instructions

* add multi-gpu strategy for vllm

* Update docs/source/grpo_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update docs/source/grpo_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update docs/source/grpo_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* add doc strings

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/grpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add important tag

* fix typo

* overrides default batch size and grad accum and better doc

* Under no circumstances should you examine the contents of this commit.

* auto device, warnings, errors

* better error message

* require_torch_accelerator test vllm

* speeding up training doc

* device as str

* does it prevent deepspeed init to hang?

* update docs

* require torch accelerator for vllm test

* unwrap compat with ds z3

* simplify example in doc

* More comments, fix ds3 hanging

* faster, not sure why

* style

* move doc about speed

* revert change in config files

* fix default value in doc [ci skip]

* style [ci skip]

* better comment [ci skip]

* fix warning

* Update grpo_config.py

* Update deepspeed_zero1.yaml

* Update trl/trainer/grpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/grpo_trainer.md

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2025-01-29 13:01:10 +01:00
4659ad916f 🖊 Fix typos (#2673)
* fix typos

* fix typo

* fix typo

* fix typos

* fix typos

* fix typo

* fix typo

* fix typo

* fix typo

* fix typo

* fix typo

* fix typo

* fix typo

* fix typo

* fix typo
2025-01-28 11:26:36 +01:00
1123bd0f51 🏷️ Add model tags to model trained with GRPO (#2663) 2025-01-26 13:37:15 +01:00
55a329e9f0 🌀 Fix GRPO default completion length doc (#2662) 2025-01-26 10:05:21 +01:00
4720656654 📏 Log completion length in GRPO (#2659) 2025-01-25 20:56:09 +01:00
807046b7d7 📍 Disable caching when grad checkpointing enable in GRPO (#2653)
* disable caching when grad checkpointing

* style
2025-01-25 13:14:34 +01:00
317d2d477b 🔎 Finegrained reward logging for GRPO (#2651) 2025-01-25 11:43:00 +01:00
aeb03cf1a9 👐 DeepSpeed integration for GRPO (#2652) 2025-01-25 10:10:29 +01:00
2578e95023 🚛 Provide all columns of the dataset to the reward function (#2650)
* The reward function is provided with all col from the dataset

* Minor clarifications

* minor renaming in doc [ci skip]

* fix indentation
2025-01-24 20:31:07 +01:00
6f99f42f72 🥞 Fix KTO gradient accumulation loss scaling (#2648) 2025-01-24 16:23:16 +01:00
d14f7f3eb2 🥞 Fix GRPO gradient accumulation loss scaling (#2647) 2025-01-24 16:22:54 +01:00
8e65825d4c 🥞 Fix CPO gradient accumulation loss scaling (#2645) 2025-01-24 12:22:46 +01:00
5e4d7be0e1 Update grpo_trainer.md 2025-01-24 09:06:16 +01:00
f34b70a32e 🌯 Fix context manager runtime error when gather is disabled (#2639) 2025-01-23 21:23:54 +01:00
0e216f7411 🍭 Custom reward function for RLOO (#2612)
* rloo custom reward function and test

* idont even know why i did that

* removing get_reward_custom

* remove get_reward_custom test

* fix code quality check

* adding test

* end this misery already

* fix test
2025-01-23 22:46:37 +03:30
59c201433c 🥞 Fix BCO gradient accumulation loss scaling (#2638) 2025-01-23 18:57:43 +01:00
40c238395e 🥞 Fix DPO gradient accumulation loss scaling (#2615)
* fix DPO for gradient accumulation

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-01-23 18:12:06 +01:00
a1d2955116 🏆 Custom reward function for GRPO and shiny doc (#2606)
* initial commit

* doc on custom reward function

* test

* doc doc doc

* fix collator

* style

* links?

* I need a docdoc 🎵

* fix link

* I do like writing doc tbh

* it takes time, but it's worth it

* no return!

* type hint

* it's probably the best of both worlds [ci skip]

* new doc before implementation

* tests

* more doc

* style

* multiple pretrained funcs

* fix arg name

* main?

* example for R1

* fix script

* clearer

* import [ci skip]

* Update docs/source/grpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-01-23 17:39:45 +01:00
887c1f3fa3 💎 Rename an inner var in GRPO to improve clarity (#2616)
* rename advantages to per_token_loss for clarity

* doc ci
2025-01-23 17:30:22 +01:00
949db2357e 👋 Drop MDX (#2611) 2025-01-23 13:38:15 +01:00
fe4b5efe4e ✂️ Reintroduce truncation_mode in DPOTrainer (#2551)
* reintroduce truncation mode in DPOTrainer

* move truncation_mode in dataset.map invocation

* truncate full sequence

* "." [ci skip]

* Empty commit

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-22 15:33:50 +01:00
a9b54a852e 🫷 Include stop token in policy model's generation_config (#2528)
* Include stop token in policy model's generation_config

* Fix formatting

* Update trl/trainer/ppo_trainer.py

* Update trl/trainer/ppo_trainer.py

* don't modify args

* clarify doc

* more nice doc

* missing no [ci skip]

* really don't modify args

* oups

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-22 12:24:42 +01:00
d4222a1e08 🧩 PPO/RLOO/OnlineDPO sequence generation: make deepspeed 3 weight gathering optional (#2557)
* PPO/RLOO/OnlineDPO: add ds3_gather_for_generation argument to control weights gathering for generation

* code formatting

* rephrase and document

* more doc

* style [ci skip]

* Trigger CI

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-21 22:44:18 +01:00
a5c88d6c75 Add uv installation instructions (#2601)
* add uv

* Update docs/source/installation.mdx

* Update docs/source/installation.mdx

* pypi -> PyPI

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-21 22:09:18 +01:00
b6a084c46e 💾 Reduce memory peak in GRPO by adding max_prompt_length and loop usage in logp computation (#2598)
* add max_prompt len to config

* truncate prompt and compute log probs line by line
2025-01-21 15:12:04 +01:00
d9f056862f 🧰 Tool fine-tuning support DPO (#2479)
* adding tool fine-tuning support for DPO

* precommit

* adding test for DPOTrainer with tool usage

* style

* fix test

* a comment

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-21 09:32:31 +03:30
3d2c1e49b1 Fix merge error (#2595) 2025-01-20 22:17:39 +01:00
5fd78367ae 🫣 Ignore CLI test for Python 3.9 (#2592)
* ignore cli test for python 3.9

* move import inside tests
2025-01-20 21:26:11 +01:00
0f5ffad26e 👨‍👨‍👧‍👧 GRPO (#2565)
* init grpo [ci skip]

* initial version

* refine args defs

* model card

* initial doc

* fix badges

* fix spaces

* try link to super in doc

* temperature, fix indexing, and std=0.0

* grpo script for cli

* peft support

* move data preparation in `compute_loss`

* weird doc trial

* fix device and some logging

* unwrap_model_for_generation for distributed setting

* Compat with distrib training

* revert grpo config doc trial (didn't work)

* test

* allow model to be str and processing_class to be none; fix loss computation

* advantage is always 0.0: don't log

* fix peft not installed

* proper reward model for testing

* fix script for cli

* add trl grpo to cli doc

* test peft

* flush left

* fix reward calculation

* new reward model

* support any reward model

* fix reward processing class def

* log reward std

* fix reward logging

* fix grad computation

* skip embed layer in test

* remove optimizer_cls_and_kwargs

* improve GRPO default args

* reduce mem usage for grpo test

* reduce mem usage in test grpo

* reduce memory usage for test

* Fix the test

* remove redundant

* fix min version

* Update test_grpo_trainer.py

* Update test_grpo_trainer.py

* Fix test, finally found the solution!

* some doc

* Update doc-builder workflow to use specific commit sha

* more doc

* advantages

* drop cancel for no grad

* logged metrics [ci skip]

* completion col is ignored [ci skip]

* fix latex

* double space? ~?

* try a latex fix

* with branch

* Empty commit

* Empty commit

* double space seems to be the solution
2025-01-20 19:02:15 +01:00
88514d51e3 Update reducing_memory_usage.md 2025-01-18 21:12:25 +01:00
76837e82b9 🎞️ Fix documentation SFT -max_seq_length instead of max_length (#2590) 2025-01-18 21:10:33 +01:00
35553930da 🫢 Add max_prompt_length parameter in tests (#2588)
* Add max_prompt_length parameter to tokenizer

* style [ci skip]
2025-01-17 19:40:38 +01:00
fd4b283b82 ✂️ Truncate by default (#2587)
* set default for max_length and max prompt length and add guidelines for defaults

* remove dep kwargs

* truncate prompt in prm

* Update CONTRIBUTING.md [ci skip]
2025-01-17 17:03:41 +01:00
1b1140aa69 [RLOO] fix token_level_kl (#2575)
* fix token_level_kl

* fix non_score_reward and rlhf_reward

* add rloo test

* update test

* fix docs

* fix doc
2025-01-17 14:59:25 +01:00
4c7eb6fe29 🐛 Simplify bug report template (#2585) 2025-01-17 14:40:37 +01:00
564fc86759 Update issue_auto_labeller.yml [ci skip] 2025-01-17 14:10:33 +01:00
3215a1c586 Update issue_auto_labeller.yml 2025-01-17 13:59:14 +01:00
cdc16f3ac6 🔖 Issues Auto-Labeller (#2542)
* Initial commit for auto labeller

* Using HF instead of openai

* secrets name change

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-01-17 13:46:24 +01:00
2ecd53ad77 🏎️ vLLM for Online DPO (#2558)
* vllm online dpo

* new arg and add back generation config [skip ci]

* import utils

* optional import and comment

* is_vllm_available

* support conv and not conv [ci skip]

* add old code back

* use func [skip ci]

* fix _generate call

* fix and dedicated func

* top k 50

* style

* add import error

* new testing model

* Update OnlineDPOTrainer class with new features

* test vllm

* fix generate tiny script

* max len arg

* fix comment [ci skip]

* revert num_return_sequences

* vllm dep

* Add require_torch_accelerator import and skip test if vllm is not available

* proper require_torch_accelerator

* add vllm section

* Add hfoption sections to speeding_up_training.md

* no, an id

* Update vllm dependency to exclude Windows platform

* Note on future release

* style
2025-01-17 11:39:13 +01:00
5877786b5a 🪄 Minor comment style modif (#2582) 2025-01-17 11:12:00 +01:00
57d9a97394 Refine model card method docstring (#2566)
* refine model card docstring

* bco

* prm
2025-01-13 15:58:01 +01:00
751fb1d84b 🏛️ Improve DPO configuration documentation structure (#2561)
* better structure dpo config

* fix tests

* fix regex

* add contributing guidelines
2025-01-12 15:23:19 +01:00
edabe0a2d8 [RLOO] Reinforce++ (#2552)
* Reinforce++

* formatting

* fix link
2025-01-09 12:09:29 +01:00
abfffc510b 💔 Fix dataset type unpair conversion docs (#2550)
Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
2025-01-08 19:33:05 +01:00
ed7de87dc7 🎴 Add readme for datasets (#2491)
* adding readme for ultrafeedback dataset

* using ModelCard as DatasetsCard like hf datasets is understaffed

* more info in readme.md of the dataset

* generated readme for all dataset scripts

* precommit

* fixing test

* md format; corrections; generation script link

* some collections

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-08 17:25:51 +01:00
beb892bfe0 ↩️ Revert ORPO loss changes (#2527)
* revert orpo changes

* add comment
2025-01-08 16:13:20 +01:00
f2d42fa0c2 🔠 Fix SFT truncation documentation (#2521)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-01-08 15:35:49 +01:00
d6a7e9d6f5 ℹ️ XPU support for DPO (#2533)
* add xpu support

* bug fix

* remove header

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix import and use the util

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-08 15:32:03 +01:00
451677203d 🕊️ DPO padding free (#2520)
* padding free

* specify dtype

* test

* warnings when not flash attention

* fix test

* remove

* docstring padding-free

* flash-attn dep

* Stronger warning

* require_flash_attn in test

* flash-attn in CI

* rm flash-attn from dep

* Remove flash-attn dependency from test workflows

* refactor

* Update .github/workflows/tests.yml

* Update trl/trainer/dpo_trainer.py

* drop require flash-attn

* fix dtype

* refine warning

* Update trl/trainer/dpo_config.py

* Add logic to compute mean logits for chosen and rejected tokens with padding-free

* format

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

* fix comment [ci skip]

* fix num logits to keep
2025-01-08 09:22:17 +01:00
2f25f54ab9 ✒️ Fix typo in formatting_func's documentation in ConstantLengthDataset (#2549) 2025-01-07 21:26:28 +01:00
a50124dd3a 🧑‍🤝‍🧑 Proper metrics gathering across ranks before logging (#2474)
* dpo_trainer gather metrics across ranks before logging

according to https://github.com/huggingface/trl/issues/2468

* fix everywhere

* gather_for_metrics

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2025-01-07 15:05:54 +01:00
1d23ecc36f ©️ Update copyrights year (#2547)
* happy new year

* fix wandb import sort
2025-01-07 14:53:09 +01:00
52d213173f 🚜 Use field in dataclasses (#2494)
* in hh-rlhf-helpful-base

* delete tokenize ds

* dataset scripts

* alignprop

* judge tldr

* ddpo

* zen

* sft video

* literal to choices

* chat

* script args

* alignprop

* bco

* better help format

* cpo

* ddpo

* whether or not -> whether

* dpo

* don't set the possible values

* `Optional[...]` to ... or `None`

* xpo

* gkd

* kto

* nash

* online dpo

* Fix typo in learning rate help message

* orpo

* more ... or `None`

* model config

* ppo

* prm

* reward

* rloo

* sft

* online policy config

* make style
2025-01-06 18:29:09 +01:00
d9ee2fd202 Remove graph breaks for torch.compile() in padding free branch in DataCollatorForCompletionOnlyLM (#2158)
* feat: Add info to batch in DataCollatorForCompletionOnlyLM

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix: formatting

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* feat: Add info to batch in DataCollatorForCompletionOnlyLM

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix: formatting

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix: max_length_k to int

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix: Added comments

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* test cases

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* test cases

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* test cases

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* feat: Add info to batch in DataCollatorForCompletionOnlyLM

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix: formatting

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* feat: Add info to batch in DataCollatorForCompletionOnlyLM

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* test cases

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* test cases

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* test cases

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* unit test changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* style

* add test

* remove test

---------

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-01-06 15:50:29 +01:00
763738f457 ☄️ Update Comet integration to include LogCompletionsCallback and Trainer.evaluation_loop() (#2501)
* Implemented integration with Comet in `LogCompletionsCallback`. Implemented related integration test.

* Implemented integration with Comet in `CPOTrainer.evaluation_loop()` during logging of `game_log` table.

* Implemented integration with Comet in `CPOTrainer.evaluation_loop()` during logging of `game_log` table.

* Implemented integration with Comet in `DPOTrainer.evaluation_loop()` during logging of `game_log` table.

* Implemented integration with Comet in `BCOTrainer.evaluation_loop()` during logging of `game_log` table.

* Implemented integration with Comet in `KTOTrainer.evaluation_loop()` during logging of `game_log` table.

* Implemented integration with Comet in `ORPOTrainer.evaluation_loop()` during logging of `game_log` table.
2024-12-28 18:35:01 +01:00
aed5da580e 📦 Packing documentation (#2503) 2024-12-22 12:44:07 +01:00
99451b421a 👬 Rename collator PreferenceCollator to DataCollatorForPreference (#2510) 2024-12-22 12:43:55 +01:00
5239b9462d 💧 Generalize disable_dropout (#2511) 2024-12-22 12:19:17 +01:00
8fb267ff1e 👨‍🍳 Clarify DPO data preparation (#2512) 2024-12-22 12:18:22 +01:00
2e1adbb6ff Remove RLOO example test (#2513) 2024-12-22 12:16:14 +01:00
b668048fe1 Update community_tutorials.md (#2509)
* Update community_tutorials.md

* Update community_tutorials.md
2024-12-20 17:40:42 +01:00
8c49ea39ec 🏚 Remove unused components (#2480) 2024-12-19 19:29:39 +01:00
88ad1a099c fix orpo chosen-nll loss (#2502) 2024-12-19 11:33:06 +01:00
9908dda6d9 🗂️ Reorganize documentation (#2483)
* reorganize doc

* consistent ing

* Add reducing_memory_usage.md

* integration with peft

* Add new files and update table of contents

* Add speeding_up_training.md to docs/source and update _toctree.yml

* unsloth

* Liger kernel

* Truncation

* Update truncation parameters for DPO and SFT

* dedicated Integration section

* clarify

* illustrate

* Sort

* badge for prm
2024-12-18 16:28:11 +01:00
5e204e1eaa 🏞️ Proper dataset for documentation images (#2499)
* first images

* almost all!

* Final

* Some were missing
2024-12-18 11:28:45 +01:00
82cfeb8930 🤩 Add SmolVLM tutorials to Community Tutorials page (#2498) 2024-12-17 23:31:34 +01:00
0fe73a8af5 🗣️ Improve prose for smol course (#2487) 2024-12-16 11:17:29 +01:00
33fb9efc43 ⚰️ Remove deprecated (#2485) 2024-12-15 21:02:59 +01:00
f68d11f9f9 Bump version 2024-12-15 19:56:54 +01:00
aeca63774f 👨‍🏫 smol course links and badges (#2484)
* smol course links and badges

* try without space

* revert space
2024-12-15 19:38:48 +01:00
117c6d4b52 📥 Fix missing BitsAndBytesConfig import in doc (#2478) 2024-12-15 16:54:38 +01:00
6d4ed070f1 ☄️ Add support for Comet experiment management SDK integration (#2462)
* Added support for Comet URL integration into model cards created by trainers.

* Moved `get_comet_experiment_url()` into utils.py

* Updated Comet badge in the model card to use PNG image instead of text.

* Fixed a bug that occurred when saving the model while running the PPO example. The error was: 'GPTNeoXForCausalLM' object has no attribute 'policy'. Added a guard that checks the `policy` attribute exists.

* Implemented utility method to handle logging of tabular data to the Comet experiment.

* Implemented logging of the completions table to Comet by `PPOTrainer`.

* Implemented logging of the completions table to Comet by `WinRateCallback`.

* Implemented logging of the completions table to Comet by `RLOOTrainer` and `RewardTrainer`.

* Restored line to the main branch version.

* Moved Comet-related utility methods into `trainer/utils.py` to resolve a merge conflict with the master branch.

* Update trl/trainer/utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Implemented raising of `ModuleNotFoundError` error when logging table to Comet if `comet-ml` is not installed.

* import comet with other imports

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-12-13 22:08:10 +01:00
cd7156fb34 👀 Add "PaliGemma 🤝 Direct Preference Optimization" in community tutorials (#2475) 2024-12-13 20:29:35 +01:00
ca850be0a2 🕹️ CLI refactor (#2380)
* Refactor main function in dpo.py

* Update setup.py and add cli.py

* Add examples to package data

* style

* Refactor setup.py file

* Add new file t.py

* Move dpo to package

* Update MANIFEST.in and setup.py, refactor trl/cli.py

* Add __init__.py to trl/scripts directory

* Add license header to __init__.py

* File moved instruction

* Add Apache License and update file path

* Move dpo.py to new location

* Refactor CLI and DPO script

* Refactor import structure in scripts package

* env

* rm config from chat arg

* rm old cli

* chat init

* test cli [skip ci]

* Add `dataset_config_name` to `ScriptArguments` (#2440)

* add missing arg

* Add test cases for 'trl sft' and 'trl dpo' commands

* Add sft.py script and update cli.py to include sft command

* Move sft script

* chat

* style [ci skip]

* kto

* rm example config

* first step on doc

* see #2442

* see #2443

* fix chat windows

* ©️ Copyrights update (#2454)

* First changes

* Other files

* Finally

* rm comment

* fix nashmd

* Fix example

* Fix example [ci skip]

* 💬 Fix chat for windows (#2443)

* fix chat for windows

* add some tests back

* Revert "add some tests back"

This reverts commit 350aef52f53f8cf34fccd7ad0f78a3dd63867e06.

* 🆔 Add `dataset_config` to `ScriptArguments` (#2440)

* dataset_config_name

* Update trl/utils.py [ci skip]

* sort import

* typo [ci skip]

* Trigger CI

* Rename `dataset_config_name` to `dataset_config`

* 🏎 Fix deepspeed preparation of `ref_model` in `OnlineDPOTrainer` (#2417)

* Remove unused deepspeed code

* add model prep back

* add deepspeed even if it doesn't work

* rm old code

* Fix config name

* Remove `make dev` in favor of `pip install -e .[dev]`

* Update script paths and remove old symlink related things

* Fix chat script path [ci skip]

* style
2024-12-13 17:52:23 +01:00
179ba53671 🐾 Process-supervised RM Trainer (#2127)
* initial skeleton

* tokenize fn

* adding bos and eos to tokenization fn

* prmtrainer

* fixing small typo in tokenize

* typo in input_ids and labels construction

* numpy dimension

* introduce the stepwise reward trainer

* update markdown files

* let user decide post step separator in config

* doc post_step_separator

* do not add post step_tokens to last step of the reasoning process

* renaming prm to stepwisereward

* formatting

* fix tokenize kwargs

* adapt test to the new post_token args

* adding example script

* fix small typo

* add create_model_card and renaming

* fixing booleans

* Adding the new stepwise_preference instead of placeholders for datasets

* formatting

* Update docs/source/_toctree.yml

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update examples/scripts/stepwise_reward_modeling.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* update push to hub

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* step_separator can't be None

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix suggested typos

* add citation

* reformat doc

* reordering init

* push to hub prm800k

* changing dataset in example

* change dataset format to align with the sky is blue example

* fix tokenization column names

* fix num labels in openai example

* add support for conversational dataset

* remove trailing whitespace

* replace tokenizer with processing class

* Update docs/source/dataset_formats.mdx

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* remove openai_prm800k

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/stepwise_reward_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update docs/source/stepwise_reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/stepwise_reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* renaming

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* renaming

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* minor renamings in docs

* using prm800k instead of openai_prm800k

* update num labels to 2 following the new format

* changing doc examples to math examples

* change reference to dataset_formats.mdx

* changing dataset config in test

* remove conversational dataset support

* remove conv dataset support

* fix bos token

* fix scriptarguments in example

* completion to completions

* remove ValueError for step_separator inside steps

* run precommit

* remove conv dataset support

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* renaming zen dataset

* remove unused printing

* unknown label column

* introduce the train on last step arg

* _tokenize support train_on_last_step

* incorporate train_on_last_step to tests

* formatting

* remove comments in trainer

* Refactor `tokenize_row`

* Update max_completion_length parameter in StepwiseRewardConfig

* Collator

* Update comment

* Update type hint

* fix table

* Remove collator

* don't need pad token id

* add error back

* max length args

* use tokenizer arg

* Update doc

* label -> labels

* fixing tokenization issues in tokenize row

* correct labels for token classification

* adding max_length to tokenize_row

* reformat tests

* adding tests for tokenize row

* fixing typos in comments

* update doc

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Add math_shepherd.py script for dataset processing

* split the dataset

* formatting

* same evaluation method for the two training methods

* adding filtering to example script

* formatting

* Add features to avoid casting labels to bool in dataset tokenization

* Update docs/source/stepwise_reward_trainer.mdx [ci skip]

* Add learning_rate parameter to StepwiseRewardConfig class

* update doc

* Remove unused setup_chat_format function

* Fix warning message in stepwise_reward_modeling.py

* Update logging steps in stepwise_reward_trainer.mdx

* little doc change [ci skip]

* Fix copyrights

* fix space after copyrights

* Update dataset loading in stepwise_reward_modeling.py

* refine compute_accuracy and proper test

* fix tests

* style

* renamings

* renaming in init

* doc renaming

* fix sorting and tag

* experimental [ci skip]

* trigger CI

* other doc fix

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-12-13 15:56:10 +01:00
e3e171a26b 🔨 Support for tools for data utils (#2455)
* function calling training support for SFTTraining

* adding tool support to data_utils

* adding test for function calling tokenizer

* reverting changes to sfttrainer and config,added maybe_apply_chat_template

* arg for maybe_apply_chat_templates docstring

* Doc sectioning

* minor test modification

* minor doc modification

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-12-12 17:11:50 +01:00
b3aff441ff 🎞️ Add "Fine-tuning open AI models using Hugging Face TRL" YouTube video to community tutorials (#2467) 2024-12-12 16:40:28 +01:00
efc687db62 🛠️ Update tests and fix PPO (#2463)
* [bugfix] critic not updated

* Update ppo_trainer.py

* Update ppo_trainer.py

* add failing test

* test both policy and critic

* formatting

* fix tests

* formatting

* Update tests/test_ppo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix test

---------

Co-authored-by: NINGBENZHE <53843873+NINGBENZHE@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-12-12 12:53:32 +01:00
f2e362656c ⚖️ Add tests_latest.yml workflow file (#2457)
* Add tests_latest.yml workflow file

* don't check the branch

* Fix workflow
2024-12-11 18:11:41 +01:00
c9c4f18039 [bugfix] Fix DataCollatorForChatML unexpected generation prompt (#2450)
* [bugfix] Fix DataCollatorForChatML unexpected generation prompt

* Update utils.py

* Update test_utils.py

* Update tests/test_utils.py

* Update tests/test_utils.py

* Update tests/test_utils.py

* Update tests/test_utils.py

* Update test_utils.py

* Update tests/test_utils.py

* Update tests/test_utils.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-12-11 15:18:54 +01:00
460e780265 👯 Standardize model_args (#2442)
* `model_config` -> `model_args`

* sort
2024-12-10 12:51:20 +01:00
7ba118a229 🏎 Fix deepspeed preparation of ref_model in OnlineDPOTrainer (#2417)
* Remove unused deepspeed code

* add model prep back

* add deepspeed even if it doesn't work

* rm old code
2024-12-10 12:40:13 +01:00
6a05feff02 🆔 Add dataset_config to ScriptArguments (#2440)
* dataset_config_name

* Update trl/utils.py [ci skip]

* sort import

* typo [ci skip]

* Trigger CI

* Rename `dataset_config_name` to `dataset_config`
2024-12-10 11:09:26 +01:00
2f72f47191 💬 Fix chat for windows (#2443)
* fix chat for windows

* add some tests back

* Revert "add some tests back"

This reverts commit 350aef52f53f8cf34fccd7ad0f78a3dd63867e06.
2024-12-10 10:40:23 +01:00
9410874787 ©️ Copyrights update (#2454)
* First changes

* Other files

* Finally

* rm comment

* fix nashmd

* Fix example

* Fix example [ci skip]
2024-12-10 10:40:00 +01:00
9c5388b69e 🔗 Add "Open in Colab" badges in community tutorials page (#2441) 2024-12-06 10:51:55 +01:00
b02189aaa5 🗂️ Harmonize run and example batch sizes in RLOO docs (#2439)
The doc lists a different grad_accumulation_steps and per-device batch size than the actual hyperparameters, as can be verified from the wandb run.
2024-12-04 19:19:14 +01:00
52201d3c18 🧮 Fix max_steps calculation in RLOOTrainer (#2433) 2024-12-03 21:31:32 +01:00
9ff79a65e3 🔮 Fix unused precomputed ref log probs in DPO (#2431) 2024-12-03 11:36:57 +01:00
9001a8682c 📑 Refactor TrlParser (#2412)
* refactor parser

* Only document some methods

* Update imports in cli_utils.py and remove config option in utils.py

* add `test_parse_args_and_arg_override_config` and remove unnecessary mocks [ci skip]

* fix comment [ci skip]

* fix comment [ci skip]

* Extra arg in config also returned

* fix docstring [ci skip]

* add mock back

* use `deprecate_kwarg`
2024-12-02 19:57:35 +01:00
f6f42651e2 🧑‍🍳 Add precompute batch size argument in DPOTrainer for reference model (#2426)
* added precompute_batch

* review-fixes

* moving up

* Update trl/trainer/dpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_config.py

* Update trl/trainer/dpo_config.py [ci skip]

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-12-02 17:17:41 +01:00
148b592313 Update modeling_base.py (#2419) 2024-11-30 12:14:36 +01:00
d6a8f2c2f6 ⚠️ Add warning guidelines and update codebase to follow best practices (#2350)
* Add guidelines for working with warnings in the codebase

* Remove unnecessary warnings and improve code initialization

* Fix warnings and improve accuracy calculation

* Add rich library dependency for text formatting

* Update LoRA weight loading warning message

* Fix logging and import issues in AlignPropConfig

* Fix warnings and improve code readability

* Remove unused import statements

* Refactor CPOTrainer class in cpo_trainer.py

* Remove unnecessary warnings and raise ValueError for missing model

* Fix warnings and improve code consistency

* Update CONTRIBUTING.md to clarify the purpose of warnings

* Fix string formatting in DataCollatorForCompletionOnlyLM class

* Update SimPO loss parameters in CPOTrainer

* Fix warnings and remove unnecessary code in ConstantLengthDataset class

* Clarify warning guidelines

* Rewrite the entire section

* Fix capitalization in CONTRIBUTING.md

* Fix formatting in CONTRIBUTING.md
2024-11-29 16:07:38 +01:00
8d9cfaafeb 🌋 Add support for LLaVA-Next in DPOTrainer (#2413)
* add support for llava-next in dpotrainer

* enable unit test

* code style

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Ignore last layer in test

---------

Co-authored-by: zesong.cwz <zesong.cwz@taobao.com>
Co-authored-by: 1rubbishyuan <2773496952@qq.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-11-29 15:53:50 +01:00
94e4135a17 🔓 Remove lm_head check in AutoModelForCausalLMWithValueHead (#2398)
* Remove lm_head check in `AutoModelForCausalLMWithValueHead`

* Style

* Remove test
2024-11-29 15:52:35 +01:00
ac267781ec 🌐 Community Tutorials (#2411)
* Add community notebooks to API documentation

* fix extension

* add table of community tutorials

* respond to feedback - fix links and split table

* add class references

* rename file and update toc

* Update docs/source/community_tutorials.md

* Update docs/source/community_tutorials.md

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-11-29 11:39:37 +01:00
2c6e0d9705 Add note about special tokens in chat templates for LoRA SFT (#2414) 2024-11-29 10:35:39 +01:00
e1d781353b 👁️ Added SFT support for SmolVLM models via standalone script sft_vlm_smol_vlm.py (#2409)
* Added SFT VLM script for SmolVLM

* Run make precommit

* Updated command example
2024-11-28 18:45:37 +01:00
a34e9bf84f 🖨 Add Script Utilities section to the documentation (#2407)
* Add script_utils.md to the documentation

* Refactor ScriptArguments class documentation

* Refactor TrlParser class to improve code organization and readability
2024-11-28 16:43:08 +01:00
c10cc8995b 🗝️ Update type hints (#2399)
* New type hint structure

* Update type hints

* Delete wrong file

* Remove dict import
2024-11-26 20:37:27 +01:00
9368dccef6 🐢 Fix slow tests (#2397)
* fix slow CI

* fix dpo

* formatting

* Apply suggestions from code review

* `setup_chat_format` may add a pad token

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-11-26 15:38:46 +01:00
43df3a485a 🧳 Move zen generation script and fix tests (#2393)
* Move zen

* step -> stepwise_supervision

* Fix train_test_split shuffle issue

* Fix tests

* Update tests/test_sft_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Fix typo in key name

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-11-26 14:08:06 +01:00
baee06f2e8 🖋️ Fix warning message formatting in KTOTrainer (#2394) 2024-11-26 13:05:25 +01:00
bbd8cbb720 🤐 Fix deprecation warnings (#2395) 2024-11-26 11:29:07 +01:00
4f937c7629 🤐 Fix deprecation warnings (#2392) 2024-11-26 11:18:43 +01:00
16fa13ce72 👮 Deprecate policy in favor of model in PPOTrainer (#2386) 2024-11-26 08:13:10 +01:00
453db5cd79 🤏 New models for tests (#2287)
* first commit

* uncomment

* other tests adaptations

* Remove unused variable in test_setup_chat_format

* Remove unused import statement

* style

* Add Bart model

* Update BCOTrainerTester class in test_bco_trainer.py

* Update model IDs and tokenizers in test files

* Add new models and processors

* Update model IDs in test files

* Fix formatting issue in test_dataset_formatting.py

* Refactor dataset formatting in test_dataset_formatting.py

* Fix dataset sequence length in SFTTrainerTester

* Remove tokenizer

* Remove print statement

* Add reward_model_path and sft_model_path to PPO trainer

* Fix tokenizer padding issue

* Add chat template for testing purposes in PaliGemma model

* Update PaliGemma model and chat template

* Increase learning rate to speed up test

* Update model names in run_dpo.sh and run_sft.sh scripts

* Update model and dataset names

* Fix formatting issue in test_dataset_formatting.py

* Fix formatting issue in test_dataset_formatting.py

* Remove unused chat template

* Update model generation script

* additional models

* Update model references in test files

* Remove unused imports in test_online_dpo_trainer.py

* Add is_llm_blender_available import and update reward_tokenizer

* Refactor test_online_dpo_trainer.py: Move skipped test case decorator

* remove models without chat templates

* Update model names in scripts and tests

* Update model_id in test_modeling_value_head.py

* Update model versions in test files

* Fix formatting issue in test_dataset_formatting.py

* Update embedding model ID in BCOTrainerTester

* Update test_online_dpo_trainer.py with reward model changes

* Update expected formatted text in test_dataset_formatting.py

* Add reward_tokenizer to TestOnlineDPOTrainer

* fix tests

* Add SIMPLE_CHAT_TEMPLATE to T5 tokenizer

* Fix dummy_text format in test_rloo_trainer.py

* Skip outdated test for chatML data collator

* Add new vision language models

* Commented out unused model IDs in test_vdpo_trainer

* Update model and vision configurations in generate_tiny_models.py and test_dpo_trainer.py

* Update model and tokenizer references

* Don't push if it already exists

* Add comment explaining test skip

* Fix model_exists function call and add new models

* Update LlavaForConditionalGeneration model and processor

* `qgallouedec` -> `trl-internal-testing`
2024-11-25 16:31:56 +01:00
ee3cbe1946 💾 Deprecate config in favor of args in PPOTrainer (#2384) 2024-11-25 14:48:08 +01:00
17e8060984 📦 Support for packing tokenized datasets for SFT (#2011)
* feat: add support for packing tokenized datasets

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* fix: address review comments

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* feat: add tests for pretokenized dataset packing

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

---------

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
2024-11-25 10:36:58 +01:00
163695e85c 🙈 Suppress warning for estimating tokens in trainers (#2389)
* Suppress warning for estimating tokens in trainer

* Suppress warning for estimating FLOPs in ORPO and Reward trainers
2024-11-24 16:55:43 +01:00
672c96546d Update log method to include start_time parameter (#2381) 2024-11-21 21:30:10 +01:00
bdeb117320 📝 Fix typo in dataset generation script (#2379) 2024-11-21 20:37:44 +01:00
6578fdc101 🔀 Add MergeModelCallBack (#2282)
* Create mergekit_utils.py

* adding mergekit as an optional dependency

* adding MergeModel to callbacks

* adding mergekit_utils dependencies to callbacks

* setting lower bound for mergekit

* setting mergekit lower bound to 0.0.5.1

* adding support for MergeModelCallBack __init__.py

* adding support for mergemodelcallback

* mergemodelcallback tests

* Update callbacks.py

* Update __init__.py

* Update __init__.py

* Update test_callbacks.py

* Update trl/trainer/callbacks.py

removing ## from docs

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/callbacks.py

removing ## from docs

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/callbacks.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* using different dataset for tests

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/mergekit_utils.py

adding types

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/mergekit_utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* replacing get_last_checkpoint

* renaming Merge to merge_models

* setting mergers default value to linear

* removing unnecessary docs and comments

* adding docstring to Mergeconfig

* adding mergekits link to docstring

* precommit

* removing duplicated import

* typos in mergekit_utils docstring

* fixing tests

* making mergemodelcallback tests optional

* Make import optional

* minor

* use tmp dir in test

* sort

* Add import error checks for mergekit extra

* use a common _merge_and_maybe_push method and compat with windows path

* debug windows

* Update dependencies for mergekit and add test dependencies

* Add assertion to check if merged folder exists in the last checkpoint

* Fix temporary directory cleanup in test_callbacks.py

* Add sys import and skip test for Python versions below 3.10 due to cleanup errors with temp dir

* revert change for debug

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-11-21 14:06:45 +01:00
a0066f47f8 Add start_time to _maybe_log_save_evaluate (#2373) 2024-11-20 12:49:49 +01:00
5626806aef 🧲 Use our own require_bitsandbytes (#2370)
* use our own require_bitsandbytes

* rephrase
2024-11-20 11:51:05 +01:00
bb0afc2459 remove redundant call to eval and train (#2372) 2024-11-20 11:24:41 +01:00
066fc37bd3 Fix dev install (#2369) 2024-11-19 13:30:09 +01:00
b80c1a6fb8 🎲 Move random judges in testing utilities (#2365)
* Update judges and testing utilities

* Update judges in test files

* Update judges in test files
2024-11-18 18:43:18 +01:00
b5eabbeb07 🤝 Mixture of judges (#2159)
* base judge

* adding mixture of judges

* update doc

* update doc

* formatting

* fix small typo in doc

* fix randomconstraintjudge

* replace arxiv by hf papers

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* formatting

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix naming in __init__

* run precommit

* adding gold answers to judges

* cgpo llm judges

* fix init

* output type

* adjust booleans in test

* adapt moj doc

* renaming and removing factuality and safety judges

* fix typo in import

* fix small typo in naming

* formatting

* Update trl/trainer/judges.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* update parameter name

* update tests

* update doc

* Update trl/trainer/judges.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update doc

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix alltruejudge type

* Refactor judge variable names and update test names

* Clarify judgment logic

* Fix invalid binary judgment check in AllTrueJudge class

* Fix invalid binary judgment check in AllTrueJudge class

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-11-18 16:54:57 +01:00
cbf9abcd07 🗺️ Implementation DiscoPOP Loss (#2323)
* Implement DiscoPOP Loss

* Updated DiscoPOP documentation

* Corrected docs/source/dpo_trainer.mdx

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_config.py

* Delete scripts directory

* style

* empty commit

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-11-18 14:15:00 +01:00
6f8fe59aeb 📃 Fix description for parameter "generate_during_eval" in dpo_config (#2364) 2024-11-18 14:03:02 +01:00
1293f37c5f 📉 Add PEFT support for PPOTrainer (#2344)
* Add peft/lora support for

* Fix: style

* Fix: typo

* Add ppo.py PEFT example

* Fixed the optional dependencies error

* skip peft test if peft is unavailable

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-11-18 11:54:09 +01:00
e7870dd5d6 🗃️ Use specified data_collator in RLOOTrainer and PPOTrainer (#2360)
* Fix "Use specified data_collator instead of hard-coding the option"

* Remove query_responses = [] since it's immediately overwritten afterwards.

* Use self.data_collator

* Use specified data_collator instead of hard-coded one in PPOTrainer

* Move the data_collator creation

* Run make precommit
2024-11-18 11:53:47 +01:00
21d5baf338 🔮 Inference mode in GeometricMixtureWrapper.forward (#2345)
* geom mixture model train

* use inference_mode
2024-11-18 09:58:26 +01:00
76dbb1a576 🪜 Stepwise supervision dataset type (#2148) 2024-11-18 09:58:00 +01:00
b8c9d9c7bc ⚖️ Add use_soft_judge option to WinRateCallback (#2347)
* add `use_soft_judge` option to WinRateCallback

* formatting

* Update trl/trainer/callbacks.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* renamed soft_win_rate to avg_win_prob

* Update trl/trainer/callbacks.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix tests

* keep original

* formatting

* Update tests/test_callbacks.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/callbacks.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update tests/test_callbacks.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update tests/test_callbacks.py

* fix test

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-11-15 15:49:43 +01:00
623963126b 👋 Remove deprecated tokenizer argument in BCO, GKD, Iterative SFT, Nash MD and XPO (#2349) 2024-11-12 09:22:17 -04:00
2d24d35013 Adding video llm fine-tuning example (#2336)
* adding video example

* exposing more parameters

* fixing formatting

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-11-12 12:56:38 +01:00
dde20b23cf 🖨️ Fix error text in BCO and KTO tokenizing function (#2286) 2024-11-11 19:18:36 -04:00
015321e135 👈 Add tokenizer arg back and add deprecation guidelines (#2348)
* Add deprecation and backward compatibility guidelines

* Update tokenizer argument in trainer classes

* Add warning message for TRL Judges API
2024-11-11 19:06:20 -04:00
454f36d951 💣 Remove transformers version check (#2343) 2024-11-11 09:34:26 -04:00
9b7f9f3519 🪡 Various RLOO fixes (#2325) 2024-11-11 08:59:03 -04:00
518e29ca9c 🫴 Better guide users in error reporting (#2327)
* update issue template

* Add checklist for bug report template

* Fix formatting in bug report template

* Update bug report template with additional instructions for code formatting and screenshots

* Update bug report template with code formatting instructions

* Update bug report template with code examples

* Update code block placeholder in bug report template

* Update .github/ISSUE_TEMPLATE/bug-report.yml

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-11-11 08:42:16 -04:00
ac7b6cfdfa 🧞 Add output_layer to the list of lm_head_namings in AutoModelForCausalLMWithValueHead (#2328) 2024-11-11 08:16:09 -04:00
0238d96c6f DPO trainer supports num_logits_to_keep to save memory (#2129)
* Support num_logits_to_keep, which computes necessary logits in the forward pass.

* update doc

* bug fix

* update

* check if model supports num_logits_to_keep

* ruff format

* update test file

* peft model support

* test passed

* update

* apply use_num_logits_to_keep

* fix num_logits_to_keep compute bug

* compare all outputs

* pytest

* pass test

* use check_min_version

* format

* test_dpo_trainer_use_num_logits_to_keep passed

* add some comments

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-11-10 11:34:51 +01:00
c86b51cd12 Bump liger-kernel to fix grad acc and more features (#2333)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-11-08 12:16:33 +01:00
ac77c09223 Fix gradient_checkpointing_kwargs assignment in examples (#2331)
Co-authored-by: Ping <ping.zhu@jmuse.cn>
2024-11-07 09:28:10 +01:00
7f2ccbe3a2 fix truncating index in DPOTrainer's concatenated_forward() (#2332) 2024-11-07 09:27:32 +01:00
74e20cbbbc 🪪 Check with token_id instead of token in DPOTrainer (#2324) 2024-11-04 21:08:41 +01:00
27b9e3a93f 🪧 Fix slack notification titles (#2322) 2024-11-04 21:02:27 +01:00
dc2b8b9e90 🧽 Fix judge documentation (#2320)
* Bump dev version to `0.13.0.dev0`

* Update version number to 0.12 in CITATION.cff

* 🧽 Fix judge documentation (#2318)

* Update judge examples and documentation

* without ':'

* Clean doc

* Fix typo in example code

* Add space after Attributes

* Update attribute name in judges.py

* Add installation instructions for llm-blender library

* Update PairRMJudge attributes documentation

* Fix return type in PairRMJudge

* Revert "🧽 Fix judge documentation (#2318)"

This reverts commit 337005d95169371935fb87f1c559c7412f8472a4.

* Revert "🧽 Fix judge documentation (#2318)"

This reverts commit 337005d95169371935fb87f1c559c7412f8472a4.

* 🧽 Fix judge documentation (#2318)

* Update judge examples and documentation

* without ':'

* Clean doc

* Fix typo in example code

* Add space after Attributes

* Update attribute name in judges.py

* Add installation instructions for llm-blender library

* Update PairRMJudge attributes documentation

* Fix return type in PairRMJudge
2024-11-04 19:00:27 +01:00
5e90682836 ⚰️ Remove deprecated args, script arguments, and PPOv2 (#2306)
* Remove deprecated args

* Remove deprecated args in SFTTrainer

* Remove deprecated script argument classes

* Remove deprecated PPOv2Config and PPOv2Trainer classes

* Commented out sync_ref_model line in test_trainers_args.py
2024-11-04 16:07:26 +01:00
3b439967f4 📰 Update blog posts in documentation (#2319)
* Bump dev version to `0.13.0.dev0`

* Update version number to 0.12 in CITATION.cff

* Add publication date to blog post

* 🧽 Fix judge documentation (#2318)

* Update judge examples and documentation

* without ':'

* Clean doc

* Fix typo in example code

* Add space after Attributes

* Update attribute name in judges.py

* Add installation instructions for llm-blender library

* Update PairRMJudge attributes documentation

* Fix return type in PairRMJudge

* Revert "🧽 Fix judge documentation (#2318)"

This reverts commit 337005d95169371935fb87f1c559c7412f8472a4.

* Update blog post publication dates

* revert to p5

* Update image URLs in index.mdx

* Sort and uniform thumbnail

* Update image alignment in index.mdx
2024-11-04 16:00:27 +01:00
2f34a161cd Bump dev version to 0.13.0.dev0 (#2305)
* Bump dev version to `0.13.0.dev0`

* Update version number to 0.12 in CITATION.cff

* 🧽 Fix judge documentation (#2318)

* Update judge examples and documentation

* without ':'

* Clean doc

* Fix typo in example code

* Add space after Attributes

* Update attribute name in judges.py

* Add installation instructions for llm-blender library

* Update PairRMJudge attributes documentation

* Fix return type in PairRMJudge

* Revert "🧽 Fix judge documentation (#2318)"

This reverts commit 337005d95169371935fb87f1c559c7412f8472a4.
2024-11-04 15:59:52 +01:00
6138439df4 🧓 Specify and test min versions (#2303)
* Add conditional check for LLMBlender availability in test_judges.py

* Fix import issues and update test requirements

* Remove unused imports

* Add require_peft decorator to test cases

* Fix import_utils module to use correct package name for llm_blender

* Found min version and test

* Update Slack notification titles

* Update dependencies versions

* Update GitHub Actions workflow to include setup.py and reorder file paths

* Revert "Update Slack notification titles"

This reverts commit be02a7f2de87905e86a847540770968d0416934a.

* Update Slack notification titles

* Remove pull_request branch restriction in tests.yml

* add check code quality back

* Fix PairRMJudge model loading issue
2024-11-01 00:26:53 +01:00
d57a181163 🧩 Add optimizer_cls_and_kwargs attribute to PPOTrainer and RLOOTrainer (#2302) 2024-10-31 23:10:11 +01:00
73c3970c1f 🙅 Ensure dependency optionality (#2301)
* Add conditional check for LLMBlender availability in test_judges.py

* Fix import issues and update test requirements

* Remove unused imports

* Add require_peft decorator to test cases

* Fix import_utils module to use correct package name for llm_blender
2024-10-31 22:37:49 +01:00
013a32b396 Remove stale bot (#2300) 2024-10-31 21:16:30 +01:00
24fb32733f 🔧 Use standard unittest assertion methods (#2283)
* WIP: Partial unit test update

* Update unittest format

* Update tests/slow/test_sft_slow.py comment

* Refactor unit tests: replace pytest.raises with self.assertRaises

* Fix: Restore accidentally deleted 'ref_model' parameter in DPOTrainer

* Re-run pre-commit

* fix: Incorrectly replacing non-TestCase assert

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-31 15:10:43 +01:00
bb56c6e6af 💾 Fix _save_checkpoint for online methods (#2288)
* Update trainer_utils import and save strategy in online_dpo_trainer.py

* fix back-compat for online-dpo

* better comment

* Update transformers dependency to commit f33904
2024-10-31 12:35:25 +01:00
06be6f409a 🖇️ Better dependency and partitioning of CI tests (#2298)
* clean deps

* new tests

* tests

* Add tests without optional dependencies workflow

* Update dependencies in tests.yml

* cpu version of torch

* Update dependencies and installation commands

* Disable fail-fast in test workflow

* Update test matrix in workflows file

* try fix windows

* Remove "rich" from required packages in setup.py

* Update dependency installation in tests.yml

* Add torch and deepspeed installation for windows-latest

* Fix conditional statement in workflow file

* Add torch and deepspeed installation for Windows

* Fix if statement

* Update torch and deepspeed dependencies

* Update liger package requirement for non-Windows platforms

* remove scipy dep

* Add torch GPU requirement for testing_utils

* Update trl/trainer/judges.py
2024-10-31 11:08:51 +01:00
b2696578ce 🍬 Use any reward model for online methods (#2276)
* Refactor reward processing in OnlineDPOTrainer

* Refactor completion decoding and reward processing

* remove strip

* remove warning

* Add reward_tokenizer to training script

* Add reward_tokenizer and reward_processing_class to OnlineDPOTrainer test

* propagate to xpo and nash

* style

* reduce memory requirement with inference_mode

* fix tests

* pairrm judge llmblender

* setUpClass(cls)

* Add setUpClass method to TestJudges class

* truncation left for reward tokenizer

* don't log completions without eval dataset

* only eval when possible
2024-10-28 16:21:40 +01:00
0ce3b65928 🔌 Fix type hint in LogCompletionsCallback (#2285)
* Update callbacks.py to fix a small Python type error

* Update callbacks.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-28 11:49:35 +01:00
e155cb8a66 ⛓️💥 Don't use eval_dataset in scripts when no eval strategy (#2270) 2024-10-28 11:40:51 +01:00
ea7a1be92c 🧮 Fix the computation of KL divergence loss (#2277) 2024-10-25 18:16:02 +02:00
110d0884c7 🏁 Add bos_token_id only if it exists (#2279)
Co-authored-by: sean.jung <sean.jung@sean-ai.local>
2024-10-25 18:15:08 +02:00
57ba9b93aa 🧘 Replace F.log(F.sigmoid(log_odds) with F.logsigmoid(log_odds) (#2274)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
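
A minimal pure-Python sketch of why a fused log-sigmoid is preferred over composing `log` with `sigmoid`. This is an illustration only, not code from the commit: the function names `naive_log_sigmoid` and `stable_log_sigmoid` are hypothetical, and scalar `math` calls stand in for tensor operations. For very negative inputs the composed form breaks down (here `math.exp` overflows; in tensor libraries the same composition typically yields `-inf` instead), while the fused form stays finite.

```python
import math


def naive_log_sigmoid(x):
    # Composed form: sigmoid first, then log.
    # For very negative x, exp(-x) overflows before the log is taken.
    return math.log(1.0 / (1.0 + math.exp(-x)))


def stable_log_sigmoid(x):
    # Fused form, using the identity
    #   log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|))
    # exp is only ever called on a non-positive argument, so it cannot overflow.
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Both forms agree for moderate inputs.
    print(naive_log_sigmoid(0.0), stable_log_sigmoid(0.0))
    # Only the fused form handles extreme inputs.
    print(stable_log_sigmoid(-1000.0))
```

The two forms are mathematically identical; the difference is purely in floating-point behaviour at the extremes.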
2024-10-24 20:51:55 +02:00
0de75b26f2 🧼 Refactor log_reports.py for Improved Logging, File Processing, and Slack Payload Handling (#2249)
* Update log_reports.py

* comments text update

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* emoji added

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update scripts/log_reports.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update scripts/log_reports.py

* style

* Update scripts/log_reports.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-24 20:48:12 +02:00
e615974a03 ♾️ Fix test generation max_new_tokens (#2272)
* `eval_strategy="steps" if eval_dataset else "no"`

* tmp skip test

* drop `eval_strategy` in `test_sft_trainer_uncorrect_data`

* remove eval strategy

* Add parameterized test for generate method

* Revert "`eval_strategy="steps" if eval_dataset else "no"`"

This reverts commit 1e8b331fa2c222a699cb3563f44f5702a7d6f50b.

* Revert "tmp skip test"

This reverts commit 44558f84cc43e20254b567d608b44d059a14913b.

* Revert "drop `eval_strategy` in `test_sft_trainer_uncorrect_data`"

This reverts commit a1ef7016286649fce10b3665159abcbfac2219e3.

* Revert "remove eval strategy"

This reverts commit cb7fafa874b108ba91b29f15944b7c4a41705d6d.

* style

* Refactor test_generate method in test_modeling_value_head.py

* `max_new_tokens=9`
2024-10-24 20:20:01 +02:00
c2bb1eed14 Add torch_dtype to model kwargs in reward modeling example (#2266)
Update model_kwargs to include torch_dtype.

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-24 20:12:26 +02:00
9c376c571f [Judges] use the pair-judges in online-preference trainers (#2243)
* use the pair-judges

* add test

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* decode and skip special characters

* initial nash

* return tensors

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* add back the logging

* use batch_decode

* add judges api to XPO trainer

* Update tests/test_online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* judge in examples

* judge in config

* add back logs when using reward model

* typo

* add back model_scores logging when using reward model

* log scores for reward model only

* better cond on what to log

* same for rlhf reward

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* use decode_and_strip_padding

* error if both reward and judge or none are set

* remove unused check

* Uniform way to pass conversation into judge

* heading -> leading

* LogCompletionsCallback compat with online method

* Update Online DPO doc

* check if data is conversational for judges

* update example

* remove comment

* use zip

* fix stats xpo

* Replace judge with PairRMJudge and import AutoModelForSequenceClassification

* update xpo documentation

* Remove doc duplication

* update nash doc

* XPO trl chat

* nash md doc

* HfPairwiseJudge

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-24 16:47:10 +02:00
16994738d0 Conversational dataset support for KTOTrainer (#2248)
* `get_batch_sample` -> `generate_from_model[_and_ref]`

* add `num_items_in_batch=None`

* `num_items_in_batch` in `training_step`

* Fix return type hint

* desc for unpair dataset util

* update example

* process in KTO

* Update doc

* KTO doc rewrite

* fix orpo doc

* add other dataset config names in test

* update doc image

* fix links in doc

* Update reward and log probability metrics in KTOTrainer doc

* skip enc-dec test

* Update docs/source/kto_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-24 14:01:41 +02:00
99225bb6d6 Bump the minimum transformers version to v4.46 (#2245)
* Bump the minimum transformers version

* Bump version in `requirements.txt`

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-24 10:42:30 +02:00
88be2c07e5 🚩 setup_chat_format: throw error if there is already a template in base model (#2252)
* setup_chat_format: throw error if there was already a template

* fix lint

* clarify in docs

* fix test?

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-22 13:29:32 +02:00
f2349d2af0 Adjust padding in batch generation (#2251)
* pad batch generation

* Use pad utility

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/utils.py

* reshaping

* fix test_utils.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-22 09:36:43 +02:00
d843b3dadd Use processing_class instead of tokenizer in LogCompletionsCallback (#2261) 2024-10-22 09:35:04 +02:00
84dab850f6 🧽 Fix typo in dataset format doc (#2259)
doc update
2024-10-21 17:06:19 +02:00
92f6d246d3 🏗️ Refactor DPO data processing (#2209)
* in progress

* refactor concatenated_inputs and concatenated_forward

* progress

* further modif

* padding side

* eos prompt enc dec

* prompt_padding_side

* drop prompt padding side collator

* working on decoder only

* dpo trainer

* Fix loss_mask type conversion bug

* bad attention mask

* try to get the same tokens as main

* fix loss mask

* fix unused col

* added comment

* raise error when padding token not set

* remove private method tests

* initial vlm support

* make it work for paligemma

* minor test updates

* style

* improve readability

* improve doc

* style

* flush left and truncate

* flush left in the code

* fix empty_cols and make max_length optional

* always add eos token

* minor changes and doc

* style

* fix docstring

* preference collator in doc

* fix doc

* optional max_completion_length

* Investigating CI failing

* style

* just dpo trainer test

* just idefics

* paligemma

* llava

* test cli

* dataset in test

* all tests

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

* reference to ref

* rich descriptions

* fix logits reporting

* fix truncation

* remove chat template from dpo_vlm

* `get_batch_sample` -> `generate_from_model[_and_ref]`

* add `num_items_in_batch=None`

* `num_items_in_batch` in `training_step`

* Fix return type hint

* test tokenize row

* fix test

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-21 12:47:33 +02:00
31b7820aad 🔀 Rename get_batch_sample and add num_items_in_batch to compute_loss (#2246) 2024-10-18 21:02:24 +02:00
b9aa965cce Enhance log report script: add error handling and logging (#2232)
* Update log_example_reports.py

1. Added logging: Imported the logging module and set up a logger in the main function. This allows for better error tracking and debugging.

2. Improved file reading: Used a with statement to ensure the file is properly closed after reading. Also added error handling to catch and log any issues when reading the file.

3. Error handling for Slack SDK import: Added a try-except block to handle cases where the slack_sdk might not be installed.

4. Enhanced Slack message sending: Added error handling and logging for the Slack message sending process. This will help identify any issues with the Slack integration.
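
The four changes above can be sketched roughly as follows. This is a hedged illustration of the pattern the commit message describes, not the script's actual contents: the helper names `read_report` and `send_to_slack`, their parameters, and the log messages are assumptions made for the example. Only `slack_sdk.WebClient` and its `chat_postMessage` method are real library calls.

```python
import logging

logger = logging.getLogger(__name__)


def read_report(path):
    """Return the file's text, or None if it cannot be read (hypothetical helper)."""
    try:
        # The with statement closes the file even if reading fails.
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as exc:
        logger.error("Could not read %s: %s", path, exc)
        return None


def send_to_slack(text, token, channel):
    """Post text to Slack; return True on success, False otherwise (hypothetical helper)."""
    try:
        # slack_sdk is treated as optional, so the import is guarded.
        from slack_sdk import WebClient
    except ImportError:
        logger.error("slack_sdk is not installed; skipping Slack notification")
        return False
    try:
        WebClient(token=token).chat_postMessage(channel=channel, text=text)
    except Exception as exc:  # log any failure from the Slack call instead of crashing
        logger.error("Failed to send Slack message: %s", exc)
        return False
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    report = read_report("report.txt")  # placeholder path for the example
    if report is not None:
        logger.info("Read %d characters from the report", len(report))
```

Returning `None` or `False` rather than raising keeps a missing file or a missing optional dependency from aborting the whole reporting step.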

* style

* Update log_reports.py

1. Logging: Added logging to track errors and important events.

2. Error Handling: Wrapped the log file processing in a try-except block to handle potential errors gracefully.

3. Logging Total Failed Tests: Added a log statement to report the total number of failed tests.

* style

* further improve

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-18 19:40:30 +02:00
a67f2143c3 Update SFT examples (#2244) 2024-10-17 14:11:46 +02:00
494b4afa10 [CLI] Setting capture output to False (#2239)
* setting capture output to False

* Update trl/commands/cli.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-17 11:04:23 +02:00
02f4e750c0 DPO support remove_unused_columns (#2233) 2024-10-16 10:00:27 +02:00
2ba3005d1c Updated ScriptArguments warning messages (#2230) 2024-10-15 07:46:58 +02:00
7e394b03e8 🎭 Deprecate [SFT/DPO/Reward]ScriptArguments in favour of ScriptArguments (#2145)
* `DPOScriptArguments` to `ScriptArguments`

* use dataset_train_split

* Use scriptarguments

* dataset names in command lines

* use `ScriptArguments` everywhere

* ignore bias buffer to end

* remove in v0.13

* rm comment

* update test commands

* Update docs/source/rloo_trainer.md

* Update tests/test_rloo_trainer.py

* Added dataset_train_split argument to ppo.py and rloo.py

* update scripts with dataset_train_split
2024-10-14 11:14:58 +02:00
14f3613dac Update commands for code linting in contributing guidelines (#2225)
* update commands for code linting in contributing guidelines

* update docs on code formatting in contributing guidelines

* fix markdown rendering error

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* "sans" -> "without"

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-13 09:22:24 +02:00
5e24101b36 📒 Fix type/format confusions (#2223) 2024-10-11 23:39:19 +02:00
b81a6121c3 Add GKD to dataset_formats.mdx (#2222)
* Update dataset_formats.mdx

* Update dataset_formats.mdx

* Update docs/source/dataset_formats.mdx

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Modified to Prompt-completion

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-11 21:52:20 +02:00
7f0d246235 Add Sequence-Level KD (#2220)
* Fix templates for dpo, etc.

* Update dpo.py

Add the third issue fix

* make this a utility.

* Add Sequence-Level KD

* add to the docs-strings and the documentation

* reviewed

* Update docs/source/gkd_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-10-11 20:14:09 +02:00
70036bf87f 🕊️ Migration PPOv2 -> PPO (#2174)
* delete old ppo

* rename ppov2 files

* PPOv2 -> PPO

* rm old doc

* rename ppo doc file

* rm old test

* rename test

* re-add v2 with deprecation

* style

* start update customization

* Lion

* Finish update customization

* remove ppo_multi_adapter

* remove ppo example

* update some doc

* rm test no peft

* rm hello world

* processing class

* Update docs/source/detoxifying_a_lm.mdx

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* Update trl/trainer/ppov2_config.py

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* Update docs/source/customization.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/detoxifying_a_lm.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* po to example overview

* drop lion

* remove "Use 8-bit optimizer"

* Update docs/source/customization.mdx

* Update docs/source/customization.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* it applies to all trainers

---------

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-11 17:28:39 +02:00
d0aa421e5e Conversational dataset support for ORPOTrainer (#2184)
* default learning rate

* update trainer

* update test

* update script

* update dataset format

* add line in dpo doc

* update orpo doc

* refine implicit/explicit

* update demo chat
2024-10-11 17:08:28 +02:00
5375d71bbd trl env report all cuda devices (#2216) 2024-10-11 16:32:34 +02:00
6004e033a4 Updated README.md with CLI examples and additional usage instructions (#2199)
* Updated README.md with CLI examples and additional usage instructions

Added Command Line Interface (CLI) examples for SFT, DPO, and Chat features.
Improved the "How to Use" section by providing code examples for SFTTrainer and RewardTrainer.
Included installation instructions for both Python Package and source-based installation.
Refined highlights to better showcase efficiency and scalability features.
Updated the repository clone instructions for working with examples.
Added new links to CLI documentation and contribution guide for better navigation.

* Update README.md

* Update README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update README.md

* update badges

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-11 16:31:38 +02:00
f436c3e1c9 Update README.md (#2180)
* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* Update README.md

* Update README.md

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
2024-10-11 16:14:46 +02:00
cd1aa6bdcc [Judges] Soft judges for PairRM (#2221)
* initial soft judges

* add soft-judge to PairRM

* remove comments

* fix from review
2024-10-11 15:53:42 +02:00
b3f93f0bad Report to "none" in GKD test (#2214) 2024-10-10 19:05:55 +02:00
6c32c8bfcd Improve slack reporting (#2182)
* Update log_example_reports.py

1. Added logging: Imported the logging module and set up a logger in the main function. This allows for better error tracking and debugging.

2. Improved file reading: Used a with statement to ensure the file is properly closed after reading. Also added error handling to catch and log any issues when reading the file.

3. Error handling for Slack SDK import: Added a try-except block to handle cases where the slack_sdk might not be installed.

4. Enhanced Slack message sending: Added error handling and logging for the Slack message sending process. This will help identify any issues with the Slack integration.
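
The four points above could look roughly like the sketch below. It is only an illustration of the described pattern, not the actual contents of log_example_reports.py; the helper names `read_report` and `send_to_slack` and their parameters are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def read_report(path):
    """Return the file's text, or None if the file cannot be read."""
    try:
        # The with statement closes the file even if reading fails.
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as exc:
        logger.error("Could not read %s: %s", path, exc)
        return None


def send_to_slack(text, token, channel):
    """Post `text` to Slack; return True on success, False otherwise."""
    try:
        # Imported lazily so a missing slack_sdk does not crash the script.
        from slack_sdk import WebClient
    except ImportError:
        logger.error("slack_sdk is not installed; skipping Slack report")
        return False
    try:
        WebClient(token=token).chat_postMessage(channel=channel, text=text)
        return True
    except Exception as exc:
        logger.error("Failed to send Slack message: %s", exc)
        return False
```

Returning `None` or `False` instead of raising lets a CI reporting step log the problem and continue, which seems to be the intent of the change.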

* style

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-10 17:42:06 +02:00
3107a40f16 Update incorrect data processing in DataCollatorForChatML (#2172)
* Update incorrect data processing in DataCollatorForChatML

Fix the extra BOS token and the absence of an EOS token in the returned input_ids, and potentially the absence of a target string in the returned labels.

* Update trl/trainer/utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* style

* move comment

* add test for DataCollatorForChatML

* update comment with more details

* update assert reports and comments, and add verification that the last token of input_ids is the EOS token

* new line at the end of file for code quality

* Update tests/test_utils.py

* Update tests/test_utils.py

* Update tests/test_utils.py

* update tests

* fix test

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* formatting

* fix typo

* simplify

* Revert "simplify"

This reverts commit 7e4006c87265665183032932ca05dffef567e38b.

* tokenize full messages

* dont add eos

* eos is in the last token

* simplify DataCollatorForChatML

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-10 12:49:10 +02:00
419791695c Drop decoder_input_ids in DPOTrainer (#2208) 2024-10-10 10:20:40 +02:00
7e5924d17e [GKD] interpolate in prob. space (#2204)
* interpolate in prob. space

* better var names

* use logsumexp

* set beta dtype

* beta tensor
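
As a rough illustration of "interpolate in prob. space" with "use logsumexp": a mixture of two distributions can be formed from their log-probabilities without leaving log space. The sketch below uses only the standard library and scalar values. The names `log_p`, `log_q` and `mixture_log_prob` are hypothetical, and which of the student or teacher distributions receives the weight `beta` in the trainer is not stated here, so that is left open.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def mixture_log_prob(log_p, log_q, beta):
    """Return log(beta * p + (1 - beta) * q) for 0 < beta < 1.

    Adding log(beta) and log(1 - beta) inside logsumexp interpolates the
    probabilities themselves, not the log-probabilities.
    """
    return logsumexp([log_p + math.log(beta), log_q + math.log(1.0 - beta)])
```

Interpolating the log-probabilities directly (`beta * log_p + (1 - beta) * log_q`) would instead give an unnormalized geometric mixture, which is a different quantity.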
2024-10-09 12:13:18 +02:00
ed9ea74b62 [DPO] Adding weighted preference optimization (WPO) (#2141)
* skeleton

* add weighting arg in config

* formatting

* fix doc

* do not compute gradients in weighting term

* fixed detach

* add WPO doc

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-08 19:52:54 +02:00
511c92c91c Get the aux_loss_coef at BCOTrainer, CPOTrainer, KTOTrainer, and ORPOTrainer initialization (#2201)
* Fix aux_loss coefficient bug of BCOTrainer

* Fix aux_loss coefficient bug of CPOTrainer

* Fix aux_loss coefficient bug of KTOTrainer

* Fix aux_loss coefficient bug of ORPOTrainer
2024-10-08 16:17:09 +02:00
c6cb6353a5 Get the aux_loss_coef at DPOTrainer initialization (#2200) 2024-10-08 16:06:48 +02:00
adb3e0560b ♾️ [CI] Use transformers from source in "tests_no_optional_dep" (#2198) 2024-10-08 12:19:04 +02:00
adf58d80d0 skip_prompt=True in TextIteratorStreamer (#2193)
* skip_prompt in `TextIteratorStreamer`

* Update trl/commands/cli.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update generation streamer in chat.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-07 17:38:40 +02:00
9aa022503c Update README.md (#2186)
* Update README.md

Fix grammatical errors in README.md
fixes issue #2185

Description:

I found a grammatical error in the README.md of the project. This PR fixes the error to improve the overall readability and clarity of the documentation.

Changes:
Corrected grammatical errors
Updated lines to reflect the correct grammar
Reasoning: The original text contained a grammatical error that could confuse readers. This fix ensures that the documentation is accurate and easy to understand.

Closes #2185

* Update README.md

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
2024-10-07 14:30:00 +02:00
82ad390caf Fix RLOO checkpointing (#2114)
* Fix RLOO checkpointing for transformers>=4.45.0

* Add missing import

* Fix pre-commit issues

* Added test for RLOO checkpointing

* Ensure that tokenizer matches SFT and Reward model

* Pre-commit formatting

* processing class

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-10-07 13:11:17 +02:00
ac038ef03a Update CONTRIBUTING.md (#2181)
* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-07 06:56:19 -04:00
51ca76b749 [CI] fix dpo gpu ci tests (#2189)
* fix dpo ci test

* color-blind
2024-10-07 10:59:43 +02:00
7005ab4d11 🃏 Model card: "unsloth" tag (#2173) 2024-10-07 10:57:05 +02:00
ffb1ab74ba Update documentation CLI Chat (#2191) 2024-10-07 10:33:51 +02:00
47d08a9626 Rename trainer arg tokenizer to processing_class (#2162) 2024-10-07 09:39:32 +02:00
70327c18e6 add trl to tag for models (#2178) 2024-10-07 08:12:44 +02:00
f05c3fa8fc minor KTO setting changes + KL batch size (#2153)
* add argument for dropout

* increase default lr

* change default lr in examples

* fix bug in calculation of KL batch size

* KL batch size should be args.per_device_train_batch_size

* Update kto_trainer.mdx with hparam recs

* typo

* allow dropout to be disabled

* update lr in sample script

* Update kto_config.py

* Update trl/trainer/kto_trainer.py

* Update docs/source/kto_trainer.mdx

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-06 13:13:11 +02:00
4799ba4842 Capybara replaced with ultrafeedback_binarized (#2183) 2024-10-05 18:49:48 +02:00
d45c86e2a7 Conversational dataset support for CPOTrainer (#2144)
* extract prompt and apply chat template in cpo trainer

* default learning rate

* simplify example

* update doc

* test all formats

* extend extract prompt

* improve doc format

* link in dataset formats

* Update docs/source/cpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/cpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-10-04 18:01:02 +02:00
c6b0d1358b 🗑️ Set deprecation version for DPO and SFT arguments to version 0.13 (#2170) 2024-10-04 17:46:55 +02:00
3321084e30 Update trl version in CITATION.cff (#2171) 2024-10-04 12:24:09 +02:00
a9cffc7caf Default dataset_text_field to "text" (#2078)
* clarify ConstantLengthDataset usage

* dont provide dataset text field when formatting func is provided

* kto maybe_apply_chat_template

* default text field

* doc

* remove maybe_apply_chat_template from kto example

* dataset text field always a str

* remove `dataset_text_field="text"`

* update doc
2024-10-04 10:55:47 +02:00
32a928cfc2 🏷️ Model badges in trainer documentation (#2160) 2024-10-04 10:55:06 +02:00
1a3bb372ac Fix typo in error message (#2168)
occured -> occurred
2024-10-04 09:36:52 +02:00
d4564b7c64 ↩️ Revert tokenizer hotfix #2163 2024-10-04 00:14:12 +02:00
1be4d86ccc 🩹 [Hotfix] Add setter for tokenizer (#2163) 2024-10-03 16:13:50 +02:00
78249d9de4 Conversational dataset support for DPOTrainer (#2131)
* conversational dataset support for dpo

* support standard dataset for extract prompt

* test standard dataset for extract prompt

* fix maybe

* fix maybe apply prompt

* style

* overwrite default learning rate of DPO

* style

* rlaif script

* `writer_batch_size` in `train_test_split`

* initial dpo doc refactoring

* vision data section in doc

* lil format modif

* refine Vision datasets

* refine doc

* test new loss type format

* restructure loss function

* table loss type

* simplify `unsloth`

* improve doc

* logged metrics up

* refine loss section

* Fix label_smoothing parameter in DPOConfig

* dataset for test

* update readme

* Update docs/source/dpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* try colorized code block

* refine doc style

* further refine doc

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* re add pali gemma test

* Add missing period

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-10-02 10:04:03 +02:00
5c21de30ae [CI] Don't use eval_strategy="steps" when no eval dataset (#2152)
* `eval_strategy="steps" if eval_dataset else "no"`

* tmp skip test

* drop `eval_strategy` in `test_sft_trainer_uncorrect_data`

* remove eval strategy
2024-10-01 21:46:41 +02:00
0a566f0c58 🩹 Fix attention mask warning in chat CLI (#2147)
* explicit attention mask

* fix chat command
2024-10-01 10:53:18 +02:00
de3876577c [GKD] Set custom EOS tokens in generation config (#2142)
* Expose EOS token IDs in GKD generation

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Revert

* Refactor EOS token setting

* Remove EOS from config

* Refactor

* Add unit test

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-30 13:53:16 +02:00
1201aa61b4 rename example (#2139) 2024-09-27 21:45:21 +02:00
c00722ce0a 🃏 Model card for TRL (#2123)
* template and util

* test for online dpo

* template in package_data

* template in manifest

* standardize push_to_hub

* wandb badge and quick start

* bco

* xpo

* simplify `create_model_card`

* cpo

* kto

* dpo

* gkd

* orpo

* style

* nash-md

* alignprop

* bco citation

* citation template

* cpo citation

* ddpo

* fix alignprop

* dpo

* gkd citation

* kto

* online dpo citation

* orpo citation

* citation in utils

* optional citation

* reward

* optional trainer citation

* sft

* remove add_model_tags bco

* Remove unnecessary code for adding model tags

* Fix model tag issue and update URL format

* Remove unused code for adding model tags

* Add citation for XPOTrainer

* Remove unused code in SFTTrainer

* Add model card generation in RLOOTrainer

* Remove unused import and method call in reward_trainer.py

* Add model card generation

* Remove unused code and update error message in ORPOTrainer class

* Add import statements and create model card in IterativeSFTTrainer

* Add dataset name to push_to_hub() call

* Update trainer.push_to_hub() dataset names

* script args

* test

* better doc

* fix tag test

* fix test tag

* Add tags parameter to create_model_card method

* doc

* script args

* Update trl/templates/model_card.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* unittest's `assertIn` instead of `assert`

* Update trl/templates/model_card.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-27 15:23:05 +02:00
124189c86a Add correct label for WinRateCallback table (#2134)
Small fix to make it clear in WandB which table is which
2024-09-27 10:33:41 +02:00
d5eeaab462 arXiv to HF papers (#2133) 2024-09-27 09:00:49 +02:00
5368be1e1e 🧹 Style (#2132)
* drop `# flake8: noqa` in examples

* `__init__.py`

* fix init

* unwrap_model_for_generation

* ignore import violation in init
2024-09-26 21:02:48 +02:00
b169e1030d Add table for WinRateCallback (#2116)
* Add table for WinRateCallback

* Fix tests

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Refactor

* Remove super

* Clean

* Clean

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-26 19:28:44 +02:00
9af4734178 ♻️ Standardize script_args (#2130) 2024-09-26 15:23:42 +02:00
a0d714949f Tokenize row during training_step in OnlineDPOTrainer (#2117)
* tokenize while training

* same for nashmd and xpo

* Update trl/trainer/online_dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-26 11:58:14 +02:00
a0e28143ec Eos token encouragement Clarification (#2128)
* Update nash_md_trainer.md

* Update online_dpo_trainer.md

* Update xpo_trainer.mdx

* Fixing XPO Script Location
2024-09-26 11:47:48 +02:00
32d9d34eb1 Standardize pushing to Hub in examples (#2126) 2024-09-26 10:00:51 +02:00
fb1b48fdbe Remove max_length from RewardDataCollatorWithPadding (#2119) 2024-09-26 09:59:12 +02:00
b5e4bc5984 Update example_overview.md (#2125) 2024-09-25 20:45:57 +02:00
7a24565d9d Generalizes VSFT script to support REDACTED (#2120)
* generalizes vst script

* precommit

* change launch command to use accelerate

* updates docs

* rename to sft_vlm

* fix script location

* fix formatting

* comma

* add model link

* fix name

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-09-25 19:54:44 +02:00
44a06fc487 BCOTrainer conversational dataset support (#2107)
* update test

* maybe_apply_chat_template

* simplify bco example

* Update documentation

* Update examples/scripts/bco.py

* Update docs/source/bco_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-24 18:15:57 +02:00
a84fc5d815 Fix packing test (#2111)
* Fix pack test

* same for eval
2024-09-24 17:12:54 +02:00
80038a5a92 [online-dpo] allow parse-args as list of floats (#2108)
* use a separate argument for list of floats

* do super first

* fix docstrings

* typos

* use list of floats only

* check if it has len

* fix docstring

* fix suggestion

* fix default

* Update trl/trainer/online_dpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/xpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/nash_md_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/nash_md_config.py

* additional tests

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-24 16:56:27 +02:00
cece86b182 fix formatting (#2109)
* fix formatting

* formatting
2024-09-24 16:05:55 +02:00
d005980d8b Fix documentation links (#2105) 2024-09-24 15:35:29 +02:00
cc23b511e4 [RewardTrainer] Tokenize inputs within trainer (#2102)
* Pretokenize in reward modelling

* Fix README example

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Move chat template formatting inside trainer

* Refactor tests

* Fix README

* Disable wandb

* Update readme

* add comment `remove_unused_columns`

* Update trl/trainer/reward_config.py

* doc

* implicit*

* explicit

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-24 13:03:32 +02:00
2cad48d511 [CLI] trl env for printing system info (#2104) 2024-09-24 09:57:24 +02:00
6859e048da Fix PPO/RLOO examples (#2100) 2024-09-23 11:49:36 +02:00
92eea1f239 Clean up README and remove openrlbenchmark dependency (#2085)
* Clean up README

* Add Kashif and Quentin

* Refactor

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Add citation

* Omit benchmarks from dev install

* Remove openrlbenchmark

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-23 09:21:41 +02:00
663002f609 KTO: fix logits metric, add logits metric to BCOTrainer too (#2094)
Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
2024-09-21 19:08:10 +02:00
44d998b2af Fix _process_tokens for empty prompts in KTOTrainer (#2093)
The function _process_tokens in trl/trainer/kto_trainer.py crashes if prompt_input_ids is an empty list.
- added a check for nonzero length
- added a check for nonzero length of answer_input_ids for consistency

The checks are made when deciding whether to subtract 1 from max_length (which happens when BOS or EOS is already present).
2024-09-21 12:49:54 +02:00
9b80f3d50c fix: device could be in meta, transformers#33154 (#2089)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2024-09-21 09:11:34 +02:00
2038e52c30 Fix typo in orpo example. (#2092) 2024-09-21 09:11:01 +02:00
10c2f63b2a training_args for all TrainingArguments (#2082) 2024-09-19 15:03:47 +02:00
9fb871f62f [SFT] fix neftune_noise_alpha in SFTTrainer (#1841)
* fix neftune_noise_alpha

* del neftune_noise_alpha first

* check len after removing handle

* make sure we do not load twice

* Update trl/trainer/sft_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* remove neftune from SFTTrainer as the superclass has it now

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-19 11:57:36 +02:00
3cec013a20 Bump dev version 2024-09-19 10:47:21 +02:00
cc80ac6b47 Fix DeepSpeed for PPOv2Trainer.save (#2080) 2024-09-19 09:29:57 +02:00
4c0c98d950 Standardize dataset naming (#2081)
* `ds`, `raw_dataset` etc -> `dataset`

* Update docs/source/detoxifying_a_lm.mdx
2024-09-19 08:59:28 +02:00
0d2bee51aa [WIP] Fix logits/chosen and logits/rejected metrics in KTOTrainer (#2077)
* fix metrics

* fix formatting

* fix "#" sign
2024-09-18 21:09:21 +02:00
6920c2d1bb Conversational dataset support for Online DPO (#2075)
* first modifications in the documentation

* Add script for processing ultrafeedback prompt dataset

* Remove unused variable in ultrafeedback.py

* style

* apply chat template within the init

* extend test

* new default lr

* nash md and xpo conv test

* Update prompt length check to 512 characters

* remove `maybe_apply_chat_template` in XPO and Nash examples

* polish online dpo doc

* better section name

* LogCompletionsCallback doc

* optional generation config

* reorder stats (consistency with online dpo)

* update online dpo doc

* format online dpo config

* format nash_md config

* update nash md

* Nash MD -> Nash-MD

* xpo doc

* doc
2024-09-18 14:10:38 +02:00
4d8267610f Use wrapped model for reference completions in WinRateCallback and set default freq to eval_steps in LogCompletionsCallback (#2074)
* Use wrapped model for reference completions

* Add unit test for LoRA

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Fix quality

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-18 13:55:49 +02:00
c3143832cb processor(prompt, images=image) to processor(images=image, text=prompt) (#2076)
* `prompt, images=image` to `images=image, text=prompt`

* special case of model being str in BCO
2024-09-17 12:09:16 +02:00
e74dbf2d6a Added error when ref_model and model have same id (#2057)
* Added error check to RLOO, PPOv2, OnlineDPO that ref_policy and policy should have different identities.

* Update online_dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* style

* extend to other trainers

* bco as well

* case models are strings

* add tests

* style
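
A minimal sketch of the kind of identity check described above, assuming both arguments are already instantiated model objects. The function name `check_distinct_models` and the error text are hypothetical, and the case where models are passed as strings is not covered here.

```python
def check_distinct_models(model, ref_model):
    """Raise if the policy and the reference policy are the same object."""
    # `is` compares object identity: two separately loaded copies of the
    # same checkpoint pass this check, while one shared object does not.
    if ref_model is model:
        raise ValueError(
            "`model` and `ref_model` cannot be the same object. "
            "Pass a separate copy of the model as the reference."
        )
```

Sharing one object would mean the reference policy is updated along with the policy, so a comparison against it would no longer measure drift from the starting point.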

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-17 10:48:32 +02:00
41fe228654 Minor doc fixes and comments (#2073)
* Sort toctree

* rm trainer.mdx

* add missing `

* comment

* online dpo
2024-09-16 16:42:22 +02:00
07f0e687cb Use transformers utilities when possible (#2064)
* use transformers' availability functions

* require from transformers

* rm file

* fix no peft

* fix import

* don't alter  _peft_available

* fix require_diffusers

* style

* transformers>=4.40 and add back `is_liger_kernel_available`
2024-09-16 15:56:49 +02:00
dc2bd07408 Nash md (#1853)
* initial skeleton

* initial config and class

* move TrainerCallback to callbacks.py

* initial trainer mockup

* formatting

* add back header

* script with reward model

* call ref policy forward with torch no_grad

* fix api

* clean up the configs

* use the new API

* fix typo

* get get_reward without grads

* remove unused no_grad calls

* fix formatting

* initial GeometricMixtureWrapper

* Update trl/models/modeling_base.py

Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>

* undo changes to callback

* GenerationMixin needs generation_config

* calculate score with model and mixture model outputs

* fix scores and mixture_scores tensors

* undo

* use interleaved version to calculate chosen-rejected

* Revert "use interleaved version to calculate chosen-rejected"

This reverts commit 4a63a60971a7db173d10771548f17f650d955c2a.

* fix mixture scores

* Fix global step

* use mixture_coeff

* record scores_margin only

* fix del

* First version of Nash MD trainer

* undo

* fix formatting

* fix toc

* initial refactoring

* mixin fixes

* fix refactoring

* cleanup comments

* add log_stats

* add test

* initial docs

* fix logs

* fix missing_eos_penalty

* fix output_dir

* add peft_config to docs and super

* undo init changes

* Update docs/source/_toctree.yml

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/nash_md_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* add dataset format

* add authors

* add dynamic parameter callback

* update test

* fix comments

* test GeometricMixtureWrapper

* header

* formatting

* formatting

* add paper and abstract

* Update docs/source/nash_md_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* DynamicParameterCallback

* drop callback in favor of getter

* revert kto config change

* revert kto config change

* fix contribution

* `coeff` to `coef`

* log dynamic coefs

* Update docs/source/nash_md_trainer.md

* Update docs/source/nash_md_trainer.md

* fix tests

* use self.ref_model

* one-line

---------

Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Co-authored-by: Daniil Tiapkin <daniil.tiapkin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-16 13:46:52 +02:00
cdafc9333c [KTO] Overrides default learning_rate in KTOConfig (#2070)
* learning rate recommendations for kto

* update from suggestion

* override default lr

* add tip tag

* Update trl/trainer/kto_config.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-16 12:24:43 +02:00
40f05226de Standardizing datasets for testing (#2065)
* zen dataset

* Update dataset test bco

* some tests

* Simple chat template

* bco

* xpo

* kto

* gkd

* trainer_args

* sft

* online dpo

* orpo

* zen script
2024-09-14 22:34:15 +02:00
f6c664301d remove min_new_tokens=args.max_new_tokens (#2069) 2024-09-14 19:37:12 +02:00
08ba866c86 Fix dataset in GKD script (#2067)
I added the wrong dataset name in a prior commit 🙈
2024-09-14 12:29:13 +02:00
ebc85b2e39 PEFT support for Online DPO (#2041)
* Promote `PPOv2Trainer` and `PPOv2Config` to top-level import

* Deprecate `PPOTrainer` and `PPOConfig`

* changes

* Revert "Promote `PPOv2Trainer` and `PPOv2Config` to top-level import"

This reverts commit 96ae02a54154acd2c5c3cc873af3519fedd33d0b.

* Revert "Deprecate `PPOTrainer` and `PPOConfig`"

This reverts commit 65990deb81df1dcaeb2245f01582e8bb45511335.

* peft

* peft

* try to simplify

* revert utils changes

* update dpo script

* peft

* style

* revert gitignore

* test_online_dpo_peft

* ref model

* peft example command

* typo

* remove param.requires_grad = False for the reward model

* make `model` required arg

* update example script

* update xpo trainer

* Update examples/scripts/dpo_online.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/dpo_online.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* merge and unload

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-13 19:15:18 +02:00
88bede66fc Standardise API for WinRateCallback and LogCompletionsCallback (#2061)
* Use wrapped model

* Make WinRateCallback work

* Make LogCompletions work

* Make LogCompletions work

* Fix scripts

* Fix path

* Refactor

* Remove padding

* Refactor

* Fix docs

* Fix scripts

* Fix TLDR template

* Use explicit args

* Fix callback import

* Add docstring
2024-09-13 17:38:42 +02:00
7a2bbe3957 Shuffle examples before they are packed (#2037)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-13 14:23:24 +02:00
d47220f299 make cuda-only tests device-agnostic (#2044)
* update code

* update

* fix style

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-13 14:23:12 +02:00
d8324924c8 Support for SFTTrainer.evaluate() and SFTTrainer.predict() with null train_dataset (#2004)
* add null train_dataset check

* Fix pre-commit errors

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-13 14:22:43 +02:00
4c92ba5769 ©️ Copyrights (#2063)
* copyrights

* fail if missing
2024-09-13 14:18:47 +02:00
a5b98fcf97 Mask loss in gkd when generating from the student (#2058)
* mask loss in gkd

* fix minor issue in test

* Update tests/test_gkd_trainer.py

* fixing masking issues

* Update tests/test_gkd_trainer.py

* Update tests/test_gkd_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-09-13 11:30:59 +02:00
e51a5ac985 Add missing autodocs (#2056) 2024-09-11 21:54:28 +02:00
31b93876a7 📝 Document dataset format (#2020)
* first piece of doc

* improve readability

* some data utils and doc

* simplify prompt-only

* format

* fix path data utils

* fix example format

* simplify

* tests

* prompt-completion

* update anthropic hh

* update dataset script

* implicit prompt

* additional content

* `maybe_reformat_dpo_to_kto` -> `unpair_preference_dataset`

* Preference dataset with implicit prompt

* unpair preference dataset tests

* documentation

* ...

* doc

* changes applied to dpo example

* better doc and better log error

* a bit more doc

* improve doc

* converting

* some subsections

* converting section

* further refinements

* tldr

* tldr preference

* rename

* lm-human-preferences-sentiment

* `imdb` to `stanfordnlp/imdb`

* Add script for LM human preferences descriptiveness

* Remove sentiment_descriptiveness.py script

* style

* example judge tldr with new dataset

* Style

* Dataset conversion for TRL compatibility

* further refinements

* trainers in doc

* top level for functions

* stanfordnlp/imdb

* downgrade transformers

* temp reduction of tests

* next commit

* next commit

* additional content

* proper tick format

* precise the assistant start token

* improve

* lower case

* Update titles in _toctree.yml and data_utils.mdx

* revert make change

* correct dataset ids

* expand a bit dataset formats

* skip gated repo tests

* data utilities in API

* Update docs/source/dataset_formats.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/dataset_formats.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/dataset_formats.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/dataset_formats.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* tiny internal testing for chat template testing

* precise type/format

* exclude sft trainer in doc

* Update trl/trainer/utils.py

* XPO in the doc

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-11 20:11:25 +02:00
85696aa64c Gkd trainer (#1814)
* initial

* initial gkd script

* fix output dir name

* smaller max_new_tokens_response size

* fix tab

* use temperature from config

* initial docs

* initial test

* add generalized_jsd_loss

* some docs

* fix order of interpolation

* use log_target=True

* fix formatting

* docstrings

* add peft example

* more docs

* formatting

* fix ordering

* use unwrap_model_for_generation

* initial DataCollatorForLastCompletionLM

* add generation inputs

* logits from the completions

* add eps to probs

* select the logits after removing the padding

* formatting

* interpolate log_probs

* add back online sampling

* update tests

* fix typos

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/_toctree.yml

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* use Qwen2

* Update trl/trainer/gkd_config.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_config.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update tests/test_gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* fixes

* renamed lamda to lmbda due to keyword

* fix config name

* move collator to utils

* fix formatting

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* the larger the lmbda the more on policy it should be

* Use JSD instead of KL

* use DataCollatorForChatML

* fix labels

* use torch_call

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* set default collator to DataCollatorForChatML

* return only the prompts

* fix labels of generated outputs

* formatting

* fix comment

* add missing _prepare_deepspeed

* no attention mask when generating

* update test

* set a sensible max_seq_length

* set default in the collator

* Update tests/test_gkd_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update tests/test_gkd_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix padding

* formatting

* Update tests/test_gkd_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix tests

* TestGeneralizedJSDLoss

* fix typos

* use a mask to calculate jsd loss

* use the super() training_step after the inputs are created

* fix the docs

* create generate_on_policy_outputs

* loss does not need labels

* use_cache is false when gradient checkpointing is True

* use self.assert

* fix toc

* generate_on_policy_outputs needs token_id

* use papers link

* teacher_model is in eval mode so no need for disabling dropout

* log completions and use_liger

* prompt from train if no eval

* fix logging and add cache empty

* add_generation_prompt=True

* fix prompts

* Update docs/source/gkd_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update examples/scripts/gkd.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* minor doc changes

* fix temp default

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* update docs

* fix dataset format

* fix dataset format

* no need for scores in generation

* teacher_model_init_kwargs

* Update _toctree.yml

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update tests/test_gkd_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/gkd_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/gkd.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* fix

* remove rich

* add deterministic test

* fix code

* use bigger teacher model

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-11 19:16:59 +02:00
642c4b1855 Remove debug and sanity_check args (#2055) 2024-09-11 17:56:02 +02:00
9a6061fc2f Clean up DPO example (#2043)
* Clean up DPO example

* Fix bs

* Remove reentrant

* Fix tests

* Nuke sanity checks

* Switch dataset

* Remove sanity check from XPO
2024-09-11 17:45:00 +02:00
a8fd6dcd17 Remove RichProgressCallback from examples (#2053)
* Disable RichProgressCallback by default in examples

* Nuke rich

* Clean
2024-09-11 16:51:05 +02:00
e2966c8d99 Integrate OrpoTrainer with PyTorchXLA for faster step time on TPUs (#2001)
* make OrpoTrainer run faster on TPU

* less data transfer

* train-trl.py

* fix

* set device_map=auto

* add is_torch_xla_available guards

* delete file

* address comments

* make presubmit

* Update transformer version in setup.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-11 15:11:28 +02:00
37934d70a9 Windows back in CI (#2051)
* Revert "Temporary pin the transformers hash in the CI (#2049)"

This reverts commit f8cf88ab6573699a1a49420f859fdf6aa2f10326.

* Update commit

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-11 14:24:07 +02:00
9c043e596b Fix logits computation in KTO trainer prediction step (#2050)
* Fix logits computation in KTO trainer prediction step

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-09-11 13:31:42 +02:00
a20e822737 Deprecate PPOTrainer (#2016)
* Promote `PPOv2Trainer` and `PPOv2Config` to top-level import

* Deprecate `PPOTrainer` and `PPOConfig`

* FutureWarning

* Update trl/trainer/ppo_config.py
2024-09-10 19:04:29 +02:00
3511856767 [XPO] xpo trainer (#1943)
* initial xpo trainer

* compute rewards and ref log probs in smaller batches

* add logging

* initial log docs

* fix global_step increment

* fix metric descriptions

* use messages API

* use training_step API

* fix logs

* add test

* add back max_new_tokens

* use max_new_tokens

* refactor

* top_k is an int

* fix formatting

* fix the loss

* fix logging

* fix logging

* fix logging

* fix loss

* calculate pi_log_ratio once

* fix stats

* fix loss

* do not log loss again

* fix docs

* add disable_dropout_in_model via flag

* comments

* revert doc change

* rm empty cache in online dpo

* improve doc xpo config

* some comment

* fix logging stats

* fix docs

* save the model

* fix model and reward model

* Update trl/trainer/xpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-09-10 16:08:30 +02:00
f8cf88ab65 Temporary pin the transformers hash in the CI (#2049)
* tmp ci fix

* Update .github/workflows/tests-main.yml

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update .github/workflows/tests.yml

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update .github/workflows/tests-main.yml

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-09-10 16:01:28 +02:00
2ee0b62cdb Change non_eos_penalty to missing_eos_penalty to be consistent across OnPolicy trainers (#2033)
* Subtract a penalty from OnPolicy Trainers if output does not contain an EOS token

* Caught a few other problems

* Updated the documentation for RLOO trainer and PPOv2Trainer

* Corrected the default type and value for missing_eos_penalty

* Made RLOO Trainer consistent with Online DPO and PPOv2

* Removed --non_eos_penalty from all documentation

* Made missing_eos_penalty examples positive (because we subtract).

* Caught two more incorrect examples

* Removed unnecessary whitespace to make ruff happy

* Update trl/trainer/utils.py

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-09-10 14:40:23 +02:00
ac071d6225 Drop canonical dataset namespaces (#2048)
* drop canonical

* Delete ultrafeedback_prompt_only.py dataset script

* reduce diff in best_of_n

* try to revert best_of_n to make github happy

* anyway...
2024-09-10 12:12:00 +02:00
72f19c3fce fix: unpacking error in Custom Mixture of Experts model when aux_loss_enabled is set to True. (#2039)
* fix: prevent unpacking error due to additional **aux_loss** returned by **concatenated_forward** function when **aux_loss_enabled** is set to True.

* Refactor: Simplify tuple unpacking in `concatenated_forward` call in `get_batch_loss_metrics` function

* Refactor: improve code quality
2024-09-09 11:47:54 +02:00
8d7b54d4bf Fix packing doc in SFTConfig and fix error when neither dataset_text_field nor formatting_func is provided. (#2035)
* fix dataset and value error in sft

* Update trl/trainer/sft_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* move the test to the right place

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-09 11:39:37 +02:00
a638f73f5c Improves formatting of docstring + newlines (#2006)
* Improves formatting of docstring + newlines

* Linting fix

* Update utils.py

* Set to "Parameters" in config files

* some fixes

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-09-09 10:26:46 +02:00
8a518ee619 Remove unused functions (#2017) 2024-09-08 14:05:46 +02:00
7a67de3c1c Fix docs formatting of `\times` sign in `kto_trainer.mdx` (#2031)
* correct formatting of star sign in kto_trainer.mdx

The "*" symbol in markdown doesn't render. I changed it to $\times$ so the mathematical formula is clearer.

* fix markdown

* one more try
2024-09-08 11:54:04 +02:00
3412f513f2 Refactor reward modelling script to work with chat models (#2026)
* Make Qwen2 work

* Make it work

* Refactor

* Add doc

* Add dataset

* Fix

* Quality
2024-09-06 13:12:38 +02:00
fc20db8873 Clean configs documentation (#1944)
* Clean BCO

* Optional[int]

* fix sft config

* alignprop config

* update tempfile to work with output_dir

* clean kto config

* intro docstring

* style

* reward config

* orpo config

* warning in trainer, not in config

* cpo config

* ppo v2

* model config

* ddpo and per_device_train_batch_size (instead of train_batch_size)

* rloo

* Online config

* tmp_dir in test_ddpo

* style

* remove to_dict and fix post-init

* batch size in test ddpo

* dpo

* style

* `Args` -> `Parameters`

* parameters

* ppo config

* don't overwrite world size

* style

* outputdir in test ppo

* output dir in ppo config

* revert non-core change (1/n)

* revert non-core changes (2/n)

* revert non-core change (3/n)

* uniform max_length

* fix uniform max_length

* beta uniform

* style

* link to `ConstantLengthDataset`

* uniform `dataset_num_proc`

* uniform `disable_dropout`

* `eval_packing` doc

* try latex and α in doc

* try title first

* doesn't work

* reorganize doc

* overview

* better latex

* is_encoder_decoder uniform

* proper ticks

* fix latex

* uniform generate_during_eval

* uniform truncation_mode

* ref_model_mixup_alpha

* ref_model_mixup_alpha and ref_model_sync_steps

* Uniform `model_init_kwargs` and `ref_model_init_kwargs`

* rpo_alpha

* Update maximum length argument names in config files

* Update loss_type descriptions in config files

* Update max_target_length to max_completion_length in CPOConfig and CPOTrainer

* Update padding value in config files

* Update precompute_ref_log_probs flag documentation

* Fix typos and update comments in dpo_config.py and sft_config.py

* post init warning for `max_target_length`
2024-09-04 10:07:49 +02:00
7acb9c2319 Feat: Add support for APO-zero in KTOTrainer (#1952)
* feat : add kto command

* feat : add support for apo loss in KTO Trainer

* feat : make kto script compatible with dpo-formatted datasets

* fix: lint data utils

* add loss_type in kto test

* fix: data utils docstrings

* fix: add dataset reformat test

* fix: lint tests

* fix: only reference kl_logps if needed

---------

Co-authored-by: Karel D'Oosterlinck <karel@contextual.ai>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-09-04 09:31:46 +02:00
684038057e Allow WinRateCallback to be used without reference model (#2013)
* tests

* make ref model optional

* style

* remove attribute error
2024-09-04 00:05:05 +02:00
1f6a1d2f9a Remove prompts arg from WinRateCallback (#2010)
* rm prompts and add doc

* proper judge type and doc

* test for callback

* style
2024-09-03 17:24:08 +02:00
d60a1f50fe [ci] pin numpy to < 2 on win (#2009) 2024-09-03 13:03:38 +02:00
728a9a3b5f [Docs] Add Liger-Kernel usage to SFTTrainer page (#2007)
* Add Liger-Kernel usage in SFTTrainer

* initial commit

* update flaws

* fix flaws

* Update sft_trainer.mdx

* Update docs/source/sft_trainer.mdx

Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

* Update docs/source/sft_trainer.mdx

Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

* Update docs/source/sft_trainer.mdx

Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>

* Update sft_trainer.mdx

---------

Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
2024-09-03 08:40:58 +02:00
850ddcf598 [pre-commit] update pre-commit yaml (#2002)
* update pre-commit yaml

* fix test

* use element_type
2024-09-02 19:15:25 +02:00
d57e4b7265 [Online-DPO] fixes to the training scripts and setup.py (#1997)
* fixes

* fixed typo

* add tests for liger

* fix imports

* class name
2024-08-30 22:05:14 +02:00
11f442fc05 move slow-tests CI to new cluster (#1996) 2024-08-30 12:29:21 +02:00
437e8ccaba Bump dev version 2024-08-29 14:39:18 +00:00
4dd0dc2988 Adds experimental Liger support to SFT script (#1992)
* adds cli and import utils

* updates SFT script

* adds liger model to trainer

* adds liger nightly dep

* precommit

* fix import

* Update trl/commands/cli_utils.py

* Fix quality

* moved use_liger arg to sft config

* remove arg

* remove use liger from sft trainer

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-08-29 14:48:35 +02:00
4f59e923ac Relax numpy upper bound and bump deepspeed version (#1990)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-08-29 13:17:48 +02:00
10f70fa333 Add ignore_index in DPOTrainer's nn.CrossEntropyLoss (#1987)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-08-28 16:41:41 +02:00
47ab034ca9 [DPO] tokenize and process DPO data via batches (#1914)
* tokenize and process DPO data via batches

* use helpers

* updated _process_tokens

* fixed

* incorporate build_tokenized_answer in the _tokenizer

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix tokenizer for is_vision_model

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* give the _tokenize the tokenizer as well as optional processor

* fix tests

* add bos and eos tokens

* add prompt_pixel_attention_mask

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* truncate by max_length

* formatting

* fix for enc-dec

* For encoder-decoder models, we need to use the prepared decoder_input_ids

* add tests for _build_tokenized_answer and _tokenize_feature

* check for EOS and BOS tokens

* formatting

* do not include pixel mask if they are not provided

* undo refactor

* undo add_bos_token_if_needed change

* refactor tokenizer into smaller helpers

* add back comments

* fix type hints

* format

* fix t5 tests

* args are never optional

* move cat to appropriate helper

* fix _truncate_tokens

* add tests for _truncate_tokens

* remove dead code

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-08-28 16:14:53 +02:00
e755eee660 Refactor Online DPO (#1839)
* online dpo trainer based on rloo trainer

* push changes

* refactor

* use `batch_generation` method

* precommit

* remove breakpoint()

* quick refactor

* push the current changes

* quick change

* refactor

* use the config name as the experiment name

* fix logging

* update online DPO docs

* use llm as a judge

* quick change

* quick fix

* cache changes

* new semantics

* style and arg order change

* rm duplicated num_epochs

* rm plot script

* num_epoch

* revert some changes

* revert changes

* revert whitespace

* rm whitespace

* revert change

* policy->model

* optional judge and reward model

* cleaning online dpo script

* warning when both reward model and judge provided

* return -1 when the judge fails

* dataset num proc

* add judges in online dpo; fix collate and process within the trainer

* lr_scheduler.step() after optimizer step

* update odpo test

* reduce nesting

* allow pickle

* generation config typing

* online dpo llm judge

* fix data collator pad token

* add space

* fix pref score

* -1 for judges

* self.model_wrapped = self.model

* onlinedpo inherits from training arguments

* num_epoch -> num_steps_in_epochs

* update -> epoch

* epoch -> step; step_in_epoch -> ppo_epoch; rm run_name

* num_steps_in_epoch -> num_ppo_epochs

* epoch_idx -> ppo_epoch_idx

* make init consistent with dpo

* try another option

* progress...

* odpo

* current progress

* log and other changes

* rename for legacy

* rename for legacy

* rename and move truncate

* rename

* new config

* LogCompletionsCallback

* style

* rename trainer

* truncate right in utils

* update example

* reward model path

* properly log

* fix example

* add generation prompt and log special tokens

* true penalty

* defaults from the paper

* Remove MPS (#1983)

* Set KV cache false when gradient checkpointing is enabled (#1984)

* Remove MPS

* Fix

* Various tweaks

* Remove padding from table

* Clean up

* Fix test

* Revert log freq

* Fix docs

* Fix tests again!

* Fix typo

* Revert

* Fix regression

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Fix DPO config test

* Fix doc tree

* Clean docs moar

* Add docstring

* raise NotImplemented error for judge

* Refactor cache clearing

---------

Co-authored-by: Michael Noukhovitch <mnoukhov@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-08-28 15:39:51 +02:00
ac31d1205e Skip the failing Online DPO test (#1989)
* Harmonisation of tests between main and PR

* disable tqdm

* skip the test

* `"Programming Language :: Python :: 3.11"` and drop 3.7

* Update .github/workflows/tests.yml

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update setup.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update .github/workflows/tests-main.yml

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-08-28 14:55:18 +02:00
c44ab6d1e9 torch.load with weights_only=True (#1988)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-28 11:13:22 +02:00
a15a80e0d5 gather the target model params as well (#1978) 2024-08-28 09:27:26 +02:00
264f1279fd Promote PairRMJudge to top-level import (#1985)
* allow `from trl import PairRMJudge`

* test_pair_rm_judge

* Update setup.py

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-27 21:04:05 +02:00
0cda2f2f01 Restore test (#1982) 2024-08-27 11:16:32 +02:00
e0ff66103e Update tests for _get_kl_dataset (#1974)
* Test for #1970

* style

* drop last element in the batch for test

* check prompt_input_ids not modified

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-27 11:00:43 +02:00
3a3ed88f28 Fix dataset_num_proc missing in PPOConfig (#1966)
* fix a few minor bugs in ppo.py

* dataset_num_proc as training arg

* num proc in config

* Update examples/scripts/ppo.py

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-08-27 10:59:45 +02:00
b65657f41d Fix flaky Hub tests (#1981)
* Fix flaky Hub tests

* Trigger Build

* test build
2024-08-27 10:14:39 +02:00
de024ece28 Use weights_only for load (#1933)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-08-26 18:18:38 +02:00
2fbc0f4fc2 Fix issue template path (#1973)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-26 14:40:37 +02:00
cf5168ea7c New mismatch pair creation strategy (#1970)
* new mismatch pair creation strategy

* style

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-26 13:29:22 +02:00
1e4fb80cbc Fix issue with unnecessary cached during logp calc. (#1969) 2024-08-26 12:38:58 +02:00
fe41acd6ae add arg padding_free to DataCollatorForCompletionOnlyLM (#1887)
* add arg `padding_free` to DataCollatorForCompletionOnlyLM

* Update tests/test_data_collator_completion_only.py

* Update trl/trainer/utils.py

* Update tests/test_data_collator_completion_only.py

* Update tests/test_data_collator_completion_only.py

* Update tests/test_data_collator_completion_only.py

* Update tests/test_data_collator_completion_only.py

* Update test_data_collator_completion_only.py

* Update tests/test_data_collator_completion_only.py

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2024-08-26 09:48:39 +02:00
c71262c9c6 Fix issue with precompute_ref_log_probs not working when rpo_alpha is None (#1961)
* Fix issue with precompute_ref_log_probs not working when rpo_alpha is None

* Test: Add test for precompute_ref_log_probs with rpo_alpha=None

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-08-25 12:15:57 +02:00
dcee683d96 Add issue/PR templates, code of conduct & better contributing guide (#1963)
* Add issue/PR templates, code of conduct & better contributing guide

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-08-23 23:12:40 +02:00
4788e5cda5 Support LLaVA-NeXT in Vision SFT (#1959)
* support llava next

* mention version for llava-next

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-23 11:37:40 +02:00
6cea2ef964 [ODPO] Refactor training script to use messages API (#1958)
* Refactor dataset prep

* Add moar doc
2024-08-22 20:03:12 +02:00
64d9816eac Fix response truncation in examples/notebooks/gpt2-sentiment.ipynb (#1957) 2024-08-22 16:22:46 +02:00
67564fdbbe "help wanted" in label to exempt from stale (#1956)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-22 11:27:37 +02:00
e529579232 Fix global step for consistent checkpointing with global updates (#1950) 2024-08-21 10:19:37 +02:00
dc4cfab700 Log WandB tables on main process (#1951) 2024-08-20 16:42:51 +02:00
66d3a82dd2 Add a simple-to-understand example for online DPO (#1947)
* Update online_dpo_trainer.md

* Update docs/source/online_dpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/online_dpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/online_dpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/online_dpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update online_dpo_trainer.md

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-08-20 16:14:40 +02:00
3eda856371 Don't mark issues as stale if nobody answered (#1949)
* don't mark issues as stale if nobody answered

* refactor

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-20 15:13:40 +02:00
616a273ac2 Fix model wrapping for online DPO (#1946) 2024-08-19 18:17:11 +02:00
9955583829 Drop token arg in push_to_hub (#1945)
* Skip token in `push_to_hub`

* fix doc

* move comment

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-19 11:34:11 +02:00
bed205a2d2 Properly tag models when pushed to 🤗 Hub (#1940)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-18 11:16:27 +02:00
42933fa647 Optional Additional Loss to Center Reward Models' Outputs (#1932)
* Implemented Eisenstein reward model centering

* Forgot self in accessing args

* Added docstring for center_rewards_coefficient.

* Fixed bug.

* Update trl/trainer/reward_config.py

Added a reference.

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Switched to Quentin's suggestion

* Update trl/trainer/reward_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* doc

* 0.01

* style

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-17 22:44:03 +02:00
bbdef00961 Fix model to save in PPOv2 (#1776)
* fix model to save in ppov2

Currently this saves self.backup_model, but it should save self.model.
self.backup_model is only a temporary model used to store the policy and
value function, whereas self.model should hold just the policy to save.
* simplified logic

* remove unused ordereddict

* format

* fix the fix

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-17 17:47:01 +02:00
0956dc17cc Add tests for DPO for VLM (#1935)
* add dpo visual test

* skip last layer of llava in test

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-16 16:29:40 +02:00
a7dc892717 Anchored preference optimization loss for DPO (#1928)
* feat: anchored pref optimization

* Update trl/trainer/dpo_trainer.py

* format and properly deprecate loss_type

* add aot in error message and reorder

* add "sppo_hard", "nca_pair" in label_smoothing warning

* add tests

* doc

* doc fixes

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-14 17:37:49 +02:00
b0372e66a5 Improve DPO/loss doc (#1929)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-14 16:52:26 +02:00
c1b272f4a6 minor BCO fixes (#1923)
* checkpointing BCO UDM classifier

* kto_config remove unused parameters

* BCO fix loading

* kto_config remove unused parameters

* kto_config remove unused parameters

---------

Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-08-14 15:27:13 +02:00
f05f63c1ea PartialState().local_main_process_first() when map in examples (#1926)
* `PartialState().local_main_process_first()` when map in examples

* allow load from cache

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-14 12:01:03 +02:00
54f806b6ff Standardize dataset_num_proc usage (#1925)
* uniform dataset_num_proc

* num_proc in shuffle

* Update examples/datasets/anthropic_hh.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/ppo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/ppo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-08-13 15:10:39 +02:00
a9a756553f Add explicit library name for TRL repos (#1922) 2024-08-13 11:36:01 +02:00
96bb3deb32 fix orpo trainer loss device (#1919) 2024-08-12 15:55:23 +02:00
dbea3da917 torch.cuda.amp.autocast() -> torch.amp.autocast("cuda") (#1921)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-12 14:43:38 +02:00
150a93101b lr_scheduler.step() call after optim.step() (#1918)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-12 14:21:50 +02:00
cbcaa46cd3 Various args and test fix (#1909)
* report to none

* simplify AlignPropTrainerTester

* rm unused marker

* Don't share setup in dpo trainer

* style

* don't share setup in test rich

* fix setup and classmethod

* fix args for sft

* test_trainer_args

* various arg fix

* report to none and vsdt simplifi

* drop generate_during_eval

* fix run_name

* style

* drop setUpClass

* style

* new ref values for ppo trainer tester

* update ref val

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-09 10:07:58 +02:00
e3fe28ee1a Fix AlignPropTrainer import (#1908) 2024-08-07 11:33:11 +02:00
fb0b9edc24 Fix GPT2 sentiment notebook reward (#1738)
* Fix reward change

* clean up notebook

* fix eval metric

* regenerate output with correct model

* swap wrong operation order

* Update gpt2-sentiment.ipynb

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-08-06 22:19:05 +02:00
fc76fe8d11 [Online-DPO] num_generation_per_prompt is fixed (#1898)
* num_generation_per_prompt is fixed

* remove unused no_grads

* removed bin

* fix scores

* fix scores

* formatting

* undo
2024-08-06 18:21:35 +02:00
b60ce797d8 Support Rank Stabilized LoRA in the ModelConfig/LoraConfig (#1877)
* feat: support RS-LoRA in the ModelConfig

* build: bump minimum peft version to support rslora

* test: add test for get_peft_config

* test: make test python 3.8 friendly

* rm unused marker

* minor changes

* simplify, clarify doc

* update deps (peft in test)

* re-ordering

* fix setup

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-06 18:02:59 +02:00
6faf4c0d81 [RPO] use loss from v3 of paper (#1904)
* RPO loss from v3

* Update trl/trainer/dpo_config.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* fix docs

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-08-06 16:28:46 +02:00
29bd0046a9 fix process orpo example (#1903)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-06 12:57:11 +02:00
4867c2a3db Support IterableDataset for SFTTrainer (#1899)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-05 18:04:17 +02:00
332062372d Drop setUpClass in reward tester (#1895)
* drop setUp class in reward tester

* report to none

* style

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-05 16:01:43 +02:00
b580e45c94 [WIP] Drop save/load test on windows (#1897)
* just test modelling

* Trigger CI

* always trigger

* only test_from_save_trl

* parametrize

* just one model

* file

* rm ref model

* assert exists

* style

* Update Makefile

* Update tests.yml

* Update Makefile

* Update test_modeling_value_head.py

* Update test_modeling_value_head.py

* skip windows

* skip test_from_save_transformers

* also skip test_from_save_trl

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-05 16:01:06 +02:00
2004d62c5c fix serialization of RunningMoments on multiple GPUs (#1892)
Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-08-04 10:57:28 +02:00
ac7c8b1284 evaluation_strategy -> eval_strategy (#1894)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-02 16:01:35 +02:00
df12913602 Fix SFT for VLM example (#1865)
* fix vsft example commands

* fix use_cache and get tokenizer from processor

* rm unused AutoTokenizer

* Squashed commit of the following:

commit 8bd2ab82f4cedc8b3459126aa145c63180078392
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Sun Jul 28 14:06:19 2024 +0200

    Refactor judges (#1856)

    * BaseJudge -> BasePairwiseJudge

    * hf judge asyncio

    * refactor judges

    * doc

    * doc

    * doc

    * member judge

    * :inherited-members:

    * :inherited-members:

    * doc

    * give up

    * judge tldr with judge class

    * fix rank in multithread

    * format

    * improve doc

    * update doc

    * typo doc

    * doc online dpo

    * Update judge_tldr.py

    ---------

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 82b07d6b0169bb8150f2fa4ee0a58b678d597163
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Jul 26 11:43:48 2024 +0200

    Llama in modelling value head tests (#1878)

commit 72bf6c21beedd95b1deb1ff95bd4d1bad5380503
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Jul 26 11:33:07 2024 +0200

    Skip BigBird save and load test until next transformers version (#1874)

commit 74e54b5946b3e46c9fef516b6f5403943c7c4096
Author: Edward Beeching <edbeeching@users.noreply.github.com>
Date:   Fri Jul 26 09:36:25 2024 +0200

    fix online dpo example (#1879)

commit 393097356c3494a1310cd59b0205358723468443
Author: Rishav Dash <57321948+Rishav-hub@users.noreply.github.com>
Date:   Thu Jul 25 14:17:37 2024 +0530

    Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM (#1861)

    * Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM

    Added ```dataset_text_field``` in the SFTConfig while training

    * Update docs/source/sft_trainer.mdx

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit db8e09e3463837d6f80d593f2806c0d83d97c787
Author: Rishav Dash <57321948+Rishav-hub@users.noreply.github.com>
Date:   Thu Jul 25 14:06:57 2024 +0530

    Import missing ```setup_chat_format``` (#1862)

commit 1dae55f90f6e929500df4fc4ee5bbc0146e35574
Author: elie <97572401+eliebak@users.noreply.github.com>
Date:   Thu Jul 25 10:27:34 2024 +0200

    add fsdp_qlora config and bnb_4bit_quant_storage (#1863)

commit c8cef79e6c895c9950ad7af61897f3a89372c56d
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jul 24 21:06:57 2024 +0200

    arXiv to HF Papers (#1870)

commit 7dcf437a1997cb1b252e8ea0b8ad7dca13261d7e
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date:   Wed Jul 24 12:27:50 2024 +0200

    [online-DPO] online dpo cleanups (#1864)

    * online dpo cleanups

    * remove unused self.policy

    * add OnlineDPOTrainer and config to __init__.py

    * import from trainer

    * online dpo test

    * rename policy to model and ref_policy to ref_model

    * renamed internally

    * formatting

commit 4e85bd75a9dfca0074eef3a90130054c283eed39
Author: Costa Huang <costa.huang@outlook.com>
Date:   Thu Jul 18 14:35:31 2024 -0400

    Online DPO and Online trainer refactor (#1809)

    * online dpo trainer based on rloo trainer

    * push changes

    * refactor

    * use `batch_generation` method

    * precommit

    * remove breakpoint()

    * quick refactor

    * push the current changes

    * quick change

    * refactor

    * use the config name as the experiment name

    * fix logging

    * update online DPO docs

    * push docs

    * increment global step so tensorboard works again.

    * precommit

    * remove unused common online trainer

    * add online DPO docs

    * quick refactor

    * push changes

    * Update docs/source/online_dpo_trainer.md

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

    ---------

    Co-authored-by: Michael Noukhovitch <mnoukhov@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit c9d56366ede5990d690f3b7a3f249c434f3633d6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jul 18 18:28:49 2024 +0200

    rm token (#1852)

* add section in doc

* Squashed commit of the following:

commit 890232fa2861c40d46adeaf975a4209eb04fe841
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Jul 30 14:29:47 2024 +0200

    update example overview (#1883)

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 9929370dee9975f1c6d80b32198ea3e7fd0dcc06
Author: Clara Pohland <54847419+claralp@users.noreply.github.com>
Date:   Sun Jul 28 21:10:08 2024 +0200

    Move BCO to separate BCOTrainer with fixes (#1869)

    * kto_trainer: skip KL data for BCO

    * kto_trainer: BCO allow no positives or no negatives in batch

    * kto_trainer: make RunningMoments object serializable

    * add BCOTrainer

    * fix BCO UDM for not interleaved data

    * kto_trainer: remove unused UDM part

    * bco_trainer: add tests and docs, minor fixes

    * code style fixes

    * Update docs/source/bco_trainer.mdx

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * fix BCO UDM for bfloat16

    * Update trl/trainer/bco_config.py

    * Update trl/trainer/bco_config.py

    Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

    * Update trl/trainer/utils.py

    Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

    * Update trl/trainer/bco_trainer.py

    Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

    * Update trl/trainer/bco_config.py

    * Update _toctree.yml

    * Update trl/trainer/bco_config.py

    * Update trl/trainer/bco_trainer.py

    * RunningMoments, fix multi GPU serialization

    * fix tests

    ---------

    Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
    Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

commit 6171cddee5165869af8b40b526476680cebe47ef
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Sun Jul 28 15:51:38 2024 +0200

    Re-add BigBird Pegasus save/load test (#1882)

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 33d2151f4fa37728fea9448420301a1380fee745
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Sun Jul 28 15:07:10 2024 +0200

    Re-add BigBird Pegasus save/load test (#1876)

    * skip bigbird in ci

    * readd big bird test

    * pytest parametrize

    * dont check the version

    * rm model name

    * re add big bird

    * Merge branch 'main' into readd-bigbird-save-load-test

    ---------

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 8bd2ab82f4cedc8b3459126aa145c63180078392
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Sun Jul 28 14:06:19 2024 +0200

    Refactor judges (#1856)

    * BaseJudge -> BasePairwiseJudge

    * hf judge asyncio

    * refactor judges

    * doc

    * doc

    * doc

    * member judge

    * :inherited-members:

    * :inherited-members:

    * doc

    * give up

    * judge tldr with judge class

    * fix rank in multithread

    * format

    * improve doc

    * update doc

    * typo doc

    * doc online dpo

    * Update judge_tldr.py

    ---------

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 82b07d6b0169bb8150f2fa4ee0a58b678d597163
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Jul 26 11:43:48 2024 +0200

    Llama in modelling value head tests (#1878)

commit 72bf6c21beedd95b1deb1ff95bd4d1bad5380503
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Jul 26 11:33:07 2024 +0200

    Skip BigBird save and load test until next transformers version (#1874)

commit 74e54b5946b3e46c9fef516b6f5403943c7c4096
Author: Edward Beeching <edbeeching@users.noreply.github.com>
Date:   Fri Jul 26 09:36:25 2024 +0200

    fix online dpo example (#1879)

commit 393097356c3494a1310cd59b0205358723468443
Author: Rishav Dash <57321948+Rishav-hub@users.noreply.github.com>
Date:   Thu Jul 25 14:17:37 2024 +0530

    Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM (#1861)

    * Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM

    Added ```dataset_text_field``` in the SFTConfig while training

    * Update docs/source/sft_trainer.mdx

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit db8e09e3463837d6f80d593f2806c0d83d97c787
Author: Rishav Dash <57321948+Rishav-hub@users.noreply.github.com>
Date:   Thu Jul 25 14:06:57 2024 +0530

    Import missing ```setup_chat_format``` (#1862)

commit 1dae55f90f6e929500df4fc4ee5bbc0146e35574
Author: elie <97572401+eliebak@users.noreply.github.com>
Date:   Thu Jul 25 10:27:34 2024 +0200

    add fsdp_qlora config and bnb_4bit_quant_storage (#1863)

commit c8cef79e6c895c9950ad7af61897f3a89372c56d
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jul 24 21:06:57 2024 +0200

    arXiv to HF Papers (#1870)

commit 7dcf437a1997cb1b252e8ea0b8ad7dca13261d7e
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date:   Wed Jul 24 12:27:50 2024 +0200

    [online-DPO] online dpo cleanups (#1864)

    * online dpo cleanups

    * remove unused self.policy

    * add OnlineDPOTrainer and config to __init__.py

    * import from trainer

    * online dpo test

    * rename policy to model and ref_policy to ref_model

    * renamed internally

    * formatting

commit 4e85bd75a9dfca0074eef3a90130054c283eed39
Author: Costa Huang <costa.huang@outlook.com>
Date:   Thu Jul 18 14:35:31 2024 -0400

    Online DPO and Online trainer refactor (#1809)

    * online dpo trainer based on rloo trainer

    * push changes

    * refactor

    * use `batch_generation` method

    * precommit

    * remove breakpoint()

    * quick refactor

    * push the current changes

    * quick change

    * refactor

    * use the config name as the experiment name

    * fix logging

    * update online DPO docs

    * push docs

    * increment global step so tensorboard works again.

    * precommit

    * remove unused common online trainer

    * add online DPO docs

    * quick refactor

    * push changes

    * Update docs/source/online_dpo_trainer.md

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

    ---------

    Co-authored-by: Michael Noukhovitch <mnoukhov@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit c9d56366ede5990d690f3b7a3f249c434f3633d6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jul 18 18:28:49 2024 +0200

    rm token (#1852)

* simplify script

* doc

* use training args

* args instead of training args

* fix doc

* drop eval

* rm eval section

* re-add bigbird

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-08-02 10:31:51 +02:00
ddf4c8dc3e fix dpo_trainer bug for LLMs without bos_token in config (#1885)
* fix dpo_trainer bug for LLMs without bos_token in config

* fix adding bos_token_id bug in dpo,orpo,cpo trainers

* formatting for fixing bos_token adding bug

* Update trl/trainer/utils.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-07-31 12:42:06 +02:00
890232fa28 update example overview (#1883)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-30 14:29:47 +02:00
9929370dee Move BCO to separate BCOTrainer with fixes (#1869)
* kto_trainer: skip KL data for BCO

* kto_trainer: BCO allow no positives or no negatives in batch

* kto_trainer: make RunningMoments object serializable

* add BCOTrainer

* fix BCO UDM for not interleaved data

* kto_trainer: remove unused UDM part

* bco_trainer: add tests and docs, minor fixes

* code style fixes

* Update docs/source/bco_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* fix BCO UDM for bfloat16

* Update trl/trainer/bco_config.py

* Update trl/trainer/bco_config.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/utils.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/bco_trainer.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/bco_config.py

* Update _toctree.yml

* Update trl/trainer/bco_config.py

* Update trl/trainer/bco_trainer.py

* RunningMoments, fix multi GPU serialization

* fix tests

---------

Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Seungjae Jung <seanexplode@gmail.com>
2024-07-28 21:10:08 +02:00
6171cddee5 Re-add BigBird Pegasus save/load test (#1882)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-28 15:51:38 +02:00
33d2151f4f Re-add BigBird Pegasus save/load test (#1876)
* skip bigbird in ci

* readd big bird test

* pytest parametrize

* dont check the version

* rm model name

* re add big bird

* Merge branch 'main' into readd-bigbird-save-load-test

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-28 15:07:10 +02:00
8bd2ab82f4 Refactor judges (#1856)
* BaseJudge -> BasePairwiseJudge

* hf judge asyncio

* refactor judges

* doc

* doc

* doc

* member judge

* :inherited-members:

* :inherited-members:

* doc

* give up

* judge tldr with judge class

* fix rank in multithread

* format

* improve doc

* update doc

* typo doc

* doc online dpo

* Update judge_tldr.py

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-28 14:06:19 +02:00
82b07d6b01 Llama in modelling value head tests (#1878) 2024-07-26 11:43:48 +02:00
72bf6c21be Skip BigBird save and load test until next transformers version (#1874) 2024-07-26 11:33:07 +02:00
74e54b5946 fix online dpo example (#1879) 2024-07-26 09:36:25 +02:00
393097356c Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM (#1861)
* Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM

Added ```dataset_text_field``` in the SFTConfig while training

* Update docs/source/sft_trainer.mdx

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-07-25 10:47:37 +02:00
db8e09e346 Import missing ``setup_chat_format`` (#1862) 2024-07-25 10:36:57 +02:00
1dae55f90f add fsdp_qlora config and bnb_4bit_quant_storage (#1863) 2024-07-25 10:27:34 +02:00
c8cef79e6c arXiv to HF Papers (#1870) 2024-07-24 21:06:57 +02:00
7dcf437a19 [online-DPO] online dpo cleanups (#1864)
* online dpo cleanups

* remove unused self.policy

* add OnlineDPOTrainer and config to __init__.py

* import from trainer

* online dpo test

* rename policy to model and ref_policy to ref_model

* renamed internally

* formatting
2024-07-24 12:27:50 +02:00
4e85bd75a9 Online DPO and Online trainer refactor (#1809)
* online dpo trainer based on rloo trainer

* push changes

* refactor

* use `batch_generation` method

* precommit

* remove breakpoint()

* quick refactor

* push the current changes

* quick change

* refactor

* use the config name as the experiment name

* fix logging

* update online DPO docs

* push docs

* increment global step so tensorboard works again.

* precommit

* remove unused common online trainer

* add online DPO docs

* quick refactor

* push changes

* Update docs/source/online_dpo_trainer.md

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Michael Noukhovitch <mnoukhov@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-07-18 14:35:31 -04:00
c9d56366ed rm token (#1852) 2024-07-18 18:28:49 +02:00
4dce042a38 Add WinRateCallback and Judges (#1598)
* Add WinRateCallback

* Enable PairRM

* Refactor

* Streamline

* Add HF judge

* Add base judge

* Use better prompt

* Clean

* Add max tokens

* Use logging

* Add batched inference

* Squashed commit of the following:

commit 9e9dc96e676a3601882b5cf11842bd22267fd2c5
Author: Maxim Kopecki <kopecki.maxim@gmail.com>
Date:   Wed Jul 10 19:11:13 2024 +0200

    Added missing token kwarg in Peft model loading (#1825)

commit 7ddef5c1582f14f32b6dd692f8e4b904fd478038
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jul 10 18:26:11 2024 +0200

    Make use of `trust_remote_code` consistent (#1806)

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit a9cddf8c55a0b2af101a3d18bd92f263f4ae4500
Author: Adnan Khan <AdnaneKhan@users.noreply.github.com>
Date:   Wed Jul 10 11:25:07 2024 -0400

    Delete unused benchmark.yml workflow. (#1822)

commit 2860ce5091e689bab167454453e9ddbe2337de3d
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Jul 9 09:22:52 2024 +0200

    DPO Llava 1.5 and PaliGemma support (#1797)

    * llava support dpo

    * add_special_tokens=False only when possible

    * format

    * pali gemma

    * refactor size

    * remove image resize

    ---------

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 30e33bd92da1f5569493e16da8971247cc376927
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Jul 9 05:37:12 2024 +0200

    upgrade gh actions (#1818)

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit d5a0d2d345ec26646ceaa06adfe6133aad18702a
Author: Costa Huang <costa.huang@outlook.com>
Date:   Mon Jul 8 11:12:41 2024 -0400

    Set dev version (#1817)

commit 314e8eb367cbfaf74c2e9717085346360e779508
Author: Puneet Singh Bhooi <puneetb@iiitd.ac.in>
Date:   Mon Jul 8 19:11:36 2024 +0530

    fix broken url in `docs\source\index.mdx` (#1813)

commit e10792032be644a65dcbcf2ebe9ec947497d4d46
Author: Costa Huang <costa.huang@outlook.com>
Date:   Mon Jul 8 09:38:09 2024 -0400

    0.9.6 release (#1816)

commit 78045dedc8678af04f4e35ffe63f37be196a435b
Author: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Date:   Mon Jul 8 01:59:26 2024 +0200

    Fix `TRL_USE_RICH` environment variable handling (#1808)

    * Add `strtobool` custom implementation from `distutils`

    * Fix `TRL_USE_RICH` handling via `strtobool`

    * Run `make precommit`

commit 747612f9d3063de56b6524e5feb0c9feab21d4c4
Author: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Date:   Fri Jul 5 16:28:59 2024 +0200

    Fix `torch_dtype` handling in `{DPO,SFT}Trainer` when provided via CLI (#1807)

    * Fix `torch_dtype` handling through CLI

    The `torch_dtype` is not properly handled when provided via the TRL CLI
    since it's provided initially as a string, but is then cast to
    `torch.dtype` before providing it to the `{DPO,SFT}Trainer`, which means
    that those trainers should handle the scenario where `torch_dtype` is a
    `torch.dtype` too.

    * Add `torch_dtype` tests in `test_{dpo,sft}_trainer.py`

    * Forward contribution credits

    * Run `make precommit`

    ---------

    Co-authored-by: Tash Srivastava <yash-srivastava19@users.noreply.github.com>

commit 9e3a35bd3d85ee506d180120f01bde2229b60265
Author: Michael <mnoukhov@gmail.com>
Date:   Fri Jul 5 07:29:48 2024 -0400

    Remove extra print in reward_trainer.py (#1799)

    `print_rich_table` is called twice and the first call doesn't restrict to `num_print_samples`. Remove the first, extra call

commit 4402b36dcf79a0921a858c77375cfbb285d603c7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jul 4 14:29:25 2024 +0200

    clean examples (#1791)

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 78f8228874d5cf9c0e68952533cb377202e1eb22
Author: Noah Tye <hi@noahtye.com>
Date:   Wed Jul 3 11:10:50 2024 -0700

    Bugfix: Preserve token fields when converting TrainingArguments to SFTConfig (#1794)

    * Preserve token fields when converting TrainingArguments to SFTConfig

    TrainingArguments.to_dict() redacts token fields, so we have to
    individually copy them over when converting to SFTConfig to avoid
    breaking push_to_hub functionality.

    Also adds a test.

    * run precommit

    * one-line args_as_dict definition per suggestion from kashif

    * generalize token copying to match TrainingArguments behavior

    * unwrap |= on dict, to support python 3.8

    * use .update instead of |= or for-loop

commit b6af2edc93b275afcee22a3eb71f9a5702ff9fd8
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date:   Wed Jul 3 08:29:16 2024 +0200

    add model_init_kwargs to training_args (#1787)

commit cd85b14fbbaf7e4d9b01ef8ec19655666af20047
Author: Tommaso Buonocore <buonocore.tms@gmail.com>
Date:   Sat Jun 29 15:35:48 2024 +0200

    Fixed typo in SFT trainer docs (#1788)

    'STFConfig' instead of 'SFTConfig' appears multiple times in the doc, causing errors when running the code snippets.

commit a57544f47a2fbc4940b4d49dde32f54406398c91
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date:   Thu Jun 27 15:47:58 2024 +0200

    fix docs and examples (#1780)

commit b68ff96f0c74368961e194081e122959cd1f4d4d
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jun 26 16:26:37 2024 +0200

    Visual DPO (#1647)

    * Remove extra whitespaces

    * idefics

    * vdpo

    * sft idefics

    * pad with test

    * use prompt instead of tokenizer

    * rm name main

    * support vlm in tokenize row

    * temp fix for regex in lora_target_module

    * format

    * vdpo

    * tmp float16 hard code

    * concatenated_forward support for vision

    * style and new command line

    * all-linear

    * format

    * delete old examples

    * get image

    * upcast

    * new test

    * modified test

    * new strat for tokenizer

    * rm token transfer

    * integrate vision in dpo example

    * format

    * add FDivergenceType back

    * precommit

    * pillow test dep

    * optional prompt

    * `evaluation_strategy` to `eval_strategy`

    * revert vsft change (oos)

    * update test

    * test

    * comment and support more in process

    * update process

    * update doc for vdpo

    * caution about limited support

    * Update docs/source/dpo_trainer.mdx

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * revert DPO example changes

    * cleaner way to check if a model is vision

    * comment

    * update vdpo example

    * rename

    ---------

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit c8c01cc05569f5ffea6726b2111f799a63e03aaa
Author: Mubin Manasia <48038715+Mubin17@users.noreply.github.com>
Date:   Wed Jun 26 03:23:36 2024 -0600

    Fix Documentation Overflow Issues for Long URLs in SFTConfig (#1774)

    * Update sft_config.py

    * Update sft_config.py

commit 3479606c8c6dbb5da96e4990b491e63a48fc7483
Author: Costa Huang <costa.huang@outlook.com>
Date:   Wed Jun 26 03:18:22 2024 -0400

    Remove the leading space in the tldr preference dataset (#1773)

commit 7965b7834052ab3d60a1cc5de382e2f56b3772e7
Author: Haozhe Ji <jihaozhe@gmail.com>
Date:   Tue Jun 25 22:47:32 2024 +0800

    add Efficient Exact Optimization (EXO) (#1735)

    * add exo

    * fix a detail

    * Update trl/trainer/dpo_trainer.py

    * Update trl/trainer/dpo_trainer.py

    * Update trl/trainer/dpo_trainer.py

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit 56bd1bba26ac52aad976c1a1a0b3d9e1137b18c7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Jun 25 16:14:26 2024 +0200

    `evaluation_strategy` to `eval_strategy` (#1771)

    Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>

commit 94d53e6617edc6434a38b2ac51c21e5da3329cda
Author: Clara Pohland <54847419+claralp@users.noreply.github.com>
Date:   Mon Jun 24 21:27:00 2024 +0200

    MoE Models: option to add load balancing loss (#1765)

    * KTO: add aux loss

    * use router_aux_loss_coef in KtoTrainer when aux_loss enabled

    * align optional aux_loss in DPO, KTO, CPO, ORPO

    * precommit changes

    * fix KL forward kwargs

    * add aux_loss docs entry

    * apply docs suggestions

    ---------

    Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>

commit b5be100ae0b37d743cd49435297f917eb54a0574
Author: Mihir Prabhudesai <mihirp1998.mp@gmail.com>
Date:   Mon Jun 24 12:05:44 2024 -0400

    Added Reward Backpropagation Support (#1585)

    * added alignprop template

    * added alignprop support

    * Update alignprop_trainer.mdx

    * Update alignprop_trainer.mdx

    * added better why statement

    * fixed inference code

    * changed self to pipeline

    * removed aesthetic classifier

    * added aesthetic to auxiliary models

    * added unseen prompt logging

    * removed unseen prompt log

    * fixed minor

    * remove not needed import in trl/__init__.py

    Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

    * fixed styling

    * updated _toctree

    ---------

    Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

commit 6e1652bc5e8ff6d348c7f06048f4102a050f1544
Author: Haoran Xu <45837851+fe1ixxu@users.noreply.github.com>
Date:   Sun Jun 23 09:54:30 2024 -0700

    Add CPO-SimPO method (#1760)

    * enable cpo-simpo

    * highlight SimPO and CPO-SimPO

    * add test for cpo_alpha

    * formatting

    * Update docs/source/cpo_trainer.mdx

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit 65374c6a711709157ea59297dce43dfb458d1c78
Author: Costa Huang <costa.huang@outlook.com>
Date:   Fri Jun 21 11:20:54 2024 -0400

    New sentiment and descriptiveness dataset (#1757)

    * push changes

    * handle edge cases where the chosen and the rejected are the same

commit 99560911123f739226b77813f27d5c90ed7f9ba2
Author: Juyoung Suk <scottsuk0306@gmail.com>
Date:   Fri Jun 21 18:01:08 2024 +0900

    Add dataset_text_field in examples/scripts/sft.py (#1758)

commit 34d273f227b30507c6d94ff1f93b6939794f38a3
Author: Costa Huang <costa.huang@outlook.com>
Date:   Thu Jun 20 13:16:43 2024 -0400

    Support num_train_epochs (#1743)

    * add a test case for num_train_epochs

    * fix ci

    * quick change

    * disable push to hub

    * debug windows ci

    * try another fix

    * skip subprocess tests on windows

commit 3bf94492a8dc84ac192f7c5206553e1460f53aa4
Author: Mert Sayar <mert.sayar@gmail.com>
Date:   Thu Jun 20 18:22:20 2024 +0300

    Fix masking of response tokens (#1718)

    The current handling of `response_masks` inside the `batch_forward_pass`
    function does not take padding into consideration, which results in a
    shape mismatch during masking. Since the response mask is a mask tensor
    over the response tokens, the response mask should not be concatenated
    with `torch.zeros(query_length)`, and the masking operation should be
    done without slicing.

    Remove the concatenation of the response mask and remove the slicing from
    the response mask, since the response mask already has the length of
    `end - start + 1`, which is equal to the length of `masks[j, start:end]`.

commit ba6abee37f0f0463f6d891d63d0c2242039fc8ec
Author: idanshen <49375140+idanshen@users.noreply.github.com>
Date:   Thu Jun 20 09:14:16 2024 -0400

    Support for returning past_key_values from the model (#1742)

    * add support for returning past_key_values from the model

    * change order of  keys

commit a57e75967c2b787f42f4e402ed7ca23cd9bad9a9
Author: 1485840691 <110707330+1485840691@users.noreply.github.com>
Date:   Wed Jun 19 18:02:51 2024 +0800

    Integrate f-divergence to DPO (Follow up) (#1610)

    * Step 1: update ppo_trainer and hello_world example

    * Step 2: Refine comments and add parameter type

    * Step 2: Add missing parameter comments

    * Step 1: Organize ptx loss into a function and add ptx_loss to train_stats

    * Step 1 updates: add comment to ptx_loss function, fix a bug and add warning message

    * Step 2: 1) Add ppo_ptx training example as ppo; 2) separate pretrain data fetch and iterate

    * Step 2: Remove loss from columns_to_log in ppo_ptx example

    * Remove dataset revision when loading the imdb dataset

    * Run pre-commit and fix format issues

    * Initial draft of f-divergence fn

    * Update f-divergence to avoid overflow

    * fix test errors and comments

    * Add Unit tests for dpo loss with alpha and js div f

    * Adjust format

    * Fix test error

    * Reverse this update

    * Add test cases

    * Reverse un-needed updates

    * Update code style

    * Try to fix code fmt error

    * remove extra end line

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit ae23d40f3b4d91d60a6153825ecf0319449d34b1
Author: Shihyueh Hsu <66808901+AIR-hl@users.noreply.github.com>
Date:   Tue Jun 18 22:07:24 2024 +0800

    change the `process` function in the example of DPO (#1753)

    * change the `process` function in the example of DPO

    * fix

commit 83b367b11a308b488ff9ddcf19cf4cfd6a7db642
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Tue Jun 18 11:31:17 2024 +0200

    CI / `KTOTrainer`: Remove old tests (#1750)

    * remove old tests

    * remove datasets

    * Update test_dpo_trainer.py

    * Update test_dpo_trainer.py

commit d1ed730ab8281b1b0c78d7d61bc0f6603a9ce958
Author: Michael <mnoukhov@gmail.com>
Date:   Mon Jun 17 10:50:21 2024 -0400

    prepare deepspeed to accommodate fp16 and bf16 (#1728)

    * prepare deepspeed to accommodate fp16 and bf16

    * precommit

commit 8f8e95e25d10c433cc1f2f8c7dcfed218bb13ac7
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Mon Jun 17 16:49:00 2024 +0200

    CPO / DPO: Fix red CI (#1749)

    * fix red CI

    * precommit

commit 4e23d958f20fd4fdd795cb06c2cdb7ebea704855
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Mon Jun 17 16:41:36 2024 +0200

    fix red CI

commit 50c46205b6fe741f11959adf7ec9cc0386f406bc
Author: Kawin <kawin.ethayarajh@gmail.com>
Date:   Mon Jun 17 07:14:44 2024 -0700

    small KTO fixes (#1734)

    * add warning for imbalanced data

    * update documentation

    * update script commands to be same as in dpo

    * use batch_size KL examples and batch_size target examples to calculate batch_size losses

    * fix deepspeed issue

    * speed up forward with no_grad for KL

    * add some removed metrics

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    add reference to paper

    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * add more detailed comments

    * convert assert to ValueError

    * Update kto_trainer.py

    * precommit formatting

    * remove nans in metrics by gathering across machines

    * fix formatting

    * fix choice of mismatched examples for KL term

    * describe weights

    * fix hanging issue in distributed training

    * linting

    * move metrics to cpu

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    * remove kto_pair

    * speed up data processing

    * move bco code inside

    * raise error for kto_pair argument

    * fix formatting

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
    Co-authored-by: Winnie Xu <winnie.xu97@gmail.com>

commit 6105d03f92e7069ffaa565d05418dec371569e6a
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Mon Jun 17 16:01:06 2024 +0200

    `TrlParser`: Add ignore extra args option (#1748)

    * add ignore extra args option

    * Update trl/commands/cli_utils.py

commit e247bbd7d5f57f8012ca71cfef6ad6a589874c34
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Mon Jun 17 15:16:07 2024 +0200

    CI / core: Pin `numpy` to `!=2.0.0` for CI and to users (#1747)

    * Update setup.py

    * Update setup.py

    * Update setup.py

    * Update test_best_of_n_sampler.py

    dummy commit

    * pin numpy

    * Update tests/test_best_of_n_sampler.py

    * Update setup.py

commit 3d044961960a2ab1ec1f51cfe62c6bf6b9a94807
Author: Michael <mnoukhov@gmail.com>
Date:   Mon Jun 17 08:43:33 2024 -0400

    better trl parser with yaml config (#1739)

    * working trl parser with config

    correctly overrides yaml config with command line arguments
    adds return_remaining_strings
    when return_remaining_strings is False, raises error if yaml contains
    extra args that are not in the dataclasses
    simpler and cleaner than previous yaml parsing and merging
    addresses #1733

    * lowercase trlparser
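Note: the override behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual `TrlParser` code: `ScriptArgs` and `parse_with_config` are hypothetical names, and the YAML file is assumed to be already loaded into a plain dict.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ScriptArgs:  # hypothetical dataclass, not a TRL class
    learning_rate: float = 1e-4
    num_epochs: int = 1


def parse_with_config(config, argv, return_remaining_strings=False):
    """Merge an already-loaded config mapping with CLI arguments (CLI wins)."""
    known = {f.name: f for f in fields(ScriptArgs)}
    extra = {k: v for k, v in config.items() if k not in known}
    if extra and not return_remaining_strings:
        # Config keys that match no dataclass field are treated as an error.
        raise ValueError(f"Unknown config keys: {sorted(extra)}")

    parser = argparse.ArgumentParser()
    for name, f in known.items():
        # Config values replace dataclass defaults; explicit CLI flags replace both.
        parser.add_argument(f"--{name}", type=f.type, default=config.get(name, f.default))
    args = ScriptArgs(**vars(parser.parse_args(argv)))

    if return_remaining_strings:
        remaining = [item for k, v in extra.items() for item in (f"--{k}", str(v))]
        return args, remaining
    return args
```

With `{"learning_rate": 0.01}` as the config and `["--learning_rate", "0.5"]` on the command line, the resulting `learning_rate` is `0.5`.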

commit 2d244f8acb204cb2ddb83a4ef017ca4b1f2d366a
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Mon Jun 17 11:56:13 2024 +0200

    Workflow: Notify tests results on slack channel (#1744)

    * Update tests-main.yml

    * Update docker-build.yml

commit f5168fdbaf9cbf6a3f1bdc64dc44b9db3a9ae333
Author: Igor Melnyk <igoraries@gmail.com>
Date:   Wed Jun 12 05:54:54 2024 -0400

    adds AOT (#1701)

    * adds AOT

    * Applied format changes

    * added docs and tests

    ---------

    Co-authored-by: Igor Melnyk <igor.melnyk@ibm.com>

commit 79686e1ac701b1f5e3709a65efa8f13363bcde06
Author: jetlime <paul.houssel@yahoo.de>
Date:   Wed Jun 12 00:35:31 2024 +1000

    ktotrainer: Refuse datasets which contain only one class of labels (#1724)

    * ktotrainer: refuse dataset which contain only one class of labels

    * ktotrainer: document new dataset constraint

commit 34ebc4ccaf376c862a081ff4bb0b7e502b17b2fb
Author: Luc Georges <McPatate@users.noreply.github.com>
Date:   Mon Jun 10 11:17:54 2024 +0200

    feat(ci): add trufflehog secrets detection (#1721)

    * feat(ci): add trufflehog secrets detection

    * fix(ci): remove unnecessary permissions

commit 1d84e2b888ea0f3c1ce9c5c175f7f680d85273a8
Author: Michael <mnoukhov@gmail.com>
Date:   Fri Jun 7 11:42:08 2024 +0200

    Fix default padding_value in dpo_config.py (#1692)

    The default padding value in dpo_config should be None, not 0; otherwise it overrides the padding value of any tokenizer with 0 by default.
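Note: a minimal sketch of why a `None` default matters here. `ToyDPOConfig`, `ToyTokenizer` and `resolve_padding_value` are hypothetical stand-ins for illustration, not the real TRL or tokenizer classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDPOConfig:  # hypothetical stand-in, not the real DPOConfig
    padding_value: Optional[int] = None  # None means "defer to the tokenizer"


@dataclass
class ToyTokenizer:  # hypothetical stand-in for a tokenizer
    pad_token_id: int


def resolve_padding_value(config, tokenizer):
    # Only an explicitly set value overrides the tokenizer's pad token id.
    if config.padding_value is not None:
        return config.padding_value
    return tokenizer.pad_token_id
```

With a default of `0` instead of `None`, the first branch would always be taken and the tokenizer's own pad token id would never be used.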

commit 2f71b8b1e2e54184cc278f267cca1bda051f68ea
Author: Michael <mnoukhov@gmail.com>
Date:   Fri Jun 7 10:37:27 2024 +0200

    fix yaml parser for derived config classes (#1713)

    fixes #1712
    reformatted cli_utils with ruff

commit 5bcb8ad0d6eaee1b1d2f993380100c37c4421fd0
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date:   Fri Jun 7 08:48:17 2024 +0100

    RDPO fix nll loss (#1705)

commit b8b972fde183ec036885738e1439cd99877c2ad5
Author: Haoran Xu <45837851+fe1ixxu@users.noreply.github.com>
Date:   Thu Jun 6 14:06:47 2024 -0700

    Add a variant of CPO, SimPO (#1703)

    * add a variant of cpo: simpo

    * correct cpo-simpo loss

    * avoid 0 int error in logging

    * add simpo description

    * Update trl/trainer/cpo_trainer.py

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * fix formatting

    * add test for simpo

    * Update docs/source/cpo_trainer.mdx

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    * add a docstring for simpogamma

    * move simpo description to the above docstring

    * change simpo description in the doc

    * formatting

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit 3eb9ccb104e2c46360adb937f3f25871c167eb90
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Thu Jun 6 19:33:20 2024 +0200

    set dev version (#1710)

    * Update setup.py

    * Update __init__.py

commit 974b0d380f12c357b70265c5f2dd2c8cb39a6a3e
Author: Costa Huang <costa.huang@outlook.com>
Date:   Thu Jun 6 10:13:00 2024 -0400

    0.9.4 release (#1708)

commit 39a7d1c121d26224fd7455d3d2038e0d20831c54
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Thu Jun 6 15:50:17 2024 +0200

    SFTTrainer: Fix backward Compatibility issue with `TrainingArguments` (#1707)

    * fix BC

    * fixup

commit 0bdc63839f1abe67c56befa63251425b1ffc1ace
Author: Guilherme Freire <guilhermebfreire@gmail.com>
Date:   Thu Jun 6 14:42:58 2024 +0100

    Fixed doc string and docs for the SFTConfig update (#1706)

commit 275d33b3ef4f7afd40f79cc53591659bacfa3499
Author: Costa Huang <costa.huang@outlook.com>
Date:   Wed Jun 5 14:34:59 2024 -0400

    0.9.3 release (#1699)

commit c0819ee99fdf673e9843ef91789b928ae9050623
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Wed Jun 5 17:29:03 2024 +0200

    Update sft_trainer.py (#1698)

commit a03e7cc4e443e30eea942ca66bfce19407784f32
Author: Costa Huang <costa.huang@outlook.com>
Date:   Wed Jun 5 11:00:19 2024 -0400

    Release 0.9.2 (#1697)

    * Release: 0.9.0

    * Release

commit a13cb8952c55cfa4fc696d900a1b2a81d329c82d
Author: Costa Huang <costa.huang@outlook.com>
Date:   Wed Jun 5 10:20:54 2024 -0400

    Quick fix on GPT4-eval (#1696)

    * quick fix

    * precommit

commit 84156f179f91f519e48185414391d040112f2d34
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Jun 3 20:09:05 2024 +0200

    Fix typo in DPOTrainer's warnings (#1688)

commit 4eb0b905e28857341123d5329a6ca1b9d929734f
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Mon Jun 3 10:24:32 2024 -0600

    Skip packing validation (#1673)

    * Add test for skipping preproc if packing=True

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

    * Allow skipping of validation for packing=True

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

    * Use dummy dataset in no packing preproc test

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

    ---------

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

commit 6c203f9fef50c41d27fc4ed9965df7e458f02377
Author: Alexey Rozhkov <alexisrozhkov@gmail.com>
Date:   Mon Jun 3 10:16:22 2024 +0100

    Fix overriding optimize_device_cache with optimize_cuda_cache in PPOConfig (#1690)

    * Don't override optimize_device_cache when optimize_cuda_cache is not provided
    Raise an exception when both optimize_cuda_cache and optimize_device_cache are set

    * Minor fix
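Note: the two rules in this commit message (do not override when the old flag is absent, raise when both are set) could look roughly like this. `ToyPPOConfig` is a hypothetical stand-in; the real `PPOConfig` has many more fields and may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPPOConfig:  # hypothetical stand-in, not the real PPOConfig
    optimize_device_cache: bool = False
    optimize_cuda_cache: Optional[bool] = None  # older flag; None means "not provided"

    def __post_init__(self):
        if self.optimize_cuda_cache is None:
            # Old flag not provided: leave optimize_device_cache untouched.
            return
        if self.optimize_device_cache:
            raise ValueError(
                "Both `optimize_device_cache` and `optimize_cuda_cache` were provided"
            )
        # Only the old flag was given: carry its value over to the new one.
        self.optimize_device_cache = self.optimize_cuda_cache
```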

commit f18253bf2d747f68acc9cd89da95c85ebf59dbb9
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date:   Mon Jun 3 09:43:02 2024 +0100

    initial RPO loss (#1686)

    * initial RPO loss

    * fix sign

    * clean up

commit 151a452d14c8ebccbaf8a033812ceb2dc77f634d
Author: Samuel <s.kiegeland@gmx.de>
Date:   Wed May 29 20:29:38 2024 +0200

    Fix max completion length (#1588)

commit 488b502d31c052801eacd9a047bf3db06623e9c2
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Wed May 29 20:19:26 2024 +0200

    fix (#1678)

commit 3c0a10b1aedbe533005dbfe18f2cc8057093f80b
Author: Wang, Yi <yi.a.wang@intel.com>
Date:   Mon May 27 20:52:20 2024 +0800

    fix dataset load error (#1670)

    Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

commit b031adfdb8708f1f295eab6c3f2cb910e8fe0c23
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Fri May 24 15:20:16 2024 +0200

    FIX / PPO: Fix `enable_input_require_grads` issues with PPO models (#1664)

    * Update modeling_base.py

    * Update ppo_config.py

    * Update ppo_trainer.py

    * style

commit e7cb597230bb0c630c67790881b0808f7b16cb05
Author: Costa Huang <costa.huang@outlook.com>
Date:   Thu May 23 11:37:16 2024 -0400

    Fix ppov2 test case (#1661)

    * Fix PPOv2 / RLOO refactor's stuff

    * update terminology to use stop token

commit bc8dfbf4e2169010b3094913a1fa4f888f750111
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date:   Thu May 23 15:28:04 2024 +0200

    update eval_strategy (#1662)

commit e4ed7a3a5aa0f1e1b4f78317b3c7b25e5bf597f4
Author: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Date:   Thu May 23 18:34:22 2024 +0530

    do not upcast adapters when using FSDP+QLoRA (#1654)

commit 9a7efbd05126fa6a1448a95f670e8d04cac90d62
Author: syrn1k <85796210+syrn1k@users.noreply.github.com>
Date:   Thu May 23 15:58:49 2024 +0300

    🤫 TR-DPO implementation (#1593)

    * 🤫 TR-DPO implementation baseline

    * fix comments

    * docs

    * fix linters

    * test added

    * move configs to DPOConfig

    * fix typo

    * add docs

    * fix import

    * use state.global_step

    * fix order of arguments

    * make sure plugins are not none

    * Update trl/trainer/utils.py

    Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

    * Update trl/trainer/utils.py

    Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

    * checking that reference model weights have changed

    * sync_target_model as staticmethod

    * set reference model

    ---------

    Co-authored-by: Nikita Surnachev <n.surnachev@tinkoff.ru>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
    Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

commit b344bcea2c0b30d58ab6ebb0380647f24056ac58
Author: Anush Kini <33577829+Abilityguy@users.noreply.github.com>
Date:   Thu May 23 18:27:25 2024 +0530

    [DPO] Add 'robust' loss_type (#1653)

    * Initial commit

    * pre-commit fix

    * Minor change to comments

    * Added some documentation on how to use Robust DPO

commit 35e12dc5959fa8a08edd72b34aadcb0acb284e51
Author: Nicolinho <Nicolinho@users.noreply.github.com>
Date:   Thu May 23 14:36:15 2024 +0200

    Fix inheritance order in PPOv2Config (#1659)

    * fix inheritance order in PPOv2Config

    * fix inheritance order in rloo_config

commit 1da6be18e0e21a11ee2a2121ae744c5e2e904409
Author: Ali Bakly <anbakly@gmail.com>
Date:   Thu May 23 14:10:29 2024 +0200

    docs: correct cDPO usage in DPOTrainer (#1655)

commit e249cd802fb81cff3c4ceb1427cb666a138221d3
Author: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Date:   Thu May 23 14:10:05 2024 +0200

    add support for training collator (#1658)

commit a02513c3b7085adba5fd18727296f4f4affd3ffb
Author: Zach Mueller <muellerzr@gmail.com>
Date:   Thu May 23 06:48:00 2024 -0400

    Apply deprecated `evaluation_strategy` (#1559)

    * Deprecate

    * Update tests/test_dpo_trainer.py

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit 13454d2f4b243b7260fa4ec828297812c3f975fc
Author: Costa Huang <costa.huang@outlook.com>
Date:   Wed May 22 08:31:10 2024 -0400

    PPO / Reinforce Trainers (#1540)

    * Add ppov2 trainer

    * make eos trick optional, remove unused args

    * quick fix

    * precommit

    * update debugging script

    * fix out of bound `drop_last=True`; use built-in scheduler

    * Add PPO examples

    * push changes

    * quick change

    * quick change

    * various bug fixes

    * remove unnecessary grad accumulation setting

    * push new changes

    * fix DS3 model saving

    * update ppo.py

    * refactor

    * quick change

    * refactor

    * update ppo trainer

    * refactor

    * quick test

    * add ds2 /ds3 7 processes config

    * add vllm trainer

    * quick change

    * experiment with reward normalization

    * push changes

    * quick push

    * push changes

    * push various changes

    * refactor to use ModelConfig

    * quick change

    * refactor

    * refactor

    * Simplify DS logic

    * quick update

    * remove unnecessary files

    * precommit

    * deepspeed fix; handle edge case when eos_token_id = 0

    * add PPO tldr example

    * add TL;DR example

    * fix undefined var

    * utilize all samples in rloo

    * quick setting

    * remove the unnecessary `value_model`

    * use exact_div

    * allow saving the deepspeed model

    * refactor

    * remove dead code

    * Use some shared utilities

    * add some end-to-end test cases

    * add PPOv2 docs and RLOO docs / tests

    * update docs

    * quick push

    * fix ci

    * fix type annotation for ci

    * quick update

    * update trainer docs

commit 99f2c94b2200927a1dc156f16e012dca11f865e1
Author: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Date:   Wed May 15 19:55:46 2024 +0530

    don't cast the trainable lora layers to half precision (#1644)

    * don't cast the trainable lora layers to half precision

    * quality

commit 6401d080c9f97e0610678b12d3d0056347675726
Author: Wing Lian <wing.lian@gmail.com>
Date:   Tue May 14 09:41:07 2024 -0400

    Pairwise Noise Contrastive Alignment (#1632)

    * add NCA paired preference loss

    * chore: lint

    * set more lenient tolerance for integration tests

    * Update tests/test_dpo_trainer.py

    * skip test

    * fix

    ---------

    Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
    Co-authored-by: younesbelkada <younesbelkada@gmail.com>

commit d632a5b289782c7384f5275426054e79acc0b744
Author: bartoszzuk <57541034+bartoszzuk@users.noreply.github.com>
Date:   Tue May 14 12:25:54 2024 +0200

    Fixed wrong logs prefixes in KTOTrainer (#1641)

    * Fixed wrong logs prefixes in KTOTrainer

    * Pre-commit formatting

commit 5aeb752053876cce64f2164a178635db08d96158
Author: Tiezhen WANG <38108242+xianbaoqian@users.noreply.github.com>
Date:   Fri May 10 23:19:15 2024 +0800

    Update sft_llama2.py to work with the latest API (#1637)

    * Update sft_llama2.py to work with the latest API

    SFTTrainer now takes an SFTConfig argument

    * Update dpo_llama2.py

    * precommit

commit b8b89783ca1ab081d25651a9a13e9358cc8e1869
Author: Ilya Gusev <phoenixilya@gmail.com>
Date:   Fri May 10 15:43:13 2024 +0200

    [ORPO] Correct label mask for pad tokens (#1625)

    * [ORPO] Correct label mask for pad tokens

    Recent [fix](57aebe9c36) for calculating NLL loss for a whole sequence introduced a bug. When input_ids are copied to labels, pad tokens are not masked.

    This PR aims to patch this by masking labels based on the attention mask.

    * -100 -> label_pad_token_id

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

    ---------

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
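Note: the masking idea described in this commit, shown on plain Python lists rather than tensors. `mask_pad_labels` is a hypothetical helper written for illustration and is not taken from the ORPO trainer.

```python
def mask_pad_labels(input_ids, attention_mask, label_pad_token_id=-100):
    """Copy input_ids to labels, replacing padded positions with the label pad id."""
    return [
        token if keep == 1 else label_pad_token_id
        for token, keep in zip(input_ids, attention_mask)
    ]


# The pad token id (0 here) also appears as a real token at position 0.
# Masking by attention mask keeps the real token and hides only the padding.
labels = mask_pad_labels([0, 6, 7, 0, 0], [1, 1, 1, 0, 0])
```

Masking on the attention mask, rather than on the token id, keeps a real token that happens to share the pad id.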

commit 8799952876631d7c772ac80f9cbcff155da960e2
Author: Costa Huang <costa.huang@outlook.com>
Date:   Fri May 10 09:32:20 2024 -0400

    visualize rm prediction (#1636)

    * visualize rm prediction

    * quick update

    * quick check

    * quick fix

    * update eval steps

commit 3b4c24946b7d5580fd354b0e3800fc1047b82a41
Author: Xiao Yu <39458711+jasonyux@users.noreply.github.com>
Date:   Fri May 3 18:19:35 2024 -0400

    fixed adding bos and eos token unconditionally (#1591)

    * fixed adding bos and eos token unconditionally

    * fixed typo of tokenizer -> self.tokenizer. Also added update to ORPO

    * fixed code quality, and added BOS/EOS fix to KTO

    * code reformatting with pre-commit run --all-files

    * bug fix: check input id length before checking for EOS/BOS

commit 0347f583e3883f9144a959d1e6f748a4cc91cd09
Author: lewtun <lewis.c.tunstall@gmail.com>
Date:   Fri May 3 15:59:59 2024 +0200

    Fix ZeRO-3 generation context manager (#1617)

* judge refactoring and unittest

* format

* init

* doc

* format

* improve doc

* basejudge

* improve doc and add BaseAPIJudge

* Doc

* style

* refactor callback

* remove openai and pairrm judge from test

* doc

* rm dpo online example

* new prompts and completions

* skip hf judge and add hf token

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2024-07-18 15:16:59 +02:00
98ad01ddfd dpo vlm blog post (#1844)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-17 18:03:49 +02:00
fef8240c23 fix arg parsing in chat.py (#1846)
Co-authored-by: leandro <leandro.vonwerra@spoud.io>
2024-07-17 17:32:17 +02:00
915ffc7c61 add link to DPO datasets collection (#1845) 2024-07-17 11:18:35 -04:00
5828a666bf Fix issues of KTOTrainer (#1840)
* Fix issues of KTOTrainer

* Update trl/trainer/kto_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-07-17 08:46:14 +02:00
052a8e14b5 fix ppov2_trainer tensorboard log bugs (#1836) 2024-07-16 16:08:15 +02:00
a2adfb836a ref_model -> model_ref (#1835)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-15 18:50:29 +02:00
4ebfc5de28 refactor trainer callbacks (#1826)
* refactor trainer callbacks

* fix import

* more import fixes
2024-07-15 11:07:16 -04:00
9e9dc96e67 Added missing token kwarg in Peft model loading (#1825) 2024-07-10 19:11:13 +02:00
7ddef5c158 Make use of trust_remote_code consistent (#1806)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-10 18:26:11 +02:00
a9cddf8c55 Delete unused benchmark.yml workflow. (#1822) 2024-07-10 11:25:07 -04:00
2860ce5091 DPO Llava 1.5 and PaliGemma support (#1797)
* llava support dpo

* add_special_tokens=False only when possible

* format

* pali gemma

* refactor size

* remove image resize

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-09 09:22:52 +02:00
30e33bd92d upgrade gh actions (#1818)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-08 23:37:12 -04:00
d5a0d2d345 Set dev version (#1817) 2024-07-08 11:12:41 -04:00
314e8eb367 fix broken url in docs\source\index.mdx (#1813) 2024-07-08 15:41:36 +02:00
e10792032b 0.9.6 release (#1816) 2024-07-08 09:38:09 -04:00
78045dedc8 Fix TRL_USE_RICH environment variable handling (#1808)
* Add `strtobool` custom implementation from `distutils`

* Fix `TRL_USE_RICH` handling via `strtobool`

* Run `make precommit`
2024-07-07 19:59:26 -04:00
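Note: a sketch of a `strtobool`-style helper in the spirit of the `distutils` function this commit refers to. The accepted spellings follow `distutils.util.strtobool`; returning a `bool` rather than `1`/`0`, the `use_rich` wrapper and its `"0"` default are assumptions made for this illustration.

```python
import os


def strtobool(value: str) -> bool:
    """Convert a string such as "yes"/"off"/"1" into a boolean."""
    value = value.lower()
    if value in ("y", "yes", "t", "true", "on", "1"):
        return True
    if value in ("n", "no", "f", "false", "off", "0"):
        return False
    raise ValueError(f"invalid truth value {value!r}")


def use_rich(environ=os.environ) -> bool:
    # Hypothetical wrapper: an unset variable is treated as "0" (an assumption).
    return strtobool(environ.get("TRL_USE_RICH", "0"))
```

A plain truthiness check on the raw string would treat `"0"` or `"false"` as enabled, since any non-empty string is truthy in Python.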
747612f9d3 Fix torch_dtype handling in {DPO,SFT}Trainer when provided via CLI (#1807)
* Fix `torch_dtype` handling through CLI

The `torch_dtype` is not properly handled when provided via the TRL CLI
since it's provided initially as a string, but is then cast to
`torch.dtype` before providing it to the `{DPO,SFT}Trainer`, which means
that those trainers should handle the scenario where `torch_dtype` is a
`torch.dtype` too.

* Add `torch_dtype` tests in `test_{dpo,sft}_trainer.py`

* Forward contribution credits

* Run `make precommit`

---------

Co-authored-by: Tash Srivastava <yash-srivastava19@users.noreply.github.com>
2024-07-05 16:28:59 +02:00
9e3a35bd3d Remove extra print in reward_trainer.py (#1799)
`print_rich_table` is called twice and the first call doesn't restrict to `num_print_samples`. Remove the first, extra call
2024-07-05 13:29:48 +02:00
4402b36dcf clean examples (#1791)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-07-04 14:29:25 +02:00
78f8228874 Bugfix: Preserve token fields when converting TrainingArguments to SFTConfig (#1794)
* Preserve token fields when converting TrainingArguments to SFTConfig

TrainingArguments.to_dict() redacts token fields, so we have to
individually copy them over when converting to SFTConfig to avoid
breaking push_to_hub functionality.

Also adds a test.

* run precommit

* one-line args_as_dict definition per suggestion from kashif

* generalize token copying to match TrainingArguments behavior

* unwrap |= on dict, to support python 3.8

* use .update instead of |= or for-loop
2024-07-03 20:10:50 +02:00
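Note: a toy reconstruction of the problem and the `.update` fix described in this commit. `ToyTrainingArguments`, `ToySFTConfig` and `to_sft_config` are hypothetical stand-ins; the redaction format used in `to_dict` is an assumption made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyTrainingArguments:  # hypothetical stand-in for TrainingArguments
    output_dir: str = "out"
    hub_token: Optional[str] = None

    def to_dict(self):
        d = asdict(self)
        for key in d:
            if key.endswith("_token"):
                d[key] = f"<{key.upper()}>"  # redact secrets in the serialized form
        return d


@dataclass
class ToySFTConfig(ToyTrainingArguments):  # hypothetical stand-in for SFTConfig
    packing: bool = False


def to_sft_config(args):
    args_as_dict = args.to_dict()
    # Restore the real values of the fields that to_dict() redacted.
    args_as_dict.update(
        {k: getattr(args, k) for k in args_as_dict if k.endswith("_token")}
    )
    return ToySFTConfig(**args_as_dict)
```

Using `.update` instead of `|=` keeps the code valid on Python 3.8, as the last two bullets of the commit mention.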
b6af2edc93 add model_init_kwargs to training_args (#1787) 2024-07-03 08:29:16 +02:00
cd85b14fbb Fixed typo in SFT trainer docs (#1788)
'STFConfig' instead of 'SFTConfig' appears multiple times in the doc, causing error when running the code snippets.
2024-06-29 15:35:48 +02:00
a57544f47a fix docs and examples (#1780) 2024-06-27 15:47:58 +02:00
b68ff96f0c Visual DPO (#1647)
* Remove extra whitespaces

* idefics

* vdpo

* sft idefics

* pad with test

* use prompt instead of tokenizer

* rm name main

* support vlm in tokenize row

* temp fix for regex in lora_target_module

* format

* vdpo

* tmp float16 hard code

* concatenated_forward support for vision

* style and new command line

* all-linear

* format

* delete old examples

* get image

* upcast

* new test

* modified test

* new strat for tokenizer

* rm token transfer

* integrate vision in dpo example

* format

* add FDivergenceType back

* precommit

* pillow test dep

* optional prompt

* `evaluation_strategy` to `eval_strategy`

* revert vsft change (oos)

* update test

* test

* comment and support more in process

* update process

* update doc for vdpo

* caution about limited support

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* revert DPO example changes

* cleaner way to check if a model is vision

* comment

* update vdpo example

* rename

---------

Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-06-26 16:26:37 +02:00
c8c01cc055 Fix Documentation Overflow Issues for Long URLs in SFTConfig (#1774)
* Update sft_config.py

* Update sft_config.py
2024-06-26 11:23:36 +02:00
3479606c8c Remove the leading space in the tldr preference dataset (#1773) 2024-06-26 09:18:22 +02:00
7965b78340 add Efficient Exact Optimization (EXO) (#1735)
* add exo

* fix a detail

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

* Update trl/trainer/dpo_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-06-25 16:47:32 +02:00
56bd1bba26 evaluation_strategy to eval_strategy (#1771)
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
2024-06-25 10:14:26 -04:00
94d53e6617 MoE Models: option to add load balancing loss (#1765)
* KTO: add aux loss

* use router_aux_loss_coef in KtoTrainer when aux_loss enabled

* align optional aux_loss in DPO, KTO, CPO, ORPO

* precommit changes

* fix KL forward kwargs

* add aux_loss docs entry

* apply docs suggestions

---------

Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
2024-06-24 21:27:00 +02:00
b5be100ae0 Added Reward Backpropagation Support (#1585)
* added alignprop template

* added alignprop support

* Update alignprop_trainer.mdx

* Update alignprop_trainer.mdx

* added better why statement

* fixed inference code

* changed self to pipeline

* removed aesthetic classifier

* added aesthetic to auxiliary models

* added unseen prompt logging

* removed unseen prompt log

* fixed minor

* remove not needed import in trl/__init__.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* fixed styling

* updated _toctree

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-06-24 12:05:44 -04:00
6e1652bc5e Add CPO-SimPO method (#1760)
* enable cpo-simpo

* highlight SimPO and CPO-SimPO

* add test for cpo_alpha

* formatting

* Update docs/source/cpo_trainer.mdx

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-06-23 18:54:30 +02:00
65374c6a71 New sentiment and descriptiveness dataset (#1757)
* push changes

* handle edge cases where the chosen and the rejected are the same
2024-06-21 11:20:54 -04:00
9956091112 Add dataset_text_field in examples/scripts/sft.py (#1758) 2024-06-21 11:01:08 +02:00
34d273f227 Support num_train_epochs (#1743)
* add a test case for num_train_epochs

* fix ci

* quick change

* disable push to hub

* debug windows ci

* try another fix

* skip subprocess tests on windows
2024-06-20 13:16:43 -04:00
3bf94492a8 Fix masking of response tokens (#1718)
The current handling of `response_masks` inside the `batch_forward_pass`
function does not take padding into account, which results in a shape
mismatch during masking. Since the response mask is a mask tensor over
the response tokens, it should not be concatenated with a
`torch.zeros(query_length)`, and the masking operation should be done
without slicing.

Remove the concatenation of the response mask, and remove the slicing
from the response mask, since the response mask already has the length
of `end - start + 1`, which is equal to the length of `masks[j, start:end]`.
2024-06-20 11:22:20 -04:00
ba6abee37f Support for returning past_key_values from the model (#1742)
* add support for returning past_key_values from the model

* change order of  keys
2024-06-20 09:14:16 -04:00
a57e75967c Integrate f-divergence to DPO (Follow up) (#1610)
* Step 1: update ppo_trainer and hello_world example

* Step 2: Refine comments and add parameter type

* Step 2: Add missing parameter comments

* Step 1: Organize ptx loss into a function and add ptx_loss to train_stats

* Step 1 updates: add comment to ptx_loss function, fix a bug and add warning message

* Step 2: 1) Add ppo_ptx training example as ppo; 2) separate pretrain data fetch and iterate

* Step 2: Remove loss from columns_to_log in ppo_ptx example

* Remove dataset revision when loading the imdb dataset

* Run pre-commit and fix format issues

* Initial draft of f-divergence fn

* Update f-divergence to avoid overflow

* fix test errors and comments

* Add Unit tests for dpo loss with alpha and js div f

* Adjust format

* Fix test error

* Reverse this update

* Add test cases

* Reverse un-needed updates

* Update code style

* Try to fix code fmt error

* remove extra end line

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-06-19 12:02:51 +02:00
ae23d40f3b change the process function in the example of DPO (#1753)
* change the `process` function in the example of DPO

* fix
2024-06-18 10:07:24 -04:00
83b367b11a CI / KTOTrainer: Remove old tests (#1750)
* remove old tests

* remove datasets

* Update test_dpo_trainer.py

* Update test_dpo_trainer.py
2024-06-18 11:31:17 +02:00
d1ed730ab8 prepare deepspeed to accommodate fp16 and bf16 (#1728)
* prepare deepspeed to accommodate fp16 and bf16

* precommit
2024-06-17 10:50:21 -04:00
8f8e95e25d CPO / DPO: Fix red CI (#1749)
* fix red CI

* precommit
2024-06-17 10:49:00 -04:00
4e23d958f2 fix red CI 2024-06-17 16:41:36 +02:00
50c46205b6 small KTO fixes (#1734)
* add warning for imbalanced data

* update documentation

* update script commands to be same as in dpo

* use batch_size KL examples and batch_size target examples to calculate batch_size losses

* fix deepspeed issue

* speed up forward with no_grad for KL

* add some removed metrics

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

add reference to paper

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* add more detailed comments

* convert assert to ValueError

* Update kto_trainer.py

* precommit formatting

* remove nans in metrics by gathering across machines

* fix formatting

* fix choice of mismatched examples for KL term

* describe weights

* fix hanging issue in distributed training

* linting

* move metrics to cpu

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* remove kto_pair

* speed up data processing

* move bco code inside

* raise error for kto_pair argument

* fix formatting

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Winnie Xu <winnie.xu97@gmail.com>
2024-06-17 10:14:44 -04:00
6105d03f92 TrlParser: Add ignore extra args option (#1748)
* add ignore extra args option

* Update trl/commands/cli_utils.py
2024-06-17 16:01:06 +02:00
e247bbd7d5 CI / core: Pin numpy to !=2.0.0 for CI and to users (#1747)
* Update setup.py

* Update setup.py

* Update setup.py

* Update test_best_of_n_sampler.py

dummy commit

* pin numpy

* Update tests/test_best_of_n_sampler.py

* Update setup.py
2024-06-17 15:16:07 +02:00
3d04496196 better trl parser with yaml config (#1739)
* working trl parser with config

correctly overrides yaml config with command line arguments
adds return_remaining_strings
when return_remaining_strings is False, raises error if yaml contains
extra args that are not in the dataclasses
simpler and cleaner than previous yaml parsing and merging
addresses #1733

* lowercase trlparser
2024-06-17 14:43:33 +02:00
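The behaviour described in the commit body above (YAML values act as defaults, command-line arguments override them, and unknown YAML keys raise an error unless the remaining entries are returned) can be sketched roughly as follows. This is a minimal illustration built on `argparse` alone, with a plain dict standing in for the parsed YAML file. `build_parser` and `parse_with_config` are hypothetical names chosen for this sketch, not the actual `TrlParser` API.

```python
import argparse


def build_parser():
    # Illustrative arguments only; a real training script defines many more.
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--output_dir", type=str, default="output")
    return parser


def parse_with_config(parser, config, argv, return_remaining_strings=False):
    """Hypothetical helper: `config` is a dict standing in for a parsed YAML file."""
    # Argument names the parser knows about (assumes no required arguments).
    known = set(vars(parser.parse_args([])))
    extra = {key: value for key, value in config.items() if key not in known}
    if extra and not return_remaining_strings:
        raise ValueError(f"Config contains unknown keys: {sorted(extra)}")
    # Config values become defaults, so anything given in argv still wins.
    parser.set_defaults(**{k: v for k, v in config.items() if k in known})
    args = parser.parse_args(argv)
    return (args, extra) if return_remaining_strings else args


if __name__ == "__main__":
    yaml_like = {"learning_rate": 2e-5, "output_dir": "from_yaml"}
    args = parse_with_config(build_parser(), yaml_like, ["--learning_rate", "3e-5"])
    print(args.learning_rate, args.output_dir)
```

In the demo, `learning_rate` takes the command-line value while `output_dir` falls back to the config value, which is the precedence the commit describes.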
2d244f8acb Workflow: Notify tests results on slack channel (#1744)
* Update tests-main.yml

* Update docker-build.yml
2024-06-17 11:56:13 +02:00
f5168fdbaf adds AOT (#1701)
* adds AOT

* Applied format changes

* added docs and tests

---------

Co-authored-by: Igor Melnyk <igor.melnyk@ibm.com>
2024-06-12 11:54:54 +02:00
79686e1ac7 ktotrainer: Refuse datasets which contain only one class of labels (#1724)
* ktotrainer: refuse dataset which contain only one class of labels

* ktotrainer: document new dataset constraint
2024-06-11 16:35:31 +02:00
34ebc4ccaf feat(ci): add trufflehog secrets detection (#1721)
* feat(ci): add trufflehog secrets detection

* fix(ci): remove unnecessary permissions
2024-06-10 11:17:54 +02:00
1d84e2b888 Fix default padding_value in dpo_config.py (#1692)
The default padding value in dpo_config should be None, not 0; otherwise it overrides every tokenizer's padding value with 0 by default.
2024-06-07 11:42:08 +02:00
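Why the default matters can be shown with a small sketch. Assuming the trainer falls back to the tokenizer's pad token id only when no padding value was configured, a default of 0 would always win over the tokenizer. `PaddingConfigSketch` and `resolve_padding_value` are hypothetical names for illustration and are not taken from the TRL source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaddingConfigSketch:
    # None means "not set"; an int (including 0) is an explicit user choice.
    padding_value: Optional[int] = None


def resolve_padding_value(config, tokenizer_pad_token_id):
    """Return the configured padding value, else the tokenizer's pad token id."""
    # Compare against None rather than testing truthiness, so an explicit 0 is kept.
    if config.padding_value is not None:
        return config.padding_value
    return tokenizer_pad_token_id


if __name__ == "__main__":
    print(resolve_padding_value(PaddingConfigSketch(), tokenizer_pad_token_id=2))
    print(resolve_padding_value(PaddingConfigSketch(padding_value=0), tokenizer_pad_token_id=2))
```

With `None` as the default, the tokenizer's own pad id is used unless the user deliberately sets a value.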
2f71b8b1e2 fix yaml parser for derived config classes (#1713)
fixes #1712
reformatted cli_utils with ruff
2024-06-07 10:37:27 +02:00
5bcb8ad0d6 RDPO fix nll loss (#1705) 2024-06-07 09:48:17 +02:00
b8b972fde1 Add a variant of CPO, SimPO (#1703)
* add a variant of cpo: simpo

* correct cpo-simpo loss

* avoid 0 int error in logging

* add simpo description

* Update trl/trainer/cpo_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* fix formatting

* add test for simpo

* Update docs/source/cpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* add a docstring for simpogamma

* move simpo description to the above docstring

* change simpo description in the doc

* formatting

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-06-06 17:06:47 -04:00
3eb9ccb104 set dev version (#1710)
* Update setup.py

* Update __init__.py
2024-06-06 13:33:20 -04:00
974b0d380f 0.9.4 release (#1708) 2024-06-06 10:13:00 -04:00
39a7d1c121 SFTTrainer: Fix backward Compatibility issue with TrainingArguments (#1707)
* fix BC

* fixup
2024-06-06 09:50:17 -04:00
0bdc63839f Fixed doc string and docs for the SFTConfig update (#1706) 2024-06-06 09:42:58 -04:00
275d33b3ef 0.9.3 release (#1699) 2024-06-05 14:34:59 -04:00
c0819ee99f Update sft_trainer.py (#1698) 2024-06-05 11:29:03 -04:00
a03e7cc4e4 Release 0.9.2 (#1697)
* Release: 0.9.0

* Release
2024-06-05 11:00:19 -04:00
a13cb8952c Quick fix on GPT4-eval (#1696)
* quick fix

* precommit
2024-06-05 10:20:54 -04:00
84156f179f Fix typo in DPOTrainer's warnings (#1688) 2024-06-03 14:09:05 -04:00
4eb0b905e2 Skip packing validation (#1673)
* Add test for skipping preproc if packing=True

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Allow skipping of validation for packing=True

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Use dummy dataset in no packing preproc test

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

---------

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-06-03 18:24:32 +02:00
6c203f9fef Fix overriding optimize_device_cache with optimize_cuda_cache in PPOConfig (#1690)
* Don't override optimize_device_cache when optimize_cuda_cache is not provided
Raise an exception when both optimize_cuda_cache and optimize_device_cache are set

* Minor fix
2024-06-03 11:16:22 +02:00
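The rule stated in the commit body above (leave `optimize_device_cache` alone when `optimize_cuda_cache` is not provided, and raise when both are set) might look like the following in a dataclass `__post_init__`. This is a sketch under assumptions: the class name `CacheConfigSketch`, the defaults, and the warning text are illustrative and do not reproduce the real `PPOConfig`.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheConfigSketch:
    optimize_device_cache: bool = False
    # Assumed to be a deprecated alias; None means the user did not provide it.
    optimize_cuda_cache: Optional[bool] = None

    def __post_init__(self):
        if self.optimize_cuda_cache is None:
            # Alias not provided: keep whatever optimize_device_cache was set to.
            return
        if self.optimize_device_cache:
            raise ValueError(
                "Set only one of `optimize_cuda_cache` and `optimize_device_cache`."
            )
        warnings.warn(
            "`optimize_cuda_cache` is treated as an alias of `optimize_device_cache`.",
            FutureWarning,
        )
        self.optimize_device_cache = self.optimize_cuda_cache
```

Using `None` as the sentinel for the alias is what lets the config distinguish "not provided" from an explicit `False`.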
f18253bf2d initial RPO loss (#1686)
* initial RPO loss

* fix sign

* clean up
2024-06-03 09:43:02 +01:00
151a452d14 Fix max completion length (#1588) 2024-05-29 20:29:38 +02:00
488b502d31 fix (#1678) 2024-05-29 20:19:26 +02:00
3c0a10b1ae fix dataset load error (#1670)
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
2024-05-27 14:52:20 +02:00
b031adfdb8 FIX / PPO: Fix enable_input_require_grads issues with PPO models (#1664)
* Update modeling_base.py

* Update ppo_config.py

* Update ppo_trainer.py

* style
2024-05-24 15:20:16 +02:00
e7cb597230 Fix ppov2 test case (#1661)
* Fix PPOv2 / RLOO refactor's stuff

* update terminology to use stop token
2024-05-23 11:37:16 -04:00
bc8dfbf4e2 update eval_strategy (#1662) 2024-05-23 15:28:04 +02:00
e4ed7a3a5a do not upcast adapters when using FSDP+QLoRA (#1654) 2024-05-23 15:04:22 +02:00
9a7efbd051 🤫 TR-DPO implementation (#1593)
* 🤫 TR-DPO implementation baseline

* fix comments

* docs

* fix linters

* test added

* move configs to DPOConfig

* fix typo

* add docs

* fix import

* use state.global_step

* fix order of arguments

* make sure plugins are not none

* Update trl/trainer/utils.py

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

* Update trl/trainer/utils.py

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

* checking that reference model weights have changed

* sync_target_model as staticmethod

* set reference model

---------

Co-authored-by: Nikita Surnachev <n.surnachev@tinkoff.ru>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
2024-05-23 14:58:49 +02:00
b344bcea2c [DPO] Add 'robust' loss_type (#1653)
* Initial commit

* pre-commit fix

* Minor change to comments

* Added some documentation on how to use Robust DPO
2024-05-23 14:57:25 +02:00
35e12dc595 Fix inheritance order in PPOv2Config (#1659)
* fix inheritance order in PPOv2Config

* fix inheritance order in rloo_config
2024-05-23 08:36:15 -04:00
1da6be18e0 docs: correct cDPO usage in DPOTrainer (#1655) 2024-05-23 08:10:29 -04:00
e249cd802f add support for training collator (#1658) 2024-05-23 08:10:05 -04:00
a02513c3b7 Apply deprecated evaluation_strategy (#1559)
* Deprecate

* Update tests/test_dpo_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-05-23 12:48:00 +02:00
13454d2f4b PPO / Reinforce Trainers (#1540)
* Add ppov2 trainer

* make eos trick optional, remove unused args

* quick fix

* precommit

* update debugging script

* fix out of bound `drop_last=True`; use built-in scheduler

* Add PPO examples

* push changes

* quick change

* quick change

* various bug fixes

* remove unnecessary grad accumulation setting

* push new changes

* fix DS3 model saving

* update ppo.py

* refactor

* quick change

* refactor

* update ppo trainer

* refactor

* quick test

* add ds2 /ds3 7 processes config

* add vllm trainer

* quick change

* experiment with reward normalization

* push changes

* quick push

* push changes

* push various changes

* refactor to use ModelConfig

* quick change

* refactor

* refactor

* Simplify DS logic

* quick update

* remove unnecessary files

* precommit

* deepspeed fix; handle edge case when eos_token_id = 0

* add PPO tldr example

* add TL;DR example

* fix undefined var

* utilize all samples in rloo

* quick setting

* remove the unnecessary `value_model`

* use exact_div

* allow saving the deepspeed model

* refactor

* remove dead code

* Use some shared utilities

* add some end-to-end test cases

* add PPOv2 docs and RLOO docs / tests

* update docs

* quick push

* fix ci

* fix type annotation for ci

* quick update

* update trainer docs
2024-05-22 08:31:10 -04:00
99f2c94b22 don't cast the trainable lora layers to half precision (#1644)
* don't cast the trainable lora layers to half precision

* quality
2024-05-15 16:25:46 +02:00
6401d080c9 Pairwise Noise Contrastive Alignment (#1632)
* add NCA paired preference loss

* chore: lint

* set more lenient tolerance for integration tests

* Update tests/test_dpo_trainer.py

* skip test

* fix

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
2024-05-14 15:41:07 +02:00
d632a5b289 Fixed wrong logs prefixes in KTOTrainer (#1641)
* Fixed wrong logs prefixes in KTOTrainer

* Pre-commit formatting
2024-05-14 12:25:54 +02:00
5aeb752053 Update sft_llama2.py to work with the latest API (#1637)
* Update sft_llama2.py to work with the latest API

SFTTrainer now takes an SFTConfig argument

* Update dpo_llama2.py

* precommit
2024-05-10 17:19:15 +02:00
b8b89783ca [ORPO] Correct label mask for pad tokens (#1625)
* [ORPO] Correct label mask for pad tokens

Recent [fix](57aebe9c36) for calculating NLL loss for a whole sequence introduced a bug. When input_ids are copied to labels, pad tokens are not masked.

This PR aims to patch this by masking labels based on the attention mask.

* -100 -> label_pad_token_id

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-05-10 15:43:13 +02:00
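The fix described in the commit above, masking labels wherever the attention mask marks a pad position, can be illustrated without any tensor library. The sketch below uses nested Python lists in place of tensors; `mask_padded_labels` is a hypothetical helper, and the default of -100 for `label_pad_token_id` is an assumption based on the "-100 -> label_pad_token_id" bullet.

```python
def mask_padded_labels(input_ids, attention_mask, label_pad_token_id=-100):
    """Copy input_ids to labels, replacing pad positions with label_pad_token_id.

    Both arguments are lists of equal-length rows; a 0 in attention_mask
    marks a padded position that should not contribute to the loss.
    """
    return [
        [token if keep == 1 else label_pad_token_id for token, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


if __name__ == "__main__":
    batch_ids = [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]
    batch_mask = [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
    print(mask_padded_labels(batch_ids, batch_mask))
```

Masking on the attention mask rather than on the pad token id avoids hiding real tokens when the pad id also occurs inside the sequence.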
8799952876 visualize rm prediction (#1636)
* visualize rm prediction

* quick update

* quick check

* quick fix

* update eval steps
2024-05-10 09:32:20 -04:00
3b4c24946b fixed adding bos and eos token unconditionally (#1591)
* fixed adding bos and eos token unconditionally

* fixed typo of tokenizer -> self.tokenizer. Also added update to ORPO

* fixed code quality, and added BOS/EOS fix to KTO

* code reformatting with pre-commit run --all-files

* bug fix: check input id length before checking for EOS/BOS
2024-05-04 00:19:35 +02:00
0347f583e3 Fix ZeRO-3 generation context manager (#1617) 2024-05-03 15:59:59 +02:00
75de236c09 corrects loss function for Self-play Preference Optimization hard label version (#1615)
* corrects sppo hard label version

* formatting

* formatting
2024-05-03 08:09:57 +02:00
7075cec94d Update HH dataset on helpful only subset (#1613)
* Update HH dataset on helpful only subset

* format
2024-05-02 12:12:12 -04:00
adf17a5a26 support loss function for Self-play Preference Optimization (#1612)
* support loss function for Self-play Preference Optimization

* update docs

* update value error msg

* update typehint

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* include sppo in tests

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-05-02 16:06:58 +02:00
0d40e186ee Docs: Fix build main documentation (#1604)
* Fix build documentation

* Update build_pr_documentation.yml
2024-05-02 11:44:29 +02:00
683bc5af6f Excluding tests from setup.py (#1607) 2024-05-02 10:30:27 +02:00
5f0913122b Use auto device map (#1596) 2024-05-02 09:22:31 +02:00
d1aa0b6b2c [KTOTrainer] add BCO (reward shift and underlying distribution matching) (#1599)
* add `Loss Functions` section in the doc.

* add bce loss with reward shift in KTOTrainer

* add underlying distribution matching

* update example to use underlying distribution matching

* add config description

* fix 'referenced before assignment' error

* add 'bco' and 'udm' test cases

* run pre-commit

* add `scikit-learn` dependency

* raise error if sklearn is not available

* call TrainingArguments().__post_init__() for proper init
2024-04-30 14:06:45 +02:00
d88ec14602 Update __init__.py (#1602) 2024-04-30 10:25:43 +02:00
6c18e40e97 fix typo (#1594) 2024-04-29 10:42:31 +02:00
1d0a7ea17b add warning in SFTTrainer (#1577) 2024-04-23 20:00:10 +02:00
9f68ead8cf FIX: Fix CI on transformers main (#1576)
* Update run_dpo.sh

* Update run_sft.sh

* Update clis.mdx

* Update example_config.yaml

* Update test_cli.py

* Update testing_constants.py

* Update test_dpo_trainer.py
2024-04-23 14:31:45 +02:00
f30daa4225 [SFT] add SFT Trainer Config dataclass (#1530)
* initial SFT Config

* remove pdb

* fix chat_template

* undo formatting

* add back removed commits

* fix the tests

* add back options to SftScriptArguments

* use sft_script_args

* Update trl/commands/cli_utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/commands/cli_utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* rename SFTScriptArguments and split names

* formatting docstrings

* docstring

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-04-23 11:55:13 +02:00
24fd8dd513 [DPO] DPOConfig class (#1554)
* initial DPOConfig

* fix doc string

* use DPOConfig

* fix missing import

* fix DpoScriptArguments

* override args config when given in init

* use DPOConfig

* fix output dir name

* over-ride with deprecated arguments if given

* use DPOConfig in tests

* fix comment

* add custom_message

* use dataset_train_name and dataset_test_name

* beta is also in the training_args

* fix loss_type docs

* Update trl/commands/cli_utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/commands/cli_utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/commands/cli_utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* use DPOScriptArguments

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-04-23 11:06:28 +02:00
c050ebc073 [DPO] add 'bco_pair' loss_type (#1524)
* add 'bco_pair' loss_type

* add BCO description to DPO doc

---------

Co-authored-by: sean.jung <sean.jung@seanjungui-MacBookPro.local>
2024-04-22 18:46:51 +02:00
abc0584736 fix add_special_tokens issue for data with template (#1509) 2024-04-22 18:44:10 +02:00
6d1cb85e73 set dev version (#1568) 2024-04-22 10:59:35 +02:00
e90e8d91d2 Release: v0.8.6 (#1567) 2024-04-22 10:58:13 +02:00
113aaae033 CLI: Add warning when ignored params are passed + parse config file if config is passed (#1565)
* add warning

* no need for `config` field
2024-04-22 10:48:59 +02:00
0865572748 Update __init__.py (#1557) 2024-04-18 14:51:40 +02:00
a6532a11c2 set dev version (#1556) 2024-04-18 13:58:17 +02:00
3595eb00e0 Release: v0.8.5 (#1555) 2024-04-18 13:56:36 +02:00
9afd901d0f enable multiple eos tokens (#1553) 2024-04-18 12:19:18 +02:00
e04432d5e3 FIX: make the train / test fields modulable (#1551)
* make the train / test fields modulable

* format

* fix --output_dir issue
2024-04-18 11:33:30 +02:00
75c1c47fcc set dev version (#1548) 2024-04-17 17:25:01 +02:00
a5788ac99b Release: v0.8.4 (#1547) 2024-04-17 17:19:28 +02:00
3bbe7e0407 Fixed ref model not used in PPO generation (#1534) 2024-04-17 07:22:56 -07:00
edf60e826b Update run_sft.sh (#1546) 2024-04-17 16:17:05 +02:00
5d1deb1445 CLI: Set dataset_text_field to None to allow ChatML automatic template (#1545)
* Update cli_utils.py

* Update test_cli.py
2024-04-17 14:45:14 +02:00
476c4b8dc0 [KTO] support to load the adapter twice (#1542)
Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
2024-04-16 17:43:40 +02:00
e823458a6a save_model -> save_pretrained in ppo_trainer.mdx (#1537) 2024-04-15 09:35:03 +02:00
1c0d8bca15 VSFT hotfix - adds gen prompt to template and processor to hub (#1532)
* adds gen prompt to template and processor to hub

* fixes hub model id, removes Path
2024-04-12 20:14:12 +02:00
363369a717 [CPO] fix memory leak due to retained value (#1531) 2024-04-12 15:32:01 +02:00
aba4df02c1 set dev version (#1529) 2024-04-12 12:37:34 +02:00
98226473e4 Release: v0.8.3 (#1528) 2024-04-12 12:22:05 +02:00
87f4c70e60 [CLI] fix imports (#1527) 2024-04-12 12:17:05 +02:00
995f1174da set dev version (#1523) 2024-04-11 15:51:57 +02:00
143e11123d Release: v0.8.2 (#1522) 2024-04-11 15:42:47 +02:00
346c99d222 Adds VLM Training support to SFTTrainer + VSFT script (#1518)
* adds option to skip dataset preparation in SFTTrainer

* before changing the template

* adds support for new schema

* a few fixes to data collator to support new schema

* updates args

* precommit

* adds sys prompt to chat template and other fixes

* updates template, fixes collator for multiple images

* precommit

* rename vsft to vstf_llava

* adding integration tests

* adds integration test for vsft

* precommit

* adds back chat template

* docs

* typo

* adds eval, precommit

* adds peft launch args

* formatting

* fixes no deps tests by checking if PIL lib exists

* Update __init__.py

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-04-11 15:35:59 +02:00
087fe544b0 add data for sfttrainer doc (#1521) 2024-04-11 15:08:43 +02:00
ebbd37ba99 allow pre-tokenized datasets (#1520) 2024-04-11 14:50:39 +02:00
e667550a5a Allow streaming (datasets.IterableDataset) (#1468)
* safe-guard iterabledatasets

* import datasets

* reference the correct IterableDataset

* make pre-commit
2024-04-11 11:11:07 +02:00
57aebe9c36 [ORPO] Update NLL loss to use input_ids instead (#1516)
* Calculate loss on `input_ids` instead of only on response

* Use `concatenated_labels` if `is_encoder_decoder`
2024-04-09 14:10:09 +02:00
85f5fd220d correct metrics (#1514)
Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
2024-04-08 17:09:04 +02:00
4dca169404 use kwargs for RM (#1515) 2024-04-08 17:05:37 +02:00
f35b68a301 Speed up PPO with ZeRO-3 by 10x 🔥 (#1483)
* Speed up PPO by 10x 🔥

* Revert

* Clean up

* Use relative import

* Clean

* Fix typing for docs
2024-04-08 14:30:44 +02:00
5cf863576a Change the device index to device:index (#1490)
Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-04-08 14:20:42 +02:00
9a28b3fd05 Fix RichProgressCallback (#1496)
* fix RichProgressCallback

* Refine code styling in RichProgressCallback tests
2024-04-04 21:13:54 +02:00
4f8057ad23 [KTO] fix interleaving, reporting, hanging bugs (#1499)
* add warning for imbalanced data

* update documentation

* update script commands to be same as in dpo

* use batch_size KL examples and batch_size target examples to calculate batch_size losses

* fix deepspeed issue

* speed up forward with no_grad for KL

* add some removed metrics

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

add reference to paper

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* add more detailed comments

* convert assert to ValueError

* Update kto_trainer.py

* precommit formatting

* remove nans in metrics by gathering across machines

* fix formatting

* fix choice of mismatched examples for KL term

* describe weights

* fix hanging issue in distributed training

* linting

* move metrics to cpu

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* fix tokenization error: lack of bos

* change user warning for weight hyperparams

* minor update to docs

* reshape attention mask

* reformat

* add test for bos/eos tokens

* move dependency location

* Update tests/test_kto_trainer.py

* don't report nan metrics

* don't report nan metrics and remove data interleaving

* fix bugs in calculating metrics

* no need to gather KL term

* minor changes

* use nanmean for losses

* remove disabling of wandb

* revert changes

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-04-03 23:41:12 +02:00
ab0d11d815 Correct ppo_epochs usage (#1480)
* Correct ppo_epochs usage

The usage of ppo_epochs is incorrect here. 

In 8534f0edf8/trl/trainer/ppo_config.py (L104C8-L104C58)

the ppo_epochs was described as "Number of optimisation epochs per batch of samples". 

However, here it is used as the usual epoch number, in which you do one iteration over the training dataset.

* Update ppo_trainer.mdx

* Update docs/source/ppo_trainer.mdx

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-04-02 12:22:16 +02:00
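The distinction drawn in the commit above can be made concrete with a small sketch: passes over the dataset form the outer loop, while `ppo_epochs` counts the optimisation passes applied to each sampled batch. `training_schedule` and its parameters `num_batches` and `dataset_epochs` are hypothetical names used only for this illustration; only `ppo_epochs` comes from the source.

```python
def training_schedule(num_batches, dataset_epochs, ppo_epochs):
    """List every optimisation pass as (dataset_epoch, batch_index, ppo_epoch)."""
    steps = []
    for epoch in range(dataset_epochs):        # one iteration over the training data
        for batch in range(num_batches):       # one batch of sampled responses
            for ppo_epoch in range(ppo_epochs):  # repeated optimisation on that batch
                steps.append((epoch, batch, ppo_epoch))
    return steps


if __name__ == "__main__":
    schedule = training_schedule(num_batches=3, dataset_epochs=2, ppo_epochs=4)
    print(len(schedule), schedule[:5])
```

Treating `ppo_epochs` as the outer dataset-epoch count, as the old documentation did, conflates two different loops.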
c674c66a45 Fix DPO Unsloth example (#1494) 2024-04-02 12:16:56 +02:00
45da5df53e use log1p for loss (#1491) 2024-04-02 12:06:54 +02:00
04fd8d9400 Fix typo in how_to_train.md (#1503)
Said "big" where it should say "bug".
2024-04-02 12:05:07 +02:00
bf2aed3876 add dpo link (#1502) 2024-04-02 12:04:34 +02:00
0ee349dcd4 Update KTO example to use better model and ChatML support (#1485)
* Update KTO example

* Tweak params

* Fix values

* Fix LoRA params
2024-03-27 10:47:42 +01:00
7ff6206510 Ignore chat files (#1486)
* Ignore chat files

* Update .gitignore

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update .gitignore

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2024-03-27 10:44:23 +01:00
e4b20ecbc4 hackey update to ModelConfig to allow lora_target_modules="all-linear" (#1488)
The type hint forces a list, which raises an "all-linear" layer-not-found error. Forcing a string makes it work. Updating the type hint to `Union[str, list[str]]` also raises a parsing error.
2024-03-27 09:04:41 +01:00
6c2f829bb7 [KTO] Use batching to speed up data processing (#1470)
* Refactor test

* Make batched tokenizer

* Make it FAST 🔥!

* Hack to the max

* Run on main process

* Refactor

* Add unit test

* f

* r

* Refactor

* Remove bs

* Refactor to tokenize once

* Add typing

* Add test for KL getter
2024-03-26 19:46:23 +01:00
c4f0f41935 Update KTO example with good dataset & chat format (#1481)
* Update KTO example with good dataset & chat format

* Add error for chat template
2024-03-25 16:56:43 +01:00
dc6a934269 add missing classes (#1479) 2024-03-24 22:08:28 +01:00
9ce7ac6925 Fix hyperparameters in KTO example (#1474)
* Fix hparams in KTO example

* Clean

* Fix
2024-03-24 14:29:22 +01:00
99553c19ae Add use_cache=False in {ORPO,CPO}Trainer.concatenated_forward (#1478)
* Add `use_cache=False` in `concatenated_forward`

Prevents `ORPOTrainer` from using the cache, as it's not required for computing the logits and runs into conflicts with Flash Attention 2

* Add `use_cache=False` to `concatenated_forward`

Co-authored-by: Kashif Rasul <kashif@users.noreply.github.com>

---------

Co-authored-by: Kashif Rasul <kashif@users.noreply.github.com>
2024-03-24 11:33:20 +01:00
2ce8e45bb2 ORPO trainer (#1435)
* initial orpo skeleton

* typos

* calculate orpo loss

* fix class name

* fix tests

* fix typo

* Update docs/source/orpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/orpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/orpo_trainer.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* rename max_target_length

* Update examples/scripts/orpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/orpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/orpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* more docs

* log log_odds_ratio and log_odds

* average_log_prob as per paper

* added logging section

* add nll_loss

* fix typo

* more verbose

* rename log_odds to log_odds_chosen

* allow datasets to be loaded

* remove dup debug arg

* tokenizer exists

* fix typo

* use trl-internal-testing/hh-rlhf-trl-style dataset

* formatting

* add missing imports

* fix output dir name

* Update examples/scripts/orpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* move dataset_num_proc to configs

* Update trl/trainer/orpo_config.py

Co-authored-by: Alvaro Bartolome <alvarobartt@gmail.com>

* Update trl/trainer/orpo_trainer.py

Co-authored-by: Alvaro Bartolome <alvarobartt@gmail.com>

* add ORPOTrainer to readme

* fix typo

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Alvaro Bartolome <alvarobartt@gmail.com>
2024-03-22 22:07:11 +01:00
d1df79f83c Add CPOTrainer (#1382)
* add CPOTrainer

* add docs

* fix formatting

* removed precompute_ref_log_probs arg

* remove precompute_ref_log_probs

* typos

* finish cpo trainer doc

* remove redundant lines

* typo

* formatting

* compute chosen nll loss also for enc-dec models

* fix gradient error of inplace operation for enc-dec models

* formatting

* use CPOConfig

* formatting

* use model_init_kwargs from CPOConfig

* comments in example

* fix doc string

* fix typo in docstring

* update year

* fixed typo

* use preference dataset

* fix learning rate

* move dataset_num_proc to configs

* Update cpo paper link from HF: cpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* update description for CPO: cpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* remove _prepare_deepspeed for cpo

Because CPO does not need init for reference model

* Add explanation to CPO loss

* format

* fix bug when lengths are given

* add CPOTrainer to README

* fix grammar

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-03-22 21:32:45 +01:00
d10f7663b0 [peft] Update test_reward_trainer.py to fix tests (#1471)
* [peft] Update test_reward_trainer.py

Since we are requiring peft >= 0.4.0

* formatting
2024-03-22 19:12:54 +01:00
423991c204 Use the standard dataset for DPO CLI (#1456)
* Use the standard dataset

* update docs

* update dpo examples

* fix cli error

* fix CI

* use trl-internal-testing/hh-rlhf-trl-style
2024-03-20 13:14:08 -04:00
988d4c4e1a set dev version (#1463) 2024-03-20 12:30:48 +01:00
8534f0edf8 Release: v0.8.1 (#1462) 2024-03-20 11:32:06 +01:00
5095e7f948 add eos token to generate (#1459) 2024-03-20 10:30:27 +01:00
9fcf61d706 Fix chat CLI for model revisions (#1458)
* Fix chat CLI for model revisions

* Clean
2024-03-20 09:35:34 +01:00
66b043a910 set dev version (#1454) 2024-03-19 17:30:48 +01:00
f2c71771cc Release: v0.8.0 (#1453)
* Release: v0.7.12

* 0.8.0 instead
2024-03-19 17:19:38 +01:00
631c33cbb3 FEAT: Update README to add DPO + CLIs (#1448)
* Update README.md

* Update README.md

* move dpo/ppo description to docs

* rework readme

* Update README.md

---------

Co-authored-by: leandro <leandro.vonwerra@spoud.io>
2024-03-19 16:55:56 +01:00
3f7ff60528 model --> model_name_or_path (#1452)
* `model` --> `model_name_or_path`

* fix style
2024-03-19 16:52:42 +01:00
1705aebeba Fix yaml parsing issue (#1450) 2024-03-19 16:07:50 +01:00
4e622a9033 chat cli (#1431)
* first draft

* move chat to cli

* fix makefile

* make script less verbose

* fix parsing

* fix style

* add more examples

* fix setup.py

* add copyright

* fix verbose init

* attribute FastChat

* add docs
2024-03-19 12:37:06 +01:00
eb2d5b2972 CI / CLI: Properly raise error when CLI tests failed (#1446)
* properly raise error

* another fix

* Update tests.yml

* Update tests-main.yml
2024-03-19 11:39:07 +01:00
f976c6d234 Before updating tr_loss, make sure tr_loss_step is on the same device. (#1439)
* before updating the loss from dpo, make sure it's on the same device as tr_loss

* Update trl/trainer/dpo_trainer.py

Co-authored-by: guy1992l <83535508+guy1992l@users.noreply.github.com>

---------

Co-authored-by: guy1992l <83535508+guy1992l@users.noreply.github.com>
2024-03-19 10:28:44 +01:00
abc7301bab Fix PPOTrainer README example (#1441)
* Fix example

* Delete newline
2024-03-19 10:18:49 +01:00
6cfa5cfc81 fix doc build on main (#1437) 2024-03-18 14:24:02 +01:00
a2aa0f0b09 FEAT: Add CLIs in TRL ! (#1419)
* CLI V1

* v1 CLI

* add rich enhancements

* revert unindented change

* some comments

* cleaner CLI

* fix

* fix

* remove print callback

* move to cli instead of trl_cli

* revert unneeded changes

* fix test

* Update trl/commands/sft.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* remove redundant strings

* fix import issue

* fix other issues

* add packing

* add config parser

* some refactor

* cleaner

* add example config yaml file

* small refactor

* change a bit the logic

* fix issues here and there

* add CLI in docs

* move to examples/sft

* remove redundant licenses

* make it work on dpo

* set to None

* switch to accelerate and fix many things

* add docs

* more docs

* added tests

* doc clarification

* more docs

* fix CI for windows and python 3.8

* fix

* attempt to fix CI

* fix?

* test

* fix

* tweak?

* fix

* test

* another test

* fix

* test

* fix

* fix

* fix

* skip tests for windows

* test @lvwerra approach

* make dev

* revert unneeded changes

* fix sft dpo

* optimize a bit

* address final comments

* update docs

* final comment

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2024-03-18 12:20:54 +01:00
304e208f77 Create standard dataset for TRL (#1424)
* add scripts to create standard dataset

* precommit

* push changes

* add script to play with
2024-03-14 10:57:48 -04:00
4fe8b027f6 [Kto] torch_dtype kwargs fix (#1429)
* set torch_dtype from string type

* fix test
2024-03-14 13:49:44 +01:00
fb6ebb1e11 [KTO] fix tokenization bugs (#1418)
* add warning for imbalanced data

* update documentation

* update script commands to be same as in dpo

* use batch_size KL examples and batch_size target examples to calculate batch_size losses

* fix deepspeed issue

* speed up forward with no_grad for KL

* add some removed metrics

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

add reference to paper

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* add more detailed comments

* convert assert to ValueError

* Update kto_trainer.py

* precommit formatting

* remove nans in metrics by gathering across machines

* fix formatting

* fix choice of mismatched examples for KL term

* describe weights

* fix hanging issue in distributed training

* linting

* move metrics to cpu

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* fix tokenization error: lack of bos

* change user warning for weight hyperparams

* minor update to docs

* reshape attention mask

* reformat

* add test for bos/eos tokens

* move dependency location

* Update tests/test_kto_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-03-14 08:22:50 +01:00
66078c7c01 CI: Fix CI on main (#1422)
* fix CI on main

* final fix
2024-03-13 13:54:22 +01:00
58c0888996 Add support for FSDP+QLoRA and DeepSpeed ZeRO3+QLoRA (#1416)
* don't do mp casting

* don't use `prepare_for_kbit` when using fsdp+qlora or dsz3+qlora

* changes to enable fsdp+qlora and dsz3+qlora

* revert

* Update sft_trainer.py

* quality

* fix deprecation using changes from PR https://github.com/huggingface/trl/pull/1415

* fixes

* quality

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* quality

* relaunch tests

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-03-13 10:43:45 +01:00
486e7a4071 model init when args are given (#1413)
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
2024-03-11 13:47:37 +01:00
7630f877f9 Fix import error from deprecation in transformers (#1415)
* Fix import error from deprecation in transformers

* Fix import path
2024-03-11 13:23:56 +01:00
4d862da181 [KTO] fix various bugs (#1402)
* add warning for imbalanced data

* update documentation

* update script commands to be same as in dpo

* use batch_size KL examples and batch_size target examples to calculate batch_size losses

* fix deepspeed issue

* speed up forward with no_grad for KL

* add some removed metrics

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

add reference to paper

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* add more detailed comments

* convert assert to ValueError

* Update kto_trainer.py

* precommit formatting

* remove nans in metrics by gathering across machines

* fix formatting

* fix choice of mismatched examples for KL term

* describe weights

* fix hanging issue in distributed training

* linting

* move metrics to cpu

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-03-08 12:04:52 +01:00
22b4f548f4 fix RM script (#1393) 2024-03-07 08:49:52 +01:00
4219cbfedc Fix the pad_token_id error (#1394)
* Fix the pad_token_id error

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Add the load_in_8bit argument in rl_training.py

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Reformat the patch

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Fix the check failed

Signed-off-by: yuanwu <yuan.wu@intel.com>

---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-03-05 02:18:42 +01:00
3bd02380c7 Log ddpo reward as float to fix numpy conversion during bf16 training (#1391) 2024-03-04 02:50:50 +01:00
067db7553a [KTO] prevent nans from appearing in metrics (#1386)
* add warning for imbalanced data

* update documentation

* update script commands to be same as in dpo

* use batch_size KL examples and batch_size target examples to calculate batch_size losses

* fix deepspeed issue

* speed up forward with no_grad for KL

* add some removed metrics

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

add reference to paper

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* add more detailed comments

* convert assert to ValueError

* Update kto_trainer.py

* precommit formatting

* remove nans in metrics by gathering across machines

* fix formatting

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-03-01 12:19:55 +01:00
93e85ed808 [KTO] merge eval dataset only if it exists (#1383)
* merge eval dataset if it exists

* add eval dataset test
2024-03-01 12:15:14 +01:00
14e0d78807 fix bugs in KTO implementation (#1380)
* add warning for imbalanced data

* update documentation

* update script commands to be same as in dpo

* use batch_size KL examples and batch_size target examples to calculate batch_size losses

* fix deepspeed issue

* speed up forward with no_grad for KL

* add some removed metrics

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

* Update trl/trainer/kto_trainer.py

add reference to paper

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* add more detailed comments

* convert assert to ValueError

* Update kto_trainer.py

* precommit formatting

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2024-02-29 09:01:52 +01:00
b32656f726 FIX: Fix the CI again .. (#1374)
* Update tests-main.yml

* Update tests-main.yml

* Update tests-main.yml

* Update .github/workflows/tests-main.yml

* Update tests-main.yml

* Update tests-main.yml
2024-02-27 12:46:20 +01:00
9399bc113b Update tests-main.yml (#1373) 2024-02-27 12:07:50 +01:00
11f122ad49 Update tests-main.yml (#1372) 2024-02-27 11:45:02 +01:00
009c9a610b feature request add force_use_ref_model (#1367) 2024-02-27 11:19:16 +01:00
7712d42f8c add eval_packing (#1369) 2024-02-27 11:19:06 +01:00
7c2213b9e5 add ci message sending on TRL (#1370) 2024-02-27 11:18:55 +01:00
ddeebce176 Add some arguments to support XPU (#1366)
* Add use_bnb and load_in_4bit arguments.

Make it optional and not supported on all platforms

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Change the use_reentrant default value to False

If the default value of gradient_checkpointing is True, set the
use_reentrant default value to False, because otherwise the following
error happens.

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Add model_dtype for loading the model in model_dtype

Signed-off-by: yuanwu <yuan.wu@intel.com>

* Reformat the patch

Signed-off-by: yuanwu <yuan.wu@intel.com>

---------

Signed-off-by: yuanwu <yuan.wu@intel.com>
2024-02-27 02:49:16 +01:00
cf68d871cf Fix version for Python<3.8 (#1363) 2024-02-27 02:41:09 +01:00
2a2676e7ec set seed in sft/dpo/reward_modeling to make result reproducible (#1357)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-02-23 11:12:45 +01:00
ca90cba351 fix 8-bit multi-gpu training bug (#1353)
* fix 8-bit multi-gpu training bug see https://github.com/huggingface/trl/issues/1348

* Update dpo_llama2.py

make gradient_checkpointing_kwargs configurable.

* Update dpo_llama2.py

remove unnecessary config of device_map

* format with make precommit

---------

Co-authored-by: ubuntu <lili@liveremier.ai>
2024-02-23 03:58:43 +01:00
4f97fb4a74 more user-friendly (#1350) 2024-02-22 10:06:35 +01:00
a46cd84a64 Kto trainer (#1181)
* initial file

* initial tokenizer

* UnpairedPreferenceBatchSampler

* use batch_sampler

* use interleave_datasets

* add loss

* fix imports

* use SequentialSampler when training

* formatting

* add other helpers

* add prediction_step

* fix the kto pair docs

* tests

* compute_reference_log_probs

* add get_eval_dataloader

* fix typo

* kto with is_encoder_decoder true

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* fixed typo

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/kto_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/kto_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* renamed KTO dataset keys

* use DPOTrainer's get_batch_logps

* add get_batch_samples

* typo

* Handle last token in prompt

* Create KTOConfig class that subclasses transformers.TrainingArguments

* Update KTO tests to handle KTOConfig

* Update KTO script to use KTOConfig

* formatting

* Update docs/source/dpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/kto_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/kto_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/training_configs.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/kto.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/kto.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* use max_completion_length

* Update examples/scripts/kto.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add back get_batch_logps

* use max_completion_length

* move config to its own file

* Check tokenize params on Trainer init

* Clone labels for enc-dec model to solve RuntimeError

* formatting

* fix enc-dec later

* completion_decoder_input_ids is optional for enc-dec

* fix breaking test

* add a kl key for KL estimation with shuffled completion

* add loss and weights

* fix bug in chosen_idx

* add back metrics

* fix typos

* fix kto_loss docs

* typo

* set loss to None when there is no target completions in batch

* use nan tensor instead of none

* fix reference_logps test

* fix logits

* a bit more robust options

* log only the correct prompt-completion during eval

* Update trl/trainer/kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/kto.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/kto.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/kto_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add docs for desirable_weight and undesirable_weight args

* dropout is always disabled

* remove DDP hack

* formatting

* move more arguments of trainer to config

* comment out T5 test for now

* Add docstring to KTOTrainer

* moved Config docstrings to the appropriate class

* add autodoc to markdown

* formatting

* updated copyright year

* add model tags

* do not add BOS to start of completion

* Move data_collator to KTOTrainer

* formatting

* data_collator is not in args

* shuffle_completion with specific input_columns

* remove all but the needed columns

* Update docs/source/dpo_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/kto.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update tests/test_kto_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* moved more args to kto_config

* fix test

* use all_exhausted strategy and shuffle after

* use KTOConfig in HfArgumentParser

* use ModelConfig

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Pablo Vicente Juan <p.vicente.juan@gmail.com>
2024-02-19 14:43:17 +01:00
1f56bffdf8 Update Example to reflect #aa35fec (#1333) 2024-02-18 14:10:04 +01:00
1bfe0b8fcb set dev version (#1332) 2024-02-16 09:49:05 +01:00
0f13e51efa Release: v0.7.11 (#1331) 2024-02-16 09:05:04 +01:00
1e77d8aeb2 [core / xxxTrainer] Automatic tagging (#1329)
* automatic tagging

* add comments

* fix tests

* fix
2024-02-15 14:47:32 +01:00
3b1911c2a9 add tests on transformers peft main (#1328) 2024-02-15 05:19:31 +01:00
851e7fe556 [core / DDPO] Fix diffusers import issue (#1314)
* fix

* more clean up
2024-02-15 04:45:27 +01:00
31b02d0cd0 Update README.md to clarify model requirement (#1315)
Clarify that language models must be transformers models for text. This is a bit redundant with intro description, but attempts to better address a question that comes up (issue 1257).

Closes: #1257
2024-02-15 04:38:17 +01:00
9bc478ecbb pre-commit: replace linters + formatters with Ruff; fix some issues (#1300)
* pre-commit: replace linters + formatters with Ruff

* Don't use bare except

* Clean up `noqa`s

* Enable Ruff UP; apply auto-fixes

* Enable Ruff B; apply fixes

* Enable Ruff T with exceptions

* Enable Ruff C (complexity); autofix

* Upgrade Ruff to 0.2.0
2024-02-15 04:37:41 +01:00
29f162b86c Best practice recommendation update for dpo_trainer.mdx (#1325)
In the document as it is now, the best practice recommendations seem neither consistent nor correct.

For example, the documentation links a tweet with a recommendation to merge adaptors into a quantized model, and a script that supposedly illustrates how to apply that recommendation. But the script actually does the opposite of what the tweet recommends, first dequantizing the model.

There are similar inconsistencies/ambiguities further in that paragraph. For example, saying that using an unquantized model would lead to lower performance (I changed it to "higher memory demand").

Overall, I updated the paragraph to improve consistency and provided links to slightly more evidence-based merging recommendations.
2024-02-14 11:43:48 +01:00
6852097169 Fix PPOTrainer argument train_dataset -> dataset (#1321)
Both the argument's name as well as the value need to be renamed.
Otherwise we get both

NameError: name 'train_dataset' is not defined

and

TypeError: PPOTrainer.__init__() got an unexpected keyword argument 'train_dataset'
2024-02-06 22:37:04 +01:00
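The two errors quoted in the commit message above can be reproduced with a stand-in class. This is a minimal sketch, not TRL code: `FakePPOTrainer` is a hypothetical class whose only assumed similarity to the real trainer is that its keyword parameter is named `dataset`.

```python
class FakePPOTrainer:
    """Hypothetical stand-in: accepts `dataset`, not `train_dataset`."""

    def __init__(self, config=None, dataset=None):
        self.config = config
        self.dataset = dataset


dataset = ["query one", "query two"]

# Renaming only the variable while keeping the old keyword raises TypeError.
try:
    FakePPOTrainer(train_dataset=dataset)
except TypeError as err:
    keyword_error = str(err)

# Renaming only the keyword while keeping the old variable name raises NameError.
try:
    FakePPOTrainer(dataset=train_dataset)  # noqa: F821 (intentionally undefined)
except NameError as err:
    name_error = str(err)

# Renaming both the keyword and the value works.
trainer = FakePPOTrainer(dataset=dataset)
```

Both halves of the rename are needed, which is why the example has to change the argument name and the value passed to it together.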
f12a1da74b Fix AttributeError in dpo_trainer for reference_free case in dpo_loss function (#1313)
* Update dpo_trainer.py

update reference_free parameter for dpo_loss

* Update dpo_trainer for reference_free case

updated the docstring typo and set device parameter to ref_logratios tensor
2024-02-02 11:02:40 +01:00
ae87b3aefa Fix typos in docs for Multi Adapter RL (MARL). (#1312)
* Fix more typos

* Fix typos in docs.
2024-02-02 07:37:08 +01:00
3f7cee7643 ENH: Run CI only if relevant files are modified (#1309)
* Update tests.yml

* Update .github/workflows/tests.yml
2024-02-01 23:49:32 +01:00
ae8431bd50 Codemod Unittest assertions to bare asserts (#1301)
* Remove stray commas from test data

* Codemod Unittest assertions to bare asserts

* Make `assertAlmostEqual` tests more idiomatic

* DRY some test strings
2024-02-01 23:49:03 +01:00
66a976c6bd Update sft_trainer.mdx to add note on launching DDP training (#1308)
As requested here: https://github.com/huggingface/trl/issues/1303#issuecomment-1920437586
2024-02-01 23:42:14 +01:00
814930377c Add num_proc arg to the eval_dataset processing (#1307) 2024-02-01 17:58:00 +01:00
88685f2cd4 Types: Fix PEP 484 implicit-optional compliance (#1297)
This was done automatically with hauntsaninja/no_implicit_optional.
2024-01-31 14:51:58 +01:00
6f40f20233 Fix DPOTrainer docstrings (#1298)
Some issues caused the auto-generation of the API reference to fail, and the args overlapped on the documentation page
2024-01-31 14:49:41 +01:00
036213bd85 Fix sft trainer when args is None (#1295)
* fix sft trainer when args is None

* add test

* fix
2024-01-31 03:31:53 +01:00
6042596705 Fix DPO slow tests (#1292)
* Update test_dpo_slow.py

* style
2024-01-30 10:15:46 +01:00
070c75ec54 load data only on main process + fix dpo example test (#1291) 2024-01-30 10:14:22 +01:00
b415224a4a fix DPO trainer + mistral + FA2 (#1290) 2024-01-30 08:25:29 +01:00
9186710671 fix padding in dpo trainer (#1284) 2024-01-30 08:24:48 +01:00
aa35fec099 raise value error if one passes a ref_model and a peft_config (#1289) 2024-01-30 08:06:03 +01:00
737d771941 Add multiprocessing in the DPO trainer. (#1286)
* Update dpo_trainer.py

Added support for num_proc to tokenize the training dataset.

* Update dpo_trainer.py

added type in the new num_proc variable

* added test case

* add test case

* fix type

---------

Co-authored-by: imraviagrawal <ravi.agrawal@umass.edu>
Co-authored-by: Ravi Agrawal <raviagrawal@Ravis-MacBook-Pro.local>
2024-01-30 02:55:07 +01:00
ef441ea028 Update dpo_trainer.mdx (#1280) 2024-01-27 10:29:10 +01:00
af623aeba6 Fix sft ci (#1279) 2024-01-26 19:18:23 +01:00
3843cfc32f Fix SFT tuner (#1278) 2024-01-26 17:49:50 +01:00
9a71e67be9 Remove tyro (#1176)
* refactor

* Remove tyro in `ppo.py`

* quick update

* update default args

* quick push

* precommit

* refactor

* quick change

* remove tyro

* quick change

* precommit

* quick change

* fix hello_world

* remove docstring differences

* add `module load cuda/12.1`

* push changes

* precommit

* make dpo runnable

* fix circular import

* quick fix

* refactor

* quick update

* path change

* update plots

* fix docs

* quick change

* Update trl/trainer/model_config.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/model_config.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update examples/scripts/dpo.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* address comments. use attn_implementation

* precommit

* remove duplicate code

* update peft.py

* fix test no op dep

* Update trl/trainer/utils.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* precommit

* add docs

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-01-26 07:51:15 -08:00
09ca565b24 Fix SFTTrainer bugs on TRL main (#1276)
* Update sft_trainer.py

* Update trl/trainer/sft_trainer.py
2024-01-26 13:50:37 +01:00
4edc688311 Only load data on main process (#1255)
* fix: only load data on main process

* define is_main_process once

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* avoid re-initializing PartialState on train dataset check

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* avoid re-initializing PartialState on eval dataset check

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* process dataset on main first to take advantage of caching

* fix typo in docs

* use decorator to manage state

* Revert "fix typo in docs"

This reverts commit 0880a188812a698f7106853245ce1ba96a036831.

* Revert "Revert "fix typo in docs""

This reverts commit ff7ee33fbeedcd0032b728d86a17cfcb10e43f9b.

* Revert "use decorator to manage state"

This reverts commit 7ac7a45949f621941fedc522f0d2ca7b29367c3a.

* use is_local_main_process instead of is_main_process

* fix: use context manager instead of attribute

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-01-26 10:38:07 +01:00
29d439a204 [DPO] average_log_prob when loss is IPO (#1265)
* average_log_prob when loss is IPO

* updated docs with the fix
2024-01-24 12:18:04 +01:00
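The difference between a summed and a length-averaged sequence log-probability, which is what `average_log_prob` in the commit title refers to, can be sketched in plain Python. The function name `sequence_logp` and the list-based inputs are assumptions made for illustration. The trainer itself works on tensors.

```python
def sequence_logp(token_logps, mask, average=False):
    """Combine per-token log-probs over the positions where mask is 1.

    With average=False the result is the sum over kept tokens.
    With average=True the sum is divided by the number of kept tokens.
    """
    kept = [lp for lp, m in zip(token_logps, mask) if m]
    if not kept:
        return 0.0
    total = sum(kept)
    return total / len(kept) if average else total


# The first token belongs to the prompt and is masked out.
token_logps = [-1.0, -2.0, -3.0, -4.0]
mask = [0, 1, 1, 1]

summed = sequence_logp(token_logps, mask)
averaged = sequence_logp(token_logps, mask, average=True)
```

Averaging removes the dependence on completion length, so a long completion is not penalized merely for having more tokens.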
5760e5d3db Fix typo in extra_columns variable name (#1269)
Co-authored-by: Otto Laitila <otto.laitila@op.fi>
2024-01-23 14:46:13 +01:00
a3c5b7178a Update utils.py (#1256) 2024-01-22 15:32:29 +01:00
222d275b8a set dev version (#1254) 2024-01-19 11:58:47 +01:00
09ca7607d5 Release: v0.7.10 (#1253) 2024-01-19 11:52:51 +01:00
1e68753216 fix: fix loss_type and some args desc (#1247) 2024-01-18 17:20:52 +01:00
1f59eeb9bb Fix chatml template (#1248)
* first draft

* 64

* sourabs suggestion

* wip tests

* make style happy

* add check

* docstring

* fix docstring

* Update tests/test_model_utils.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* move tests

* add todo for abstract class

* make style happy

* add slow tests and imports

* add documentation

* Update sft_trainer.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* fix template & add test

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-01-18 16:47:25 +01:00
928d14445e Add setup_chat_format for adding new special tokens to model for training chat models (#1242)
* first draft

* 64

* sourabs suggestion

* wip tests

* make style happy

* add check

* docstring

* fix docstring

* Update tests/test_model_utils.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* move tests

* add todo for abstract class

* make style happy

* add slow tests and imports

* add documentation

* Update sft_trainer.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-01-18 11:05:32 +01:00
3319993bd1 Fix weird doc bug (#1244)
* Update utils.py

* Update trl/trainer/utils.py

* Update trl/trainer/utils.py
2024-01-18 10:48:56 +01:00
4fb3d0c860 Update sft_trainer.py (#1241) 2024-01-17 15:16:07 +01:00
bcccdeb6f9 [core / SFTTrainer] Fix breaking change (#1229)
* fix breaking change

* revert

* fix

* final fix

* fix

* fix tests
2024-01-17 14:45:22 +01:00
ef209e311f [core / tests ] v1 slow tests (#1218)
* v1 slow tests

* nit

* add qlora tests for DPO

* add decorator

* release memory + log reports

* report to none to avoid seg fault issues

* update setup

* fix

* add example testing

* fix nit

* change temp filename

* add workflow file

* fix comment

* add slack push script

* more tests for DPO

* add dpo example tests

* another makefile command

* fix

* add paths + clean up

* nit

* Update slow-tests.yml

* trigger tests

* up

* up

* more fixes

* fix

* final fixes

* minor fixes

* oops

* add more text

* fix

* more

* trigger CI

* up

* fix

* remove

* run the tests on 2 GPUs only

* final fix SFT

* revert config files + address comments

* fix

* add Phi

* final fixes

* final fix
2024-01-17 10:17:57 +01:00
341f6a6787 fix: improve error message when pad_token_id is not configured (#1152)
* fix: improve error message when `pad_token_id` is not configured

* Add test for error raised when pad_token is None

* Fix pre-commit errors

* Fix error in the test environment
2024-01-17 09:34:20 +01:00
97b9fa212a Update dpo_trainer.py (#1160)
Log metrics on all distributed processes
2024-01-15 15:40:44 +01:00
a7d796c9a2 Remove a repeating line in how_to_train.md (#1226) 2024-01-15 15:18:49 +01:00
fa074e6a15 Create slow-tests.yml (#1223) 2024-01-12 09:29:57 +01:00
776939dcc4 Add support for ChatML dataset format in (#1208)
* Add support for ChatML dataset format in SFTTrainer

* fix formatting

* fix tests

* more comment

* fix intent

* fix doc string

* Update dataset_formatting.py

* Update dataset_formatting.py

* add documentation

* Update sft_trainer.mdx

* add leonardos comment and more tests

* added more tests and fixed batching

* style

* comment in
2024-01-12 08:05:32 +01:00
163ca9f059 Refactor RewardConfig to own module (#1221)
* Refactor RewardConfig to own module

* Fix init

* Fix import
2024-01-12 17:50:37 +11:00
2eeb7b04cf [core / Docker] Add workflow to build TRL docker images (#1215)
* add docker build

* Update docker/trl-latest-gpu/Dockerfile

* Update docker/trl-source-gpu/Dockerfile
2024-01-11 11:03:43 +01:00
9f8d0e48ad Fix args type (#1214)
* fix args type

* add args desc
2024-01-10 16:35:19 +01:00
c9b7145c75 Update Unsloth SFT, DPO docs (#1213)
* Update sft_trainer.mdx

* Update sft_trainer.mdx

* Update dpo_trainer.mdx

* Update dpo_trainer.mdx

* Update sft_trainer.mdx
2024-01-10 09:08:08 +01:00
baf3c1c293 Fix FSDP error (#1196)
* Fix FSDP error

Fixes error when `loss` field of model output is non-empty, and indexing as [0] returns loss instead of logits. Can happen with FSDP.

* Apply suggestions from code review

force return_dict

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-01-09 18:21:23 +01:00
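The failure mode described in the commit above, where `[0]` returns the loss once the `loss` field is populated, can be illustrated with a small stand-in container. `FakeOutput` is hypothetical. It only assumes that integer indexing skips fields that are `None`, which is how tuple-style access on model outputs is understood to behave here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FakeOutput:
    """Hypothetical model output whose integer indexing skips None fields."""

    loss: Optional[float] = None
    logits: Optional[list] = None

    def __getitem__(self, index):
        present = [
            getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        ]
        return present[index]


without_loss = FakeOutput(logits=[0.1, 0.9])
with_loss = FakeOutput(loss=2.3, logits=[0.1, 0.9])

first_without_loss = without_loss[0]  # the logits, as the caller expected
first_with_loss = with_loss[0]        # the loss, because it is now the first non-None field
logits_by_name = with_loss.logits     # attribute access is unaffected by the loss field
```

Accessing the field by name, which is what forcing a dict-style return makes possible, avoids depending on which fields happen to be populated.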
b181e401a7 Fix shape descriptions in calculate_loss method (#1204) 2024-01-09 14:24:41 +01:00
26da9e80cb Check tokenize params on DPOTrainer (#1197)
* Check if tokenizer and max len params are None

* Update warning messages for missing parameters
2024-01-09 14:10:22 +01:00
d6cc88ab2c set dev version (#1207) 2024-01-09 13:06:30 +01:00
7a95cc8696 release: v0.7.9 (#1206) 2024-01-09 13:02:31 +01:00
d1715514de Revert "Address issue #1122 (#1174)" (#1205)
This reverts commit d57d0f9ca46a63d370b91791352edda0154576f5.
2024-01-09 10:20:50 +01:00
d116887ed4 [DPOTrainer] Fix peft + DPO + bf16 if one uses generate_during_eval or pre-computed logits (#1203)
* fix peft + DPO + bf16

* fix

* revert old behaviour

* fix tests

* fix

* fix

* fix

* fix
2024-01-09 09:35:50 +01:00
a236c5750f Fix reported KL in PPO trainer (#1180)
* Fix reported KL in PPO trainer

previously this was always reporting the estimated KL, even when using `kl_penalty = 'full'` (or `abs`, etc).
Now we return the actual KL calculated in `compute_rewards()`, and report that.

* fix test
2024-01-09 06:48:25 +01:00
4ae35afdd6 Fix instruction token masking (#1185)
* Fix instruction token masking

Fix instruction token masking if the first instruction is tokenized differently than the others, or in general if no instruction is detected before the first response.

* Bugfix for edge case

(in case either of the templates isn't found at all, ...idxs[0] might not exist)

* Add test for instruction masking fix
2024-01-09 06:41:53 +01:00
b21ed0ddbc set dev version (#1201) 2024-01-09 05:19:10 +01:00
384b868fe6 Release: v0.7.8 (#1200) 2024-01-09 05:13:26 +01:00
3267be0fcd Allow swapping PEFT adapters for target/ref model. (#1193)
* Allow swapping PEFT adapters for target/ref model.

* Update DPOTrainer docs.

* python format

* isort

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/dpo_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2024-01-08 16:12:45 +01:00
dbcb2f0021 Allow separate devices for target/ref models. (#1190)
* Allow separate devices for target/ref models.

* Remove original/duplicate.

* Cleanup original, black formatting.

---------

Co-authored-by: Jon Durbin <jonathan@convai.com>
2024-01-08 10:26:40 +01:00
d5910b0ff5 Handle last token from generation prompt (#1153)
* Handle last token from generation prompt

* Remove prints

* Reformat dpo_trainer file
2024-01-08 09:15:53 +01:00
104a02d207 SFTTrainer: follow args.remove_unused_columns (#1188) 2024-01-08 06:09:10 +01:00
ad597dbcb3 Fix misleading variable "epoch" from the training loop from PPOTrainer Doc. (#1171)
* Fix misleading variable "epoch" from PPOTrainer Doc. 

The variable name “epoch” is misleading in the original doc: the dataloader contains the data for one epoch only, not for all epochs, so
"for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader))"
is misleading and does not actually store the epoch number.

The correct version comes from the TRL PPO notebook tutorial 
(https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment-control.ipynb), which uses an outer loop to capture the epochs.

I posted also the question on forum: https://discuss.huggingface.co/t/confusing-and-possibly-misleading-ppo-trainer-code-from-trl-api-doc-tutorial/67531

* Remove batch_id
2024-01-08 05:50:00 +01:00
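The naming problem described in the commit above can be illustrated with a minimal sketch. The `dataloader` list below is a hypothetical stand-in for `ppo_trainer.dataloader` (one pass over the data), and `num_epochs` is an assumed setting; neither is taken from the TRL code.

```python
# Hypothetical stand-in for ppo_trainer.dataloader: three batches make up ONE pass.
dataloader = [["q1", "q2"], ["q3", "q4"], ["q5", "q6"]]
num_epochs = 2  # assumed value, for illustration only

# Misleading form: the counter is named "epoch" but is really a batch index.
misleading = [epoch for epoch, batch in enumerate(dataloader)]

# Clearer form: an outer loop tracks the epoch, an inner loop walks the batches.
seen = []
for epoch in range(num_epochs):
    for batch in dataloader:
        seen.append((epoch, batch))
```

In the first form the counter only ever covers the batches of a single pass, which is why the fix introduces an outer loop over epochs.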
d57d0f9ca4 Address issue #1122 (#1174)
* Address issue #1122

    Issue [#1122](https://github.com/huggingface/trl/issues/1122)
    takes care of an inconsistency between `_prepare_packed_dataloader`
    and `_prepare_non_packed_dataloader`

* made attention_mask field in ConstantLengthDataset a tensor
2024-01-08 05:43:34 +01:00
ec3d41b879 Fix batch all gather (#1177)
* Fix batch all gather

* quick fix
2024-01-04 17:41:52 +01:00
be32d304db Update sft_trainer.py (#1162)
Fix spelling mistakes in argument description for trl -> SFT Trainer
2024-01-04 16:33:53 +01:00
dc53b8c6b0 Correct shape (#1170) 2024-01-04 16:27:39 +01:00
20428c48ba add: support for peft in ddpo. (#1165)
* add: support for peft in ddpo.

* revert to the original modeling_base.

* style

* specify weight_name

* explicitly specify weight_name

* fix: parameter parsing

* fix: trainable_layers.

* parameterize use_lora.

* fix one more trainable_layers

* debug

* debug

* more fixes.

* manually set unet of sd_pipeline

* make trainable_layers cleaner.

* more fixes

* remove prints.

* tester class for LoRA too.
2024-01-02 12:52:36 +01:00
6614b8aa6b Minor fixes to some comments in some examples. (#1156) 2023-12-29 14:12:05 +01:00
df7b770da8 change device order of metrics (#1154) 2023-12-29 10:55:58 +01:00
18a33ffcd3 SFT Tokenizer Fix (#1142) 2023-12-27 10:25:56 +01:00
911d3658e2 [xxxTrainer] Add unsloth tag (#1130)
* add unsloth tag

* add it on all trainers

* few changes

* add in docs

* revert

* final commit
2023-12-26 16:39:10 +01:00
95ec8577df add peft_module_casting_to_bf16 in DPOTrainer (#1143)
* add peft_module_casting_to_bf16 in DPOTrainer

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update trl/trainer/dpo_trainer.py

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
2023-12-26 11:25:53 +01:00
3539f3e3cd set dev version (#1145) 2023-12-26 10:26:15 +01:00
e451298b50 Release: v0.7.7 (#1144) 2023-12-26 10:24:47 +01:00
3efb484694 [PPOTrainer / DDPOTrainer] Fix ppo & ddpo push to Hub (#1141)
* fix ppo push to Hub

* fix also ddpo

* more tags
2023-12-26 10:06:20 +01:00
8f5b4923c8 reformatted (#1128) 2023-12-23 10:16:27 +01:00
e0dec27272 reformatted (#1129) 2023-12-23 10:13:38 +01:00
6ef785a6fb Add type hints to core.py (#1097)
* Add type hinting to core.py functions

* Fixes

* Remove unused functions

* Remove unused import
2023-12-22 17:05:20 +01:00
950ee2187d clear up the parameters of supervised_finetuning.py (#1126)
no_gradient_checkpointing is always false

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2023-12-22 17:00:28 +01:00
c1bb1f39f6 set dev version (#1135) 2023-12-22 15:09:37 +01:00
54babd9508 Release: v0.7.6 (#1134) 2023-12-22 15:03:24 +01:00
0c4edb750e [xxxTrainer] multi-tags support for tagging (#1133)
* multi-tags support for tagging

* oops
2023-12-22 14:52:16 +01:00
17ec68d980 set dev version (#1132) 2023-12-22 14:12:24 +01:00
9be5680039 Release: v0.7.5 (#1131) 2023-12-22 14:01:44 +01:00
f11e213fd8 [Docs] Add unsloth optimizations in TRL's documentation (#1119)
* add unsloth

* Update sft_trainer.mdx (#1124)

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2023-12-22 13:45:26 +01:00
814fe396d4 rename kto loss (#1127) 2023-12-22 13:32:16 +01:00
06b7959b72 save eval_dataset for subsequent calls (#1125) 2023-12-21 17:28:56 +01:00
b07935f867 [xxxTrainer] Add tags to all trainers in TRL (#1120)
* add tags to sfttrainer

* extend it to other trainers

* add for ddpo
2023-12-21 17:04:18 +01:00
2aff709144 Update description in setup.py (#1101) 2023-12-21 15:35:12 +01:00
830cadfc4c fix gradient checkpointing when using PEFT (#1118) 2023-12-20 13:35:56 +01:00
f2acd821e0 Make prepending of bos token configurable. (#1114)
* make prepending of bos token configurable.

* address comments

* fix bug

Co-Authored-By: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-12-20 11:28:50 +01:00
f100ca34cc peft_module_casting_to_bf16 util method, append_concat_token flag, remove callback PeftSavingCallback (#1110)
* SFT Trainer enhancements

* remove the callback `PeftSavingCallback`

* bump the version of transformers to `4.31.0`

* remove `PeftSavingCallback` from all places.
2023-12-19 17:43:25 +01:00
d708ec272f [Feature] Add Ascend NPU accelerator support (#1096)
* add npu support

* make precommit
2023-12-15 15:34:35 +01:00
8140129595 Updated documentation for docs/source/reward_trainer.mdx to import the correct Enum for the reward modelling LoRA config (#1092) 2023-12-15 11:24:20 +01:00
48b3ef0b7b [DPO] use ref model logprobs if it exists in the data (#885)
* use logprobs if it exists in the batch

* add features to tokenized batch if in data

* make get_batch_logps a static method

* add tokenize_batch_element dataset mapper

* Remove tokenize_batch method from DPODataCollator

* Initial sketch to precompute reference_logps

* run ref model via pytorch dataloader

* add a padding helper

* clean up the helper

* use logprob item()

* default behaviour

* clean up collator

* add docstring

* copy data back to cpu if needed

* use get_train_dataloader methods

* fix tests

* rename: more explicit variable name precompute_ref_log_probs

* improve comment

* update comment

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* refactor models into setup parameters

* parametrize precompute_ref_log_probs flag

* remove useless test

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update tests/test_dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update tests/test_dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* update function arg name

* distinguish between pad token_id and mask values

* fix tokenization #932 by @nrailg

* fix test

* undo test refactor

* new line

* undo breaking change

* Update token counter condition to allow Llama tokenizer

* Account for merged tokens on certain tokenizers such as the Llama-2 tokenizer

* Update variable name to match list value when truncating response

* map function on multi-gpu and gather

* Add test cases for DPOTrainer tokenization step

* revert since we need the prepared model

* Use gather_with_metrics on ref_logps precomputation to keep original dataset size

* Add flag to keep track of when ref_logps are precomputed

* make variable names private

* formatting

* if precompute_ref_log_probs is true one can use non-peft to populate log-probs

* Use tokenizer padding token unless padding_value is set

* Move dataset.map(tokenize_batch) outside dataloader to avoid serialization errors

* eval can be none

* move to cpu to avoid gpu oom

* remove unneeded cast to float32

* remove unneeded

* fix merge

* fix merge

* fix merge

* add precompute log-prob status via tqdm

* Truncate answer if too long once prompt has been truncated

* Add prompt_input_ids to batch to enable generation

* formatting and add lora example

* fix formatting

* Tokenize row now expects sample to have space on chosen/rejected for llama

* Revert "Tokenize row now expects sample to have space on chosen/rejected for llama"

This reverts commit dd07a10fe8c19b6ac6bbcc7b8144189756710d52.

* raise error when using zero-3 with precompute_ref_log_probs

---------

Co-authored-by: Pablo Vicente Juan <p.vicente.juan@gmail.com>
Co-authored-by: Shoaib Burq <saburq@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-12-12 17:16:46 +01:00
c0ce52ab26 consistency on log (#1084) 2023-12-12 10:58:21 +01:00
393dbf6749 Removing tyro in sft_llama2.py (#1081)
* refactor

* precommit
2023-12-11 11:28:20 -06:00
94fa4b022b Make CI happy (#1080)
* Update test_ppo_trainer.py

* Update test_ppo_trainer.py

* Update test_ppo_trainer.py
2023-12-11 16:52:17 +01:00
cb7819e627 add local folder support as input for rl_training. (#1078)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2023-12-11 16:37:01 +01:00
8f0fc4c8f7 Add args to SFT example (#1079) 2023-12-11 16:16:47 +01:00
d275cb431e [DPO] add KTO loss (#1075)
* add KTO loss

* fix docs

* Update trl/trainer/dpo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* formatting

* add link to papers

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-12-11 11:41:03 +01:00
7d0a8eea4e Add missing loss_type in ValueError message (#1067) 2023-12-07 08:40:53 +01:00
5a233546ee enable multiple eval datasets (#1052)
* enable multiple eval datasets

* added test

* try to avoid infinite computation

* make sure eval set is not infinite

* downsizing the test
2023-12-06 20:26:24 +01:00
9fb00cf007 [SFTTrainer] Fix Trainer when args is None (#1064)
* fix sfttrainer when args is None

* oops
2023-12-06 19:02:09 +01:00
ee44946814 [core] Fix failing tests on main (#1065)
* fix tests on main

* fix last test
2023-12-06 18:31:02 +01:00
7f2401bd6e update doc for the compute_metrics argument of SFTTrainer (#1062) 2023-12-06 17:46:36 +01:00
23bf9d4b58 Improve PreTrainedModelWrapper._get_current_device (#1048)
* use LOCAL_RANK in _get_current_device

* use PartialState in _get_current_device

* update annotation
2023-12-05 17:47:40 +01:00
501c347083 Update doc CI (#1060) 2023-12-05 13:31:01 +01:00
f06f357e9c [SFT Trainer] precompute packed iterable into a dataset (#979)
* precompute packed iterable into a dataset

* add generator function

* fix typo

* fix style

* fix test

* fix style

* add test

* minor refactor

* fix test

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* style

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
2023-12-04 13:13:18 +01:00
4cdc03ab5c Fixing accelerator version function call. (#1056)
Co-authored-by: Partha Ghosh <pghosh@brown.is.localnet>
2023-12-04 12:39:58 +01:00
a60ceefa69 Update dpo_trainer.py (#1049) 2023-12-01 17:03:09 +01:00
baa8f09cb3 Revert "[DPO] Refactor eval logging of dpo trainer (#954)" (#1047)
This reverts commit 6d9ea38ae18c7e266f797b62de4a68a12a13aba4.
2023-12-01 10:33:31 +01:00
c859f5fa5f remove spurious optimize_cuda_cache deprecation warning on init (#1045)
Signed-off-by: Chander Govindarajan <mail@chandergovind.org>
2023-12-01 10:26:42 +01:00
481ef96293 Fixes reward and text gathering in distributed training (#850)
* adds a tensor gather on rewards

* adds dist gather on texts

* style

* adds a tensor gather on rewards

* adds dist gather on texts

* style

* simplifies gathering of rewards

* style

* simplify logic

* precommit

* Update trl/trainer/ppo_trainer.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* quick change

* push changes

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-11-30 10:32:09 -05:00
6d9ea38ae1 [DPO] Refactor eval logging of dpo trainer (#954)
* first attempts at refactor of dpo trainer

* removed extra stuff in prediction step

* import fixes

* label names

* all working

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-11-30 12:09:33 +01:00
c203e47fbf spelling is hard (#1043) 2023-11-30 12:09:13 +01:00
c84e5918a6 [DPO] cDPO loss (#1035)
* add cDPO loss

* add comment

* docs

* info about label_smoothing not being used
2023-11-30 11:50:30 +01:00
4b67af37b6 Update utils.py (#1012)
* Update utils.py

update compute_accuracy to deal with the cases where str_chosen and str_rej got the same scores, which is probably what the developers don't want

* Update utils.py

updated so only warning is reserved

* Update trl/trainer/utils.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-11-29 16:02:50 +01:00
55d7c952c7 [DPO] IPO Training loss (#1022)
* initial IPO loss

* fix loss

* fixed comments

* added docs

* fix doc-strings

* add tests

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fixes for review

* Added doc about beta in the Trainer's docstring

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-11-24 15:52:40 +01:00
3719f7a929 Add missing elements to sft_trainer document (#1029) 2023-11-23 12:34:27 +01:00
e7961e45f1 Remove duplicate data loading in rl_training.py (#1020)
We load dataset twice, but in line 149 (new), we do 
`ds = train_dataset.map` anyway
2023-11-23 12:25:07 +01:00
b307faf07b [Multi-Adapter PPO] Fix and Refactor reward model adapter (#982)
* reward adapter loaded as part of init

more flexible, clearer args

* fixed script for multi gpu

unwrap model since it is DDP
downside, with reward adapter it seems we need to use
find_unused_parameters=True

* remove gradient from reward score calculation

* change supported_args back to None
2023-11-21 14:48:18 +01:00
aea1da8e2b Adds requires_grad to input for non-quantized peft models (#1006)
* Update sft_trainer.py

* style

* add tests
2023-11-20 15:57:46 +01:00
e5eb4db8b5 Update how_to_train.md (#1003)
* Update how_to_train.md

fix description about `min_new_tokens`

* Update docs/source/how_to_train.md

Co-authored-by: Costa Huang <costa.huang@outlook.com>

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-11-20 10:33:34 +01:00
28bdb6a373 Fixed wrong trigger for warning (#971)
func.__code__.co_varnames was used to count the function arguments for formatting_func. This code actually counted the function variables rather than function parameters.
2023-11-15 14:36:54 +01:00
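The difference between function variables and function parameters mentioned in the commit above can be checked with a small sketch. The `formatting_func` below is a hypothetical example, not TRL's code; it takes one parameter and defines two local variables.

```python
import inspect


def formatting_func(example):
    # One parameter (example) plus two local variables (prompt, text).
    prompt = example["prompt"]
    text = prompt + example["completion"]
    return text


code = formatting_func.__code__
n_varnames = len(code.co_varnames)  # counts parameters AND local variables
n_params = code.co_argcount         # counts positional parameters only
n_sig = len(inspect.signature(formatting_func).parameters)  # parameters via inspect
```

Here `co_varnames` reports three names while the function accepts a single argument, so a check based on `len(co_varnames)` would fire for the wrong reason.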
e140d22881 make distributed true for multiple process (#997)
* make distributed true for multiple process

* Update trl/trainer/ppo_trainer.py

distributed should have more than 1 process

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-11-15 11:20:25 +01:00
e23a541af9 add docs (#992) 2023-11-14 19:31:10 +01:00
be3faa768e [DataCollatorForCompletionOnlyLM] warn if eos_token_id and pad_token_id are identical (#988)
Display a warning message if the `eos_token_id` and `pad_token_id` values are the same in order to prevent unintended behavior during multi-turn training.
2023-11-14 19:24:56 +01:00
13679aa97e Update README.md (#994) 2023-11-14 18:29:08 +01:00
9e9f024399 Fix a bunch of outdated references to examples/ (#977) 2023-11-10 11:29:21 +01:00
c2884b5096 [Tests] Add non optional packages tests (#974)
* add non-peft tests

* change name

* test

* change

* fix test
2023-11-09 15:01:46 +01:00
2f726ce4e8 set dev version (#970) 2023-11-08 11:54:01 +01:00
a78a05d7b7 Release: v0.7.4 2023-11-08 10:30:29 +00:00
1b258247cd Pin bnb to <=0.41.1 (#968)
* pin bnb to 0.41.1

* Update setup.py

* Update setup.py
2023-11-08 11:28:17 +01:00
9c93dec05e fix peft config typehint (#967) 2023-11-08 11:11:39 +01:00
d1dad6ebda set dev version (#966) 2023-11-08 11:00:24 +01:00
8ce810250e Release: v0.7.3 (#965) 2023-11-08 10:52:47 +01:00
8e9cae8072 fix: dpo trainer ds config (#957)
* fix: dpo trainer ds config

ref_model and model shouldn't share the same ds config, so we shouldn't modify the ds config directly; otherwise, something goes wrong when initializing the DeepSpeed engine

* fix: import sort

import sort by isort
2023-11-06 14:37:04 +01:00
654543a8cf Added support for custom EncoderDecoder models (#911) 2023-11-06 09:52:10 +01:00
c273b18c1c Adds model kwargs to SFT and DPO trainers (#951)
* adds model kwargs to SFT and DPO trainers

* adds checks for model_kwarg passing when model is not str

* changed warning to ValueError

* renames model_kwargs to model_init_kwargs

* corrects argument names in
2023-11-06 09:48:18 +01:00
6c6ff24926 [DPO] Merge initial peft model if trainer has a peft_config (#956)
* failing test
Co-authored-by: Shoaib Burq <saburq@gmail.com>

* merge initial peft model
2023-11-06 09:45:46 +01:00
6ff0fac2c1 Fix unwrapping peft models (#948)
* First unwrap the model and then process the input embeddings

* Changed base_model to base_model.model to stay consistent with peft model abstractions
2023-11-05 08:31:47 +01:00
951ca1841f [CI] Fix CI with new transformers release (#946)
* fix CI with transformers release

* final fix
2023-11-03 10:38:58 +01:00
cc1de9820a Introducing the Iterative Trainer (#737)
* initial skeleton

* iterative trainer for decoder only

* iterative trainer unittest

* encoder_decoder support

* fix typo in unittest

* init

* fix typo

* fix init typo

* adding loggings and safety checker

* fixed minor issues

* doc

* table of contents update

* add test for seq2seq2 models

* change year

* adding text as step input

* precommit

* fixing typo

* run precommit

* fixing typo in safety checker

* fix text tokenization issue

* add truncate and inherit from trainer

* remove iterative config from tests

* remove iterative config from init

* fix peft model

* change truncation side based on truncation_mode

* removed iterativeconfig autodoc

* fixed typo in trainer.mdx

* remove mention of iterative config in docs

* make sure optimizer and scheduler are created

* adding max_steps to test

* remove log_stats fn

* remove compute loss

* fixing encoder decoder detection

* fix PPODecorator

* run precommit

* fix testing

* fix small typos in iterative trainer

* adapted function log and eval
2023-11-02 17:37:48 +01:00
a64a522fcc Update dpo_trainer.py (#941) 2023-11-02 11:27:49 +01:00
5b32372b71 Optionally logging reference response (#847)
* Optionally logging reference response

* log ref rewards as well

* peft logic re-write

* fix peft test case

* refactor

* push changes

* test

* Apply suggestions from code review

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* quick fix

* black

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-10-31 17:55:09 -04:00
d759004e52 Fix stale bot (#935)
* Update stale.py

* Update stale.py

* fix
2023-10-31 20:10:38 +01:00
cbc6c9bb3e [core / DDP] Fix RM trainer + DDP + quantization + propagate gradient_checkpointing_kwargs in SFT & DPO (#912)
* make use of forward hooks

* correctly delete attributes

* fix RM DDP issues

* revert unneeded changes

* more fixes

* fix diff

* fix

* propagate to SFT

* Update examples/scripts/reward_modeling.py

* propagate the fix on DPO trainer

* add to example scripts

* trigger CI
2023-10-31 18:50:17 +01:00
f3cd86578b Update dpo_llama2.py (#934) 2023-10-31 18:20:53 +01:00
b763432eaf [SFTTrainer] Make sure to not conflict between transformers and TRL implementation (#933)
* standardize neftune

* up

* fix again
2023-10-31 16:04:09 +01:00
2bbd594ec5 hotfix for dpo trainer (#919)
addresses #914
2023-10-31 10:58:41 +01:00
b89b712dbf fix DPO + GC issues (#927) 2023-10-31 10:55:46 +01:00
ec9e76623e [Feature] Enable Intel XPU support (#839)
* enable xpu support

* fix bug

* review commits

* fix style

* add xpu decorator

* refactor review commit

* fix test

* review commit

* fix test

* Update benchmark.yml (#856)

* Standardise example scripts (#842)

* Standardise example scripts

* fix plotting script

* Rename run_xxx to xxx

* Fix doc

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>

* Fix version check in import_utils.py (#853)

* dont use get_peft_model if model is already peft (#857)

* merge conflict

* add xpu decorator

* resolve

* resolves

* upstream

* refactor and precommit

* fix new tests

* add device mapping for xpu

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Adam Pauls <adpauls@gmail.com>
Co-authored-by: abhishek thakur <1183441+abhishekkrthakur@users.noreply.github.com>
2023-10-31 10:15:35 +01:00
d192244f54 Bump tyro (#928) 2023-10-30 20:48:34 -04:00
051d5a1f61 updating PPOTrainer docstring (#897)
* adding specific dict structure to tracker_kwargs doc string to enable changing tracker params like wandb experiment name for ease, avoids needing to go deep into accelerate source

* push changes

* set default dict

* refactor

* use typing extension

---------

Co-authored-by: Laura O'Mahony <lauraomahony@L-MacBook-Pro.fritz.box>
Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-10-30 13:22:53 -04:00
2068fdcd93 Generalize NEFTune for FSDP, DDP, ... (#924)
* Update sft_trainer.py

* quality
2023-10-30 11:17:14 +01:00
02f5c1d8ce fix stackllama2 sft gradient checkpointing (#906)
* fix stackllama2 sft gradient checkpointing

* stackllama2 sft use tyro as arg parser
2023-10-25 09:58:26 -04:00
7de7db6765 deactivate MacOS CI (#913) 2023-10-24 16:06:12 +02:00
4e7d5b5abe [Update reward_trainer.py] append PeftSavingCallback if callbacks is not None (#910) 2023-10-24 14:32:45 +02:00
a90e13321b Fix broken link/markdown (#903)
* Fix broken link/markdown

* attempt to fix mps issue

* attempt fix mps issue

* test

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-10-24 14:27:03 +02:00
5b2aeca6c0 [NEFTune] Make use of forward hooks instead (#889)
* make use of forward hooks

* correctly delete attributes

* address suggestions
2023-10-24 14:18:44 +02:00
1f3314fd2f Add whiten ops before compute advantages (#887)
* Add whiten ops before compute advantages

1. From LLaMA 2 paper, it says:
```
We also find it important to whiten the final linear scores (shown here by reversing the sigmoid with the logit function) in order to increase stability and balance properly with the KL penalty term (β) above.
```
2. This function is taken from [alpaca_farm](64e489c67e/src/alpaca_farm/rl/ppo_trainer.py (L86))

* Fix type def of self

---------

Co-authored-by: Lin Junpeng <linjunpeng@sensetime.com>
2023-10-23 11:32:45 -04:00
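As a rough illustration of the whitening step mentioned in the commit above, the sketch below rescales a list of scores to zero mean and unit variance. It is a standard-library approximation written for this log, not the function added in the commit (which the message says was taken from alpaca_farm); the name `whiten`, the `eps` value, and the use of population variance are assumptions.

```python
from statistics import fmean, pvariance


def whiten(values, eps=1e-8):
    """Rescale values to roughly zero mean and unit variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)  # population variance (an assumption)
    scale = (var + eps) ** -0.5       # eps guards against division by zero
    return [(v - mean) * scale for v in values]


scores = [1.0, 2.0, 3.0, 4.0]
whitened = whiten(scores)
```

Whitened scores keep their ordering but share a common scale, which is the stabilising property the quoted passage refers to.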
304ee70eef Fix couple broken links on lib homepage (#908) 2023-10-23 11:46:37 +02:00
0a5aee7d99 [reward_modeling] Cleaning example script (#882)
* remove load in repeated multiple times & truncation

* trigger CI
2023-10-19 16:00:20 +02:00
db592a2eb6 fix: remove useless token (#896) 2023-10-19 14:28:33 +02:00
122edc8f5d fix peft_config type (#883)
Co-authored-by: wanglei.w <wanglei.w@bytedance.com>
2023-10-18 23:45:38 +02:00
f91fb2bda2 remove duplicate key in reward_modeling.py (#890) 2023-10-18 23:45:18 +02:00
01e4ad0009 fix syntax error 2023-10-17 21:22:53 +02:00
1e56ff0f16 Fix security breach 2023-10-17 08:01:24 +02:00
c4ed3274be [SFTTrainer] Adds NEFTune into SFTTrainer (#871)
* v1 neftune

* docstring

* add doc + fix nit

* add more docs

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-10-17 06:58:05 +02:00
14b6bc6691 [DPO] add SLiC hinge loss to DPOTrainer (#866)
* add SLiC hinge loss

* fix links

* beta when loss is hinge is reciprocal of margin

* fix tests

* fix docs

* doc strings

* fix method name

* raise error if loss_type is not correct

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fix formatting

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-10-16 16:02:57 +02:00
eb4d2f381a set dev version (#864) 2023-10-12 15:51:54 +02:00
78e08bd658 Release: 0.7.2 (#863) 2023-10-12 15:29:10 +02:00
96d4854455 Support both old and new diffusers import path (#843)
* Update modeling_sd_base.py

* Update trl/models/modeling_sd_base.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* make precommit

* cleaner approach

* oops

* better alternative

* rm unneeded file

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
2023-10-12 15:06:09 +02:00
3ef21a24e7 [core] Fix import issues (#859)
* fix import issues

* cleaner approach
2023-10-11 19:04:49 +02:00
f7707fd4c6 dont use get_peft_model if model is already peft (#857) 2023-10-11 18:58:56 +02:00
dd9b8f4189 Fix version check in import_utils.py (#853) 2023-10-11 18:55:43 +02:00
ddd318865b Standardise example scripts (#842)
* Standardise example scripts

* fix plotting script

* Rename run_xxx to xxx

* Fix doc

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-10-11 17:28:15 +02:00
8aa12d3c95 Update benchmark.yml (#856) 2023-10-11 11:06:48 -04:00
95aea7c072 Use uniform config (#817)
* Use uniform config

* quick fix

* refactor

* update docs
2023-10-09 09:15:06 -04:00
eda1f36c57 Raise error in create_reference_model() when ZeRO-3 is enabled (#840)
* Raise error when using `create_reference_model()` with ZeRO-3

* Fix

* Refactor

* Revert

* Restore remote code

* Revert example
2023-10-09 10:49:01 +02:00
ac0d5b726d add DDPO to index (#826)
* add DDPO to index

* Update index.mdx
2023-10-06 14:42:56 +02:00
6826d592ae Clarify docstrings, help messages, assert messages in merge_peft_adapter.py (#838)
An assertion was also corrected to the intended test condition
2023-10-06 11:04:58 +02:00
c058ee6f05 [MINOR:TYPOS] Update README.md (#829) 2023-10-05 14:33:20 +02:00
fbeb146eea Set trust remote code to false by default (#833) 2023-10-04 22:53:57 +02:00
98845b9282 Fix DeepSpeed ZeRO-{1,2} for DPOTrainer (#825) 2023-10-03 09:56:00 +02:00
9f6326e65a Unify sentiment documentation (#803)
* Update documentation

* update docs

* test

* format

* Update docs/source/example_overview.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update

* add quantization dependency and update docs

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/example_overview.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/sentiment_tuning.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* update

* quick update 2

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-10-02 10:35:49 -04:00
7dcc71b1a6 Small fixes to the PPO trainer doc. (#811)
One outstanding issue is that ppo_trainer.save_model doesn't exist.
How do we actually save the model after training?
2023-10-02 11:01:05 +02:00
6b73adc900 add option for compute_metrics in DPOTrainer (#822) 2023-09-29 12:33:47 +02:00
249d3e3259 Add RMSProp back to DPO (#821)
* init

* add install instructions
2023-09-26 10:44:44 -07:00
ad8d50e30d init custom eval loop for further DPO evals (#766)
* init

* run

* Update custom eval loop to aid DPO debugging (#770)

* sample_during_eval -> generate_during_eval

* Remove unused return_tokens

* Add import utils for W&B, prevent test fails

* Optimize dataloader random batch selection

* Separate prompt and response in logs

Makes it much easier to quickly read the starts of the generations

* Simplify logging

* reset eval steps

* manual merge fixes

* revert merge

* remove self.max_length

* style

* fix max_length

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-09-26 08:09:15 -07:00
d608fea0d1 Allow passing the token_ids as instruction_template in DataCollatorForCompletionOnlyLM (#749)
* Update utils.py

* correctly assign instruction_template in DataCollatorForCompletionOnlyLM

* correctly use instruction_token_ids in DataCollatorForCompletionOnlyLM

* DataCollatorForCompletionOnlyLM: fix instruction_template / response_template type check: handle cases where instruction_template is None

* make precommit

* Test DataCollatorForCompletionOnlyLM with pre-tokenized instruction_template
2023-09-26 11:38:30 +02:00
92b03f5fdc fixes ppo trainer generate nit (#798) 2023-09-26 10:19:29 +02:00
7877e92991 Update sft_trainer.mdx (#808) 2023-09-22 17:55:54 +02:00
1d7e3c2ae2 Update sft_trainer.mdx to highlight Flash Attention features (#807)
* Update sft_trainer.mdx

* Update sft_trainer.mdx
2023-09-22 17:42:21 +02:00
eb6aa20401 clarify PEFT docs (#797) 2023-09-21 11:22:20 +02:00
b8f0c4cf12 Add deepspeed experiment (#795)
* Add deepspeed experiment

* add deepspeed pip install

* update hello world.sh

* update comments

* remove cleanup
2023-09-20 09:32:42 -04:00
e11a45c5d8 Revert "Add default Optim to DPO example (#759)" (#799)
This reverts commit d603e7c52704054a9e7f306ae63acdafaa3d179a.
2023-09-20 10:32:55 +02:00
08cfc4179b Add margin to RM training (#719)
* Start adding margin to RM training

* Fix typo and cleanup

* Fix incompatibilities when not using margin

* Format using 'make precommit'

* Add documentation and test for reward trainer

* Run 'make precommit'

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Fix missed merge conflict in reward trainer docs

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-09-20 10:18:38 +02:00
d603e7c527 Add default Optim to DPO example (#759)
* add optim

* make configurable
2023-09-19 07:56:52 -07:00
5d30cd4d30 Changed the default value of the log_with argument (#792)
This change avoids setting report_to="all" (the default behavior in
transformers v4), which could lead to unexpected error messages for
inexperienced users. Note that the default value of report_to will
change anyway to "none" in transformers v5.
2023-09-19 13:04:17 +02:00
46975236be Temp benchmark ci dir (#765)
* Support fork in benchmark CI

* use temporary dir for benchmark CI

* debug

* revert back

* dependency fix

* refactor script
2023-09-18 11:16:16 -04:00
9a8d52cc5a Fix type checking (#748) 2023-09-18 13:54:41 +02:00
0a6c42c12c Update benchmark.yml (#782) 2023-09-15 13:45:21 -04:00
221be13d26 Update benchmark.yml (#781) 2023-09-15 11:34:09 -04:00
a922af6927 Update benchmark.yml (#780) 2023-09-15 11:28:16 -04:00
42e7a0a824 Update benchmark.yml (#779) 2023-09-15 11:18:55 -04:00
15d52e759b Update benchmark.yml (#778) 2023-09-15 11:02:10 -04:00
24e914a0ab Update benchmark.yml (#777) 2023-09-15 10:57:08 -04:00
637612d95f Benchmark CI fix (#776) 2023-09-15 10:33:45 -04:00
35694baef2 Benchmark CI fix (#775) 2023-09-15 08:52:24 -04:00
d2f27df50a Update benchmark.yml (#773)
* Update benchmark.yml

* quick change
2023-09-15 09:40:20 +02:00
5cee9a0478 Support fork in benchmark CI (#764) 2023-09-14 08:44:36 -04:00
3f7710aed7 docs: add initial version of docs for PPOTrainer (#665)
* docs: add initial version of docs for `PPOTrainer`

* Apply suggestions from code review Leandro

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* updated docs based on feedback leandro
- specified reference to reward model
- added batched generator
- added line of saving model
- remove reference model

* Apply suggestions from code review

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-09-14 10:34:19 +02:00
ca0af3944d Benchmark CI (actual) (#754)
* refactor and benchmark

* update code

* Add accelerate logging

* logs

* quick fix

* update config

* precommit

* modify training example

* fix multi-gpu all_reduce error `Tensors must be CUDA and dense`

* support more models and benchmark

* update

* add changes

* upload benchmark

* precommit

* add tyro as a dependency

* add tyro

* pre-commit

* precommit

* weird...

* lol typo

* precommit

* sigh

* push changes

* Update benchmark/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Add experiments

* upload image to tag specific folder

* add openrlbenchmark documentation

* rename

* remove unused field

* precommit

* update slurm template

* add dependency

* update dependency

* ..

* .

* quick change

* push changes

* update

* update

* remove wandb tag code

* quick change

* precommit

* update test

* update dependency

* update test

* update benchmark dependency

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-09-13 13:34:00 -04:00
e4f9a483d9 Refactor and benchmark (#662)
* refactor and benchmark

* update code

* Add accelerate logging

* logs

* quick fix

* update config

* precommit

* modify training example

* fix multi-gpu all_reduce error `Tensors must be CUDA and dense`

* support more models and benchmark

* update

* add changes

* upload benchmark

* precommit

* add tyro as a dependency

* add tyro

* pre-commit

* precommit

* weird...

* lol typo

* precommit

* sigh

* push changes

* Update benchmark/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Add experiments

* upload image to tag specific folder

* add openrlbenchmark documentation

* rename

* remove unused field

* precommit

* push changes

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-09-13 10:24:18 -04:00
80890b17be [PPOTrainer] - add comment of zero masking (from second query token) (#763)
It took a while to understand why zero-masked tokens are one less than the length of query tokens. 

If I got it correctly, it is because the first logit (and state-value) from the outputs refers to the second token in the query. 

Hope this comment can be helpful to others who may encounter a similar question in the first-pass reading of the code :)
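
The off-by-one described above can be sketched in plain Python. This is an illustration only, not the trainer's code: `response_mask` is a hypothetical helper, and it assumes per-token log-probs are aligned with `input_ids[1:]`, so the sequence of scored positions is one shorter than query plus response.

```python
from typing import List


def response_mask(query_len: int, response_len: int) -> List[int]:
    """Build a 0/1 mask over shifted per-token log-probs (hypothetical helper).

    The model's first logit predicts the *second* input token, so there are
    query_len + response_len - 1 scored positions in total.
    """
    total = query_len + response_len - 1
    mask = [0] * total
    # The first response token is scored at shifted index query_len - 1,
    # which leaves exactly query_len - 1 zero-masked query positions.
    start = query_len - 1
    for i in range(start, start + response_len):
        mask[i] = 1
    return mask


mask = response_mask(query_len=4, response_len=3)
print(mask)
```

With a 4-token query, only 3 positions are zero-masked, which is the "one less than the length of query tokens" observation.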
2023-09-13 10:23:04 +02:00
cf9d2a7133 Improve benchmark ci (#760) 2023-09-13 09:29:06 +02:00
c02ce6d3f5 Extend DeepSpeed integration to ZeRO-{1,2,3} (#758)
* Generalise deepspeed

* Refactor

* Add reward model arg

* Fix pipeline tokenizer

* Fix deprecation

* Pin deepspeed lower

* Fix docs

* Revert top_k change

* Add ZeRO-3 context manager

* Revert docs change

* Fix docs

* Polish docs

* Update docs/source/customization.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-09-12 18:59:49 +02:00
9141aa42ba EOS token processing for multi-turn DPO (#741)
* init

* fix

* add doc

* style

* clarify example
2023-09-12 09:49:51 -07:00
05723c0b88 benchmark CI fix (#755) 2023-09-12 09:04:57 -04:00
b87ec2d5a0 update to prepare_model_for_kbit_training (#728)
* update to `prepare_model_for_kbit_training`

from deprecated `prepare_model_for_int8_training`
and add `use_gradient_checkpointing=args.gradient_checkpointing` to
automatically follow the gradient checkpointing choice

is also the workaround for #694

* workaround for gradient checkpointing issue

calling model.gradient_checkpointing_enable() twice causes issues
this workaround calls it in prepare_model_for_kbit_training and then
changes the arg to false to make sure it isn't called again in
huggingface trainer inner loop

also changes stack_llama_2 sft trainer to use correct device map for ddp
training so that you can test this issue
2023-09-12 10:56:10 +02:00
27df071ad8 add benchmark ci (#752) 2023-09-11 13:35:53 -04:00
67452ef213 fix import of torch_utils (#751) 2023-09-11 18:46:19 +02:00
22a90198e5 [DPO] self.accelerator._prepare_deepspeed return tuples (#745) 2023-09-08 11:50:06 +02:00
4f81e7736d Seq2Seq model support for DPO (#586)
* dpo_collator for seq2seq models

* dpo trainer support

* refactoring

* update collator

* computes decoder input ids if possible

* decoder input ids for dpo trainer

* added test for seq2seq

* quality

* fixed typo

* fixed string padding for seq2seq

* fixed minor issues in padding

* fixed typo in dpo.py

* add docstring

* run all precommit

* fixed gradient accumulation steps in test

* reformatting

* fixing dpo tests

* update .mdx
2023-09-07 18:03:10 +02:00
14292b08af fixed metrics typo (#743) 2023-09-07 18:02:20 +02:00
453c4eca14 Enable gradient checkpointing to be disabled for reward modelling (#725)
* Enable gradient checkpointing to be disabled for reward modelling

* Update examples/scripts/reward_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Tidy docs

* Remove commas

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-09-06 14:08:15 +02:00
decc832d3e Add epsilon to score normalization (#727) 2023-09-06 10:28:07 +02:00
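
Why an epsilon helps here can be shown with a minimal sketch. The formula below (mean-centering, then dividing by `std + eps`) and the name `normalize_scores` are assumptions for illustration; the commit title does not state the exact expression used in the library.

```python
import statistics


def normalize_scores(scores, eps=1e-8):
    """Standardize scores; eps keeps the division defined when std is 0 (assumed formula)."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


# A batch of identical scores has zero standard deviation; without eps this
# would raise ZeroDivisionError.
constant = normalize_scores([2.0, 2.0, 2.0])
spread = normalize_scores([1.0, 3.0])
```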
1111295776 check correctly for condition (#668) 2023-09-06 10:24:55 +02:00
c04074e248 Fix DeepSpeed ZeRO-3 in PPOTrainer (#730)
* Initialise ref model with ZeRO-3

* Fix deadlock

* Refactor & fix KL div

* Refactor

* Refactor

* Fix imports

* Add types

* Add accelerate configs

* Add more DeepSpeed configs

* Fix types

* Disable debug

* Refactor

* Add docs

* Disable eval mode for peft

* Restore eval mode

* Revert ref model prep for peft

* Update examples/scripts/README.md

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* Add docs

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
2023-09-05 11:00:49 +02:00
d484dc2a93 Refactor RewardTrainer hyperparameters into dedicated dataclass (#726)
* Refactor RewardTrainer hyperparameters into dedicated dataclass

* Revert

* Add doc string

* Fix warning

* Handle backwards compat

* Fix tests

* Add docs

* Refactor to RewardConfig

* Fix case conditions

* Fix
2023-09-05 09:05:42 +02:00
34e6948d45 [core] Bump peft to 0.4.0 (#720)
* bump peft to 0.4.0

* all of them
2023-09-01 15:01:36 +02:00
9f69f06a1c Add pyproject.toml (#690)
* example pyproject.toml

* update target to py38

* make pyproject.toml equivalent to accelerate
2023-09-01 11:42:18 +02:00
5bb46687c5 Fix: RuntimeError: 'weight' must be 2-D issue (#687)
* Update dpo_trainer.py

* Fix: self.args.deepspeed > self.is_deepspeed_enabled

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-09-01 11:27:54 +02:00
25d6700c5e fix sft mistakes (#717) 2023-08-31 16:56:29 +02:00
4d31d0c4f8 Update docs on gsm8k (#711) 2023-08-31 16:48:07 +02:00
0ff39d2a87 fix device issue (#681)
* fix device issue

* fix device issue

* fix device issue

* merge changes

* fix device issue
2023-08-31 16:37:42 +02:00
b4899b29d2 set dev version (#710) 2023-08-30 17:00:34 +02:00
6aae9e75f3 Release: VERSION (#709) 2023-08-30 12:48:10 +02:00
79b90e19ba a workaround for failing log_stats (#708) 2023-08-30 12:23:57 +02:00
7f636c9ed7 set dev version (#707) 2023-08-30 11:58:22 +02:00
98d8cc509d Release: v0.7.0 (#706) 2023-08-30 11:55:54 +02:00
9d09b3e107 TextEnvironments (#424)
* WIP skeleton

* minimal working poc

* cleanup

* rename variables

* quick typo fix

* add v1 masking (#429)

* add v1 masking

* working v1

* adapt from suggestion

* avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.`

* fix masking

- mask the responses from API call only

* quality

* address comments

* Update trl/environment/base.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* adapt a bit

* wip on tokenization/masking in textenv

* small fixes

* update viz

* add example

* print debug text and pass masks

* style

* format and move tensor to device

* update example

* update example

* This seems to work

* fix masking

* fix rich output to console

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: leandro <leandro.vonwerra@spoud.io>

* Add masking (#461)

* add v1 masking

* working v1

* adapt from suggestion

* avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.`

* fix masking

- mask the responses from API call only

* quality

* address comments

* Update trl/environment/base.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* adapt a bit

* wip on tokenization/masking in textenv

* small fixes

* update viz

* add example

* print debug text and pass masks

* style

* format and move tensor to device

* update example

* update example

* This seems to work

* fix masking

* fix rich output to console

* fix batched generation

* improve stopping criteria

* improve error handling in tool call

---------

Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Costa Huang <costa.huang@outlook.com>

* fix unknown tool

* fix rewards and increase bs

* remove unused script

* ugly WIP fix

* do not return modified obj for in-place operations

* do not return modified obj for in-place operations

* clean up stopping criterion

* push updates

* push update

* format, add docs

* rename file

* add kwargs to reward fn

* simplify example

* simplify example

* bug fix

* add a trivia example

* pre-commit

* max tool response length

* fix regex for multi-line

* refactor tool exceptions

* fix exceptions in tool

* add docs

* fix style

* make rich optional

* add docstrings

* add  tests

* add TextEnv tests (WIP)

* update triviaqa code

* update docs

* refactor text env

* update tests (WIP)

* add end2end test

* update docs

* upload tool demo

* refactor

* customizable system prompt

* add text env docs

* update index and toc

* fix `TextHistory` show methods

* add max length

* fix style

* fix typo

* refactor to kwargs in init and tasks to queries

* kwargs for reward docs

* Update examples/triviaqa.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/tool_demo.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/learning_tools.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/learning_tools.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/learning_tools.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update docs/source/text_environments.md

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/triviaqa.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/triviaqa.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* move to tool folder

* remove assets

* remove tool demo

* move rich import test to import utils

* add copyright

* fixes for masks in ppo trainer

* add text env api docs

* make precommit + add ppo test with mask

* move examples and add python

* fix style

* update triviaqa example

* add more docs

* update docs

* Update docs/source/learning_tools.mdx

* Apply suggestions from code review

* precommit

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: leandro von werra <leandro@hf.co>
2023-08-30 11:44:06 +02:00
336d63eb80 [Docs] fix example README.md (#705) 2023-08-30 11:27:50 +02:00
7fc970983c [DPO] fix DPO ref_model=None (#703)
* fix by @tannonk

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* add import

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-08-29 12:57:10 +02:00
d3bbee3ab8 set dev version (#685) 2023-08-24 11:04:07 +02:00
eb5465df7e Release: v0.6.0 (#684) 2023-08-24 10:18:46 +02:00
1c272240ac Simplify immutable TrainingArgs fix using dataclasses.replace (#682) 2023-08-24 09:50:48 +02:00
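
The `dataclasses.replace` approach named in this commit can be sketched with the standard library alone. `Args` below is a hypothetical frozen stand-in, not the real `TrainingArguments` class; the point is only that `replace` returns a modified copy instead of mutating the original.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Args:
    """Hypothetical stand-in for an immutable arguments object."""

    output_dir: str
    per_device_eval_batch_size: int = 8


args = Args(output_dir="out")
# Direct assignment (args.per_device_eval_batch_size = 16) would raise
# FrozenInstanceError; replace() builds a new instance with the override.
new_args = replace(args, per_device_eval_batch_size=16)
```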
b095245830 fix PeftConfig loading from a remote repo. (#649)
* fix PeftConfig loading from a remote repo.

* failed to catch hf_hub_download() EntryNotFoundError.

At least in huggingface-hub 0.10.1, the error for "not found" is:
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error

* pass precommit checks.

* replace some bare excepts with specific codes

* catch LocalEntryNotFoundError additionally.
2023-08-24 09:50:20 +02:00
c115453fba Update sft_llama2.py (#678)
Add argument num_workers. Fix error on line 103 when streaming is set to False
2023-08-23 16:56:31 +02:00
16f214c58d fix immutable TrainingArguments issue (#676) 2023-08-23 10:54:59 +02:00
e9a437992e propagating eval_batch_size to TrainingArguments (#675)
Co-authored-by: Rahul Jha <rahuljha@netflix.com>
2023-08-23 10:52:25 +02:00
c837fbe5b9 Fix DPO blogpost thumbnail (#673) 2023-08-22 11:53:21 +02:00
01c4a35928 Denoising Diffusion Policy Optimization (#508)
* Broken first pre-draft

* Change structure to leverage user-definition of pipeline
 - reward function, pipeline and scheduler will be left to the user to define
 - pipeline and scheduler contract interfaces are what the framework will define
 - none of this actually works

* Incremental progress: trying to get the set-up running e2e

* Incremental progress: successfully running code

* Incremental progress: running setup
Next steps: fix accelerate gradient acc assertion error when we set value > 1

* Formatting and code standards

* Incremental prog: break down code a bit
- new config flag to notify code of async reward fetching
- break off image handling code and throw it on to user to define how to handle it
- more code restructuring

* Incremental progress:
1. More code sectioning off into own methods (more for readability than anything else)

* Incremental progress:
1. clear up contracts
2. type the reward function and prompt function

* Code shuffling and expansion of tracker, accelerator config args to beyond wandb

* More small additions
Add tensorboard logging function
Remove wandb logging function for now
Consolidate the data that gets thrown to the logging function
Add README

* Formatting

* Formatting

* Remove print statement
Make tensorboard tracking the sole tracking for the training example

* 1. start of testing
2. more refactoring
3. start of docstrings
4. parameter rename

* Basic Tests
Formatting

* Docs according to the norm

* Docs, credits and rename file

* docs and corrections

* Put example config to respectable state

* Add recent run params

* Correct the name of the library

* Move requirements to EXTRAS

* - Add license banners
- Guard import of DDPO functions with if_diffusers_available
- doc strings for output types

* Add snippet to pull weights from huggingface + banner

* Test if passes on CI/CD

* Minor refactor

* Test dummy unet

* Possible fix for randomly disappearing attribute

* Shuffling arrangement in hopes of meeting memory requirements

* Proper Names

* Appease windows memory allocator issues for the cpu device

* Remove print statements

* Update docs/source/ddpo_trainer.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update docs/source/ddpo_trainer.mdx

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Add docstrings and correct url

* Spelling and grammar

* Add more documentation and commandline parsing for example script

* Markdown syntax correction

* Revert accidentally committed file and put the correct one

* More docs

* Remove subclassing and add docs for leftover subclassing

* Put back subclassing

* Reward metadata and more docs

* Remove save_load_save flag

* Grammar

* Update trl/trainer/ddpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update tests/test_ddpo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update setup.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/scripts/stable_diffusion_tuning.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Edits to the readme for DDPO

* Renamed modelling_sd_base to modeling_sd_base

* Insert try and catch for bitsandbytes import

* Change to smaller model

* Correct tolerance for floating point comparison

* Remove dummy unet and move to check is isfinite

* 1. Expand interface to ensure other Stable Diffusion pipelines could be covered
2. remove extra identification

* 1. Remove most of the asserts except for one and add value error
2. Remove default run name

* Remove progress bar

* Docs

* Put back progress bar

* 1. Revert progress bar deletion completely
2. grammar
3. relocate line

* Experiment

* Remove experiment parts and format properly

* Change formatting and edit info in docs

* Grammar

* Refactor out most of nitty gritty of loading/saving from trainer to example model
Readme addition

* Docs additions

* 1. Proper formatting for the test file
2. incorporation of pull from hub; if it fails, try local
3. doc strings for interface
4. highlight in the trainer that this is only ready for sd pipelines

* Resources for before and after

* Attempt at embedding images

* Post testing example script

* Consistent naming and document edits in light of new args

* Remove resources and add CDN links in html in doc file

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-08-21 19:24:52 +02:00
1aca98fbcf add check of arguments (#660) 2023-08-21 12:02:07 +02:00
029f961b7c Handle potentially long sequences with DataCollatorForCompletionOnlyLM (#644)
* avoid RuntimeError on long sequences

* add unittests and format

* remove dependency on external repo

* bug fix in DataCollatorForCompletionOnlyLM
2023-08-18 10:30:25 +02:00
8ec912ffa6 Add more args to SFT example (#642)
* add more args

* fix style issues
2023-08-18 10:15:43 +02:00
f360c37466 Allow for ref_model=None in DPOTrainer (#640)
* Update dpo_trainer.py

Make ref_model optional.

* add tests for ref_model=None

* better handling for ref_model=None

* Update dpo_trainer.py

Correct docstring

* move instantiation of self.ref_model closer to model

* use .disable_adapters instead of .get_base_model

* handle ref_model=None in get_batch_samples

* fix failing test in dpo_trainer due to disable_dropout_in_model

* Update trl/trainer/dpo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-08-18 10:02:16 +02:00
217313014b Update README.md (#657)
* Update README.md

fix reward modeling example

* Update README.md

more concise fix
2023-08-17 22:00:58 +02:00
b946e875b1 Resolve various typos throughout the docs (#654)
* Resolve various typos throughout the docs

I found the first few manually, and then found the rest via codespell

* HuggingFace -> Hugging Face

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-08-17 12:27:54 +02:00
6dd50b45d8 Add checks on backward batch size (#651)
* Add checks on backward batch size

* add test case

* update test case

* Update citation
2023-08-17 10:35:44 +02:00
98120d6aeb Disable dropout in DPO Training (#639)
* disable dropout in dpo

* quick fix docs

* precommit

* add disable_dropout_in_model to DPOTrainer

* disable_dropout -> disable_dropout_in_model

* .

* .
2023-08-14 14:40:45 +02:00
3b2c820db6 Add score scaling/normalization/clipping (#560)
* Add reward/score scaling/normalization/clipping

* Run pre-commit to fix styles and remove some dupe code

* Make sure score module and pretrained_model have the same dtype

* Add multi_adapter_rl_v2.py

* Add log_with

* Add more verbose help message for use_score_norm

* Fix score clipping for float16

* Minor fix
2023-08-10 10:30:56 +02:00
25fd6f2313 Move repo (#628)
* update actions

* update references
2023-08-09 17:48:25 +02:00
3f1477cdc0 Improve docs (#612)
* WIP

* improve inference docs

* improve training faq

* update toctree

* fix toctree

* fix improve blog

* improve blog

* fix customization

* reword faq a bit

* reword inference a bit

* add references back

* integrate feedback from code review

* fix link in html
2023-08-08 11:45:16 +02:00
2cff1e4385 Allow already tokenized sequences for response_template in DataCollatorForCompletionOnlyLM (#622)
* Allow tokenized ids in DataCollatorForCompletionOnlyLM. Add test and docs

* Formatting

* Documentation

* Remove unused code from test

---------

Co-authored-by: Ivan Sanchez <ivan.sanchez@zyte.com>
2023-08-08 11:33:12 +02:00
d7d7902938 use log_with argument (#620) 2023-08-08 10:13:22 +02:00
77b0cc1707 [DPO] stack-llama-2 training scripts (#611)
* initial stack-llama-2 scripts

* removed unused function

* add accelerate

* link to stack-llama-2 code

* running the model

* pre-commit fixes

* use the merge_peft script

* Add section on logged metrics
2023-08-07 14:36:16 +02:00
17f22c1c20 Add docs explaining logged metrics (#616) 2023-08-04 12:50:39 -04:00
e448bb69f0 [Modeling] Add token support for hf_hub_download (#604)
* add token support for hf_hub_download

* allow to pass it to from_pretrained
2023-08-03 12:49:31 +02:00
9aa4e3ce2b set dev version (#608) 2023-08-02 10:43:27 +02:00
ca8a508913 Release: 0.5.0 (#607) 2023-08-02 10:31:43 +02:00
a00ab445ba refactor grad accum (#546)
* refactor grad accum

* quick fix

* use correct place to step optim

* push changes

* cleanup and fix division by zero in `masked_var`

* revert back changes

* use unbiased var

* deal with division by zero

* add test case

* calculate advantage only once

* format

* add warning

* add more warnings

* quick fix

* remove unhelpful warning

* fix test cases

* fix test cases

* bump version given the breaking change

* black

* refactor

* update test cases

* error out

* push changes

* remove exact div

* add comments
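
The `masked_var` items in the list above (division by zero, unbiased variance) can be illustrated with a small pure-Python sketch. It is not the library's implementation: the function names mirror the commit wording, while the error messages and the choice to raise rather than warn are assumptions.

```python
def masked_mean(values, mask):
    """Mean over the positions where mask is 1."""
    n = sum(mask)
    if n == 0:
        raise ValueError("mask selects no elements; mean is undefined")
    return sum(v * m for v, m in zip(values, mask)) / n


def masked_var(values, mask, unbiased=True):
    """Variance over masked positions, optionally with Bessel's correction."""
    mean = masked_mean(values, mask)
    n = sum(mask)
    var = sum(m * (v - mean) ** 2 for v, m in zip(values, mask)) / n
    if unbiased:
        if n < 2:
            # n / (n - 1) would divide by zero for a single selected element.
            raise ValueError("unbiased variance needs at least two masked elements")
        var *= n / (n - 1)
    return var


# The last value is masked out, so it does not affect the result.
var = masked_var([1.0, 2.0, 3.0, 100.0], [1, 1, 1, 0])
```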
2023-08-01 09:00:41 -04:00
431f0c9a2f Fix comparison in DataCollatorForCompletionOnlyLM (#588) (#594)
* Add unit test to DataCollatorForCompletionOnlyLM to reproduce the bug.

* Change comparison target from examples[i][input_ids] to batch[labels][i] in DataCollatorForCompletionOnlyLM
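
A rough sketch of the kind of search this fix concerns: locate the response template inside one row of labels and ignore everything up to and including it. `mask_prompt` is a hypothetical helper, and masking the whole row when the template is missing is an assumption, not confirmed behavior of `DataCollatorForCompletionOnlyLM`.

```python
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy loss


def mask_prompt(labels, response_ids):
    """Return labels with the prompt and response template replaced by IGNORE_INDEX."""
    n = len(response_ids)
    for start in range(len(labels) - n + 1):
        # Compare against the label row itself, the target this commit switches to.
        if labels[start:start + n] == response_ids:
            end = start + n
            return [IGNORE_INDEX] * end + labels[end:]
    # Assumption: if the template is not found, nothing contributes to the loss.
    return [IGNORE_INDEX] * len(labels)


masked = mask_prompt([5, 6, 7, 8, 9, 10], response_ids=[7, 8])
```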
2023-07-31 14:13:35 +02:00
64bc9bc9e6 docs: Replace SFTTrainer with RewardTrainer in comment (#589)
Likely just a copy-paste error
2023-07-28 15:37:25 +02:00
5a1e1bf06e Introducing DataCollatorForChatCompletionOnlyLM (#456)
* added DataCollatorForChatCompletionOnlyLM

* added simple test

* merged the two collators and fixed ### in completion

* fix response template

* fixing ordering in test

* quality

* fixed minor comments & make doc

* chat test back

* Update tests/test_sft_trainer.py

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-07-28 14:17:03 +02:00
e8dd8102d8 Update the example sft_trainer.py (#587)
Added saving of the model, because by default only checkpoints are saved, not the final version.
2023-07-28 13:50:41 +02:00
1b46c61d43 [PPO] fix corner cases with PPO batch size and forward_batch_size (#563)
* fix corner cases PPO

* forward contrib credits from initial contribution

* forward contrib credits from initial discussions

---------

Co-authored-by: 1485840691-eng <1485840691-eng@users.noreply.github.com>
Co-authored-by: shubhlohiya <shubhlohiya@users.noreply.github.com>
2023-07-28 11:05:34 +02:00
3b0a1b5f8c Add missing max_seq_length arg to example sft_trainer.py (#585) 2023-07-27 18:17:43 +02:00
31658b4263 Computes the KL penalty using the entire distribution (#541)
* adds full log probs

* Adds tests, comments

* precommit

* bug all -> full

* adds option description to sentiment analysis script, fixes a few bugs
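
The difference between a per-token KL estimate and a KL computed over the entire distribution can be sketched as follows. This is a standard-library illustration under assumed names (`sampled_kl`, `full_kl`), not the trainer's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sampled_kl(logp, logq, token):
    """Single-sample estimate: log-ratio at the sampled token (can be negative)."""
    return logp[token] - logq[token]


def full_kl(logp, logq):
    """KL(p || q) summed over the whole vocabulary (always non-negative)."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


logp = log_softmax([2.0, 0.5, -1.0])  # policy
logq = log_softmax([1.0, 1.0, 1.0])   # reference (uniform)
estimate = sampled_kl(logp, logq, token=2)
exact = full_kl(logp, logq)
```

In this toy case the sampled estimate is negative for the unlikely token, while the full-distribution value stays non-negative.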
2023-07-27 12:08:24 +02:00
f7227fb296 Fix model output dim in reward trainer example (#566)
* correct glitches in reward modelling

* add the eval_split option

* correct code format
2023-07-26 11:02:23 +02:00
b3c2e73e70 [DPO] Resolve logging for DPOTrainer (#570)
* Resolve logging for DPOTrainer

* Ensure the WandB logger correctly prefixes all logs

* Run pre-commit

Whoops, hadn't run `pre-commit install` yet
2023-07-26 08:06:25 +02:00
d78d917880 Add comment to explain how the sentiment pipeline is used to run the … (#555)
* Add comment to explain how the sentiment pipeline is used to run the reward model in the StackLLaMA example

* Apply 'make precommit'
2023-07-24 18:09:45 +02:00
cdde7f71d7 Add DataCollatorForCompletionOnlyLM in the docs (#565)
* add `DataCollatorForCompletionOnlyLM` in the docs

* nit
2023-07-24 16:47:41 +02:00
51d5f08d88 add epochs and num steps on CLI (#562) 2023-07-24 14:01:54 +02:00
8762507d3a Minor typo and whitespace fixes (#559)
* [docs] remove extra whitespace

* [examples] fix help for dataset_name
2023-07-24 13:56:55 +02:00
1bd852aa8f remove unused batch_size arg (#554) 2023-07-24 13:23:33 +02:00
170d58ffce [SFTTrainer] Add warning for wrong padding_side (#550)
* add warning for wrong padding_side

* add warning

* revert

* oops
2023-07-22 10:53:16 +02:00
84c9209037 ADD: num_proc to SFTTrainer (#547)
* ADD: num_proc to SFTTrainer

* make precommit

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/sft_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* add batch_size

* Update trl/trainer/sft_trainer.py

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-07-20 15:41:48 +02:00
d0fe348a0a Add use_auth_token arg to sft_trainer example (#544)
* Add use_auth_token arg to sft_trainer example

* Update examples/scripts/sft_trainer.py

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-07-19 21:12:18 +02:00
5857d0acc6 [examples] make the sft script more modulable (#543)
* make the script more modulable

* docs + some changes
2023-07-19 18:13:55 +02:00
fd50e063e1 [DPO] remove response/pairs from the DPO side (#540)
* remove response/pairs from the DPO side

* Simplify get_hh helper function

* removed unused import

* update tests and docs for dpo_trainer

---------

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Shoaib Burq <saburq@gmail.com>
2023-07-19 17:36:24 +02:00
bcff7c2dab Relax reward trainer constraint (#539)
* relax reward trainer constraint

* Update trl/trainer/reward_trainer.py

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

* relax also for DPO

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-07-19 14:12:23 +02:00
0e8d9f8504 fix offline case (#538) 2023-07-19 12:16:13 +02:00
7f297b38c6 all the concated batches are on same device (#528) 2023-07-18 13:21:17 +02:00
84393f3b94 DPO Trainer (#416)
* initial DPO Trainer

* typo

* initial dpo from reward trainer

* calc. log_probs from logits

* remove dpo config for now

* fix inits

* add initial DPODataCollatorWithPadding

* use the RewardDataCollatorWithPadding

* initial test

* means of loss

* add assert

* just call the train instead of step

* functional debug example before refactor

* check the params have changed

* initial DPODataCollatorWithPadding

* Data collator with masking

* going through trainer.accelerate to wrap ref_model

* style / imports

* style / imports

* `broadcast_buffers=False` fix to distributed training

* better fix for DDP issues

* arguments and style clean-up

* better doc, some light refactoring

* better imports

* initial dpo doc

* fix test

* fix formatting

* fix

* called models once

* fix tests

* add example

* fix doc string

* initial example with anthropic hh dataset

* refactored dpo trainer

* revert

* return metrics

* fixed tests

* updated docs

* update test

* fixed typo

* note about the beta

* added dpo authors

* fix docstrings

* add prediction_step

* remove compute_metrics and log metrics manually

* fix typo

* add DPOTrainer doc

* add dpo to toc

* ValueError

* add to index and example

* fix docs

* fix assert

---------

Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
Co-authored-by: Gaetan LOPEZ <gaetanloplat@gmail.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
2023-07-17 14:52:14 +02:00
388bdc03ac Fix sentiment nit (#517) 2023-07-14 14:11:24 +02:00
5c7bfbc8d9 [examples] Big refactor of examples and documentation (#509)
* added sfttrainer and rmtrainer example scripts.

* added few lines in the documentation.

* moved notebooks.

* delete `examples/summarization`

* remove from docs as well

* refactor sentiment tuning

* more refactoring.

* updated docs for multi-adapter RL.

* add research projects folder

* more refactor

* refactor docs.

* refactor structure

* add correct scripts all over the place

* final touches

* final touches

* updated documentation from feedback.
2023-07-14 12:00:56 +02:00
36b77ae81d Use local process index for _get_current_device() (#515)
This PR fixes a bug in `_get_current_device()` where the global process index was being returned instead of the local one. 

With this fix, it is possible to run training in **multi-node** environments and avoid the dreaded `RuntimeError: CUDA error: invalid device ordinal` :)
2023-07-14 10:53:33 +02:00
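
The multi-node failure described in #515 can be illustrated with a small sketch. It assumes a homogeneous cluster where every node runs the same number of processes and global indices are assigned contiguously per node. The function names and the `gpus_per_node` parameter are hypothetical; this is not TRL's `_get_current_device()` implementation.

```python
def local_index(global_index: int, gpus_per_node: int) -> int:
    """Map a global process index to the index of that process on its own node.

    Assumption: ranks are laid out contiguously, node after node.
    """
    return global_index % gpus_per_node


def is_valid_ordinal(device_index: int, gpus_per_node: int) -> bool:
    """A device ordinal is only usable if it is smaller than the node's GPU count."""
    return 0 <= device_index < gpus_per_node


# Two nodes with 4 GPUs each: global indices run from 0 to 7.
gpus_per_node = 4
for global_index in range(8):
    local = local_index(global_index, gpus_per_node)
    print(
        global_index,
        local,
        is_valid_ordinal(global_index, gpus_per_node),  # False on the second node
        is_valid_ordinal(local, gpus_per_node),         # always True
    )
```

Using the global index as a device ordinal works on the first node only; from the second node onward it points past the GPUs that actually exist there, which is consistent with the `invalid device ordinal` error quoted above.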
2049d03e82 Put labels tensors onto GPU to fix eval bug on deepspeed (#513) 2023-07-13 11:51:21 +02:00
31b98aa5a6 set dev version 2023-07-13 08:28:52 +00:00
d06b131097 git commit -m 'Release: v0.4.7' 2023-07-13 08:17:49 +00:00
f3230902b1 [SFTTrainer] Fix the sequence length check of SFTTrainer (#512)
* fix the sequence length check of `SFTTrainer`

* forward contrib credits from initial contribution

* forward contrib credits from initial contribution

* final comments

---------

Co-authored-by: mrm8488 <mrm8488@users.noreply.github.com>
Co-authored-by: BramVanroy <BramVanroy@users.noreply.github.com>
2023-07-12 15:25:17 +02:00
bbc7eeb29c [PPOTrainer] Add prompt tuning support on TRL (#500)
* add prompt tuning support on TRL

* fix CI

* revert + add docs
2023-07-06 15:16:37 +02:00
163dae5579 [PPOTrainer] Add prefix tuning support (#501)
* add prefix tuning support

* fix CI

* better check
2023-07-06 14:56:05 +02:00
64c8db2f9a Update ppo_trainer.py (#499) 2023-07-06 10:32:19 +02:00
25d4d81801 Disable mlm by default in DataCollatorForCompletionOnlyLM, add ignore_index and docstring (#476)
* add docstring and ignore index

* hard-code mlm=False

* make precommit

* FIX: re-add mlm parameter

---------

Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be>
2023-07-06 10:22:40 +02:00
685620ac6c correctly implement gradient checkpointing (#479)
switch to new peft api
add max_length to RewardTrainer
2023-07-06 09:26:13 +02:00
2b531b9223 Adds some options to stabilize the KL penalty (#486)
* adds options for the kl penalty

* style

* adds kl penalty to trl sentiment example args

* ppo_config -> config

* fix tests (equal -> allclose)

* style

* add a random seed option

* updates kl penalty description

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-07-05 11:23:10 +02:00
4f7f73dd09 Remove padding in batched generation. (#487)
* fix padding

* Update examples/sentiment/scripts/gpt2-sentiment.py

* fix style

---------

Co-authored-by: leandro von werra <leandro@hf.co>
2023-07-05 10:41:06 +02:00
c60c41688e FIX: contributing guidelines command (#493)
* FIX: contributing guidelines command

* Update CONTRIBUTING.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update CONTRIBUTING.md

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-07-04 14:27:52 +02:00
cbb98dabb1 fix typo in reward_modeling.py (#494) 2023-07-04 14:17:32 +02:00
a86eaab8e8 add ratio threshold to avoid spikes (#488) 2023-07-04 10:09:53 +02:00
aa9770c6bd Refactor README (#460)
* v1

* update

* link

* nits
2023-07-03 14:30:15 +02:00
0fe603eca1 Update sft_trainer.py (#474)
* Update sft_trainer.py

Allows the user to give their own peft model arg. https://github.com/lvwerra/trl/issues/473

* cleaner
2023-06-28 00:44:15 +02:00
843c14574f fix CI RM (#468) 2023-06-26 14:30:06 +02:00
009b82412f Debug the tortuous logic in _prepare_dataset function (#464)
* Debug the tortuous logic in `_prepare_dataset` function

There are two issues with the previous `_prepare_dataset` function.

1. Tortuous and burdensome logic: the `is_already_dataset` variable is confusing and unhelpful, so it is removed.
2. The comments and the logic do not match.

For instance, in the previous version the comments said "check if torch dataset ... and do nothing". However, when `dataset` is a `torch.utils.data.Dataset` and `packing = True`, it still goes into the `_prepare_non_packed_dataloader(...)` function call.

The corrected version does nothing if the dataset is already a torch dataloader/dataset/ConstantLengthDataset.

* Lint: sft_trainer.py

* Lint empty line
2023-06-24 08:43:03 +02:00
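
The early-return behaviour described in #464 can be sketched as follows. To keep the example dependency-free, the three classes are hypothetical stand-ins for the torch types named in the commit message, and `prepare_dataset`, `pack` and `tokenize` are illustrative names rather than the actual `SFTTrainer` methods.

```python
class TorchDataset:
    """Hypothetical stand-in for torch.utils.data.Dataset."""


class TorchDataLoader:
    """Hypothetical stand-in for torch.utils.data.DataLoader."""


class ConstantLengthDataset(TorchDataset):
    """Hypothetical stand-in for a packed, constant-length dataset."""


ALREADY_PREPARED = (TorchDataset, TorchDataLoader, ConstantLengthDataset)


def prepare_dataset(dataset, packing, pack, tokenize):
    """Return prepared inputs unchanged; otherwise pack or tokenize the raw data."""
    # The check comes first, so `packing` has no effect on prepared inputs.
    if isinstance(dataset, ALREADY_PREPARED):
        return dataset
    return pack(dataset) if packing else tokenize(dataset)


def pack(rows):
    """Toy packing step: join all rows into a single sequence."""
    return [" ".join(rows)]


def tokenize(rows):
    """Toy non-packed step: split each row into tokens."""
    return [row.split() for row in rows]


ready = ConstantLengthDataset()
print(prepare_dataset(ready, True, pack, tokenize) is ready)
print(prepare_dataset(["a b", "c d"], True, pack, tokenize))
print(prepare_dataset(["a b", "c d"], False, pack, tokenize))
```

Putting the type check ahead of the `packing` branch removes the need for a separate flag such as `is_already_dataset`.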
82c8f20601 Pre-commit (#448)
* Pre-commit

* modify CI

* modify make file

* temporarily disable codespell

* update make file

* update contribution guide

* push changes
2023-06-23 11:37:18 -04:00
b56e8b3277 Improve stabiliy: change default hyperparamers 2023-06-23 09:04:24 -04:00
0161a8e602 added shuffle parameter. I found it useful to turn off shuffle here and shuffle independently of this. (#457) 2023-06-23 11:47:08 +02:00
6e34c5932b set dev version 2023-06-23 09:20:25 +00:00
e1531aa526 Release: v0.4.6 2023-06-23 09:17:31 +00:00
cb6c45474a fix google colab issue (#459) 2023-06-23 11:13:36 +02:00
fe55b440e7 set dev version 2023-06-23 08:42:20 +00:00
431456732c Release: 0.4.5 2023-06-23 08:13:50 +00:00
9679d87012 Multi adapter RL (MARL) - a single model for RM & Value Head (#373)
* fix doc

* adapt from suggestions

* working v1 multiple adapters

* style

* style && quality

* oops

* docs

* add tests and docs

* add RM script

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/0_abstraction_rl.mdx

* Apply suggestions from code review

* Update docs/source/0_abstraction_rl.mdx

* add 4bit

* replace with `reward_adapter`

* explain break

* simple comment

* fix llama tokenizer

* fixes

* fixes

* rename

* quality

* rm unneeded file

* add disclaimer

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-06-22 11:19:45 +02:00
099f0bf42b Add accelerate project_config passthrough (#437) 2023-06-22 10:16:34 +02:00
33f88ead0b [ConstantLengthDataset] Fix packed dataset issue (#452)
* fix packed dataset issue

* Apply suggestions from code review

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* address

* more docs

* trigger CI

* fix failing CI

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-06-22 10:12:55 +02:00
7705daa672 [SFTTrainer] Introducing DataCollatorForCompletionOnlyLM (#445)
* v1 of alpaca datacollator

* make sure to match the response tokens

* add test

* add it in main init

* add check

* adapt test

---------

Co-authored-by: Costa Huang <costa.huang@outlook.com>
2023-06-20 17:51:23 +02:00
fe49697e66 add stale bot (#447) 2023-06-19 17:26:17 +02:00
d1ad5405cb [SFTTrainer] Fix non packed dataset (#444)
* fix non packed dataset

* fixing tests and documentation

* Update docs/source/sft_trainer.mdx
2023-06-16 18:51:20 +02:00
1e88b84ab9 fix packing issue (#442) 2023-06-16 13:55:47 +02:00
c39207460f Drop support for Python 3.7 (#441)
* drop support for Python 3.7

* adapt
2023-06-16 13:30:01 +02:00
61af5f26b6 Fix correct gradient accumulation (#407)
* add correct grad acc

* add some tests but they fail

* test should pass

* style

* fix
2023-06-14 08:43:35 -04:00
7a89a43c3f handle the offline case (#431)
* handle the offline case

* adds warning
2023-06-13 15:36:12 +02:00
fead2c8c77 best-of-n sampler class (#375)
* First draft of best-of-n sampler class

* Formatting

* Add best-of-n class to init

* Rearrange files

* Correction

* Make sure input query is in shape

* check for numpy.ndarray type

* Fix for shapes and types AND linter fixes

* Make reward pipeline a callback for more broader application

* Documentation for best-of-n sampler class usage

* Docs update for best-of-n class

* Doc fixes for best-of-n sampler class

* Remove colon from new addition

* Change user callback output type and associated side-effects of said change

* Relocate param because of collision

* Documentation update

* Make input param keyword easier to grasp

* Remove comments and add docstrings

* Tests and fixes for best_of_n sampler class

* Change input arg name

* Formatting

* Removed unnecessary cloning
2023-06-13 10:25:21 +02:00
b4bb12992e Update test_reward_trainer.py (#421) 2023-06-09 15:52:41 +02:00
b21baddc5c [doc build] Use secrets (#420) 2023-06-09 15:52:10 +02:00
216c119fa9 Enable autotag feature w/ wandb (#411)
* Enable autotag feature

* use `logging.info`

* Update trl/trainer/ppo_config.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update trl/trainer/ppo_config.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-06-09 11:20:18 +02:00
a2747acc0f Add slurm utility (#412)
* Add slurm utility

* move files
2023-06-09 11:04:43 +02:00
b61a4b95a0 set dev version 2023-06-08 14:28:37 +00:00
5c5d7687d8 Release: v0.4.4 2023-06-08 14:26:14 +00:00
096f5e9da5 unpin accelerate (#418) 2023-06-08 16:25:03 +02:00
2a0ed3a596 set dev version 2023-06-08 08:55:33 +00:00
ff13c5bc6d Release: v0.4.3 2023-06-08 08:52:04 +00:00
d3e05d6490 Update setup.py (#414) 2023-06-08 10:49:03 +02:00
fadffc22bc Update test_reward_trainer.py (#410) 2023-06-07 12:22:22 +02:00
d405c87068 set dev version 2023-06-07 10:22:06 +00:00
b46716c4f5 Release: v0.4.2 2023-06-07 09:43:23 +00:00
ec8a5b7679 Remove unused imports in docs. (#406)
* remove unused var

* bug fix

* update docs, add e2e CI

* black

* isort

* CI
2023-06-06 18:06:49 +02:00
376d152d3f Resolve broken evaluation/prediction for RewardTrainer (#404)
* Implement evaluation/prediction for RewardTrainer

* Stick with unittest assertions

* Perform prediction forward calls without gradient

* Remove Literal to preserve Python 3.7 support

I recognize that I can also import from typing_extensions with a try-except,
but that is a bit overkill for this I feel.

* Remove eval_steps=1 to prevent flaky test on CI

The flaky test is caused by a division by zero when dividing by the runtime.
This is done on the transformers side, so it's not a TRL issue.
In practice, this won't happen - it only happens because both the model
and dataset are tiny.
2023-06-06 16:49:30 +02:00
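
The division by zero mentioned in #404 can be reproduced in isolation. This is a minimal sketch of a throughput metric, assuming the measured runtime can come out as exactly `0.0` for a tiny model and dataset; `samples_per_second` and `describe` are hypothetical helpers, not the transformers code.

```python
def samples_per_second(num_samples: int, runtime: float) -> float:
    """Throughput metric: samples processed per second of wall-clock time."""
    return num_samples / runtime


def describe(num_samples: int, runtime: float) -> str:
    """Report the metric, or the failure when the measured runtime is zero."""
    try:
        return f"{samples_per_second(num_samples, runtime):.1f} samples/s"
    except ZeroDivisionError:
        return "division by zero"


print(describe(1000, 2.5))  # a realistic run
print(describe(4, 0.0))     # a tiny run whose measured runtime is zero
```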
ef57cddbc3 StackLLaMA: fix supervised finetuning and reward model training (#399)
* better reward modelling

tokenizer can be separately specified from model
removed old llama tokenizer hacks
evaluate after first step option to make nicer graphs
black + isort

* removed tokenizer hacks from supervised ft

* black and flake8
2023-06-06 10:41:07 +02:00
20111ad03a Fixed some type annotations of trl.trainer.PPoTrainer (#392)
* Fixed some type annotations of trl.trainer.PPoTrainer

- Ref model should be Optional
- The usual annotation for the Huggingface tokenizers is PreTrainedTokenizerBase. Not using that messes up people's annotation checks.
- Fixed the comments wrt the other two points

* fix quality and style

* synced & requality & restyled
2023-06-06 10:32:37 +02:00
a4793c2ede StackLlama: fixed RL training and added args (#400)
* fixed rl training args

added steps argument and break to respect max training epochs
added more PPOConfig args to script args
removed llama tokenizer hacks
removed extra args in dataset
changed to llamatokenizer from autotokenizer
black + isort

* black and flake8

* style, quality, and switch back to AutoTokenizer
2023-06-05 10:30:20 +02:00
0ddf9f657f StackLLaMA: correctly merge peft model (#398)
* correctly merge stackllama models

correctly merge weights with peft's merge_and_unload
load sequence classification model for reward models

* style, black line length 119

* flake8
2023-06-05 10:25:53 +02:00
3138ef6f5a fix 4 bit SFT (#396) 2023-06-02 10:49:41 +02:00
a5b0414f63 keep state_dict kwargs instead of popping it in save_pretrained (#393) 2023-05-31 10:56:45 +02:00
e174bd50a5 from_pretrain with peft adapter on the hub (# 379) (#380)
* from_pretrain with peft adapter on the hub (# 379)

* Update the comment

* PR comment
2023-05-31 10:38:25 +02:00
86c117404c fix typo in ppo_trainer.py (#389)
`dataloader must be a torch.utils.data.Dataset`: `dataloader` should be `dataset`
2023-05-30 15:23:02 +02:00
a94761a02c Update customization.mdx (#390) 2023-05-30 15:22:41 +02:00
5fb5af7c34 [core] Add 4bit QLora (#383)
* add 4bit

* style
2023-05-24 13:52:38 +02:00
25fa1bd880 fix warning issue (#377) 2023-05-18 08:43:44 +02:00
6916e0d2df [docs] fix SFT doc (#367)
* fix doc

* adapt from suggestions
2023-05-15 16:26:27 +02:00
1704a864e7 Delete test_training.py (#371) 2023-05-15 16:21:28 +02:00
e547c392f9 Remove obsolete layer_norm_names parameter and add peft>=0.3.0 to requirements (#366)
* remove obsolete layer_norm_names parameter

* remove obsolete parameter layer_norm_names and add peft>=0.3.0 to requirements

* make style - oops

* typo
2023-05-15 16:08:11 +02:00
a31bad83fb add is_trainable in kwargs (#363)
Add is_trainable in kwargs to enable continue training of peft model.
2023-05-15 16:08:00 +02:00
31cc361d17 Fix bug when loading local peft model (#342)
* Fix bug when loading local peft model 

Fix bug in https://github.com/lvwerra/trl/issues/341

* Fix loading bug when load lora mode

Fix loading bug when load lora model but not resuming training

1. Implement the fix logic described in https://github.com/lvwerra/trl/pull/342#pullrequestreview-1422298054

2. Set peft lora weight to trainable.

* Remove is_trainable

Leave is_trainable to future PR.

* add test_load_pretrained_peft

Check that the model saved with peft class interface can be loaded properly.
2023-05-11 23:07:50 +02:00
ab453ec183 140/best n sampling (#326)
* Create best_of_n.ipynb

* First draft

* Refactor as ref vs ppo vs non-ppo

* Changed notebook location and added README to explain motivation

* 1. Spelling and formatting refactor
2. Minor refactor of notebook

* Formatting of notebook
2023-05-11 17:56:12 +02:00
933c91cc66 fix tensorboard issue (#330) 2023-05-11 17:45:59 +02:00
ffad0a19d0 relax negative KL constraint (#352) 2023-05-11 17:45:47 +02:00
e0172fc8ec add parameter to control max_length (to mitigate OOM errors) (#359) 2023-05-11 15:28:32 +02:00
dec9993129 stack_llama: update instructions in README, fix broken _get_submodules and save tokenizer (#358)
* update instructions in README and fix broken _get_submodules

* save tokenizer

* add note about peft>=0.3.0
2023-05-11 12:29:02 +02:00
c85cdbdbd0 Fix argument's description (#339) 2023-05-04 14:29:07 +02:00
e59cce9f81 fix sft issues (#336) 2023-05-03 12:53:32 +02:00
c60fd915c1 [core] officially support SFT (Supervised Finetuning) (#323)
* add v1

* revert

* correct filename

* add tests and final tweaks

* fix tests

* adapt from offline suggestions

* Update trl/trainer/sft_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fixes

* remove warning

* multiple fixes

* fixes

* fix

* final fixes

* final fix

* more clarification

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* add test

* add arg

* add callback instructions

* add formatting_prompts_func

* try docs

* add CLD

* fix docstrings

* format

* Update docs/source/sft_trainer.mdx

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* remove `prepare_in_int8_kwargs`

* change `return_overflowing_tokens`

* add warnings

* address comments

* revert pretrained kwargs

* quality

* fix sft script

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-05-03 10:42:01 +02:00
08f550674c added doc for using torch.distributed.launch/run (#324)
* added doc for using torch.distributed.launch/run

* Update docs/source/customization.mdx

---------

Co-authored-by: Afshin Oroojlooyjadid <afshin.oroojlooyjadid@oracle.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-04-28 16:18:07 +02:00
52fecee883 Give a key to the wandb PPOConfig config entry (#315)
* Give a key to the wandb PPOConfig config entry

There is a lot of stuff with very generic keys in the `PPOConfig` dict, and the user may have logged a `wandb` config dict elsewhere.
I know I had that problem. To counter that, I pass the PPOConfig dict in a dict under the key `trl_ppo_trainer_config`, to prevent collisions & be very clear.

* did black --line-length 119 --target-version py38 examples tests trl
isort examples tests trl and black --check --line-length 119 --target-version py38 examples tests trl
isort --check-only examples tests trl
flake8 examples tests trl
2023-04-26 22:14:55 +02:00
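
The key collision that #315 avoids can be shown with plain dictionaries. The config values below are made up for illustration; only the `trl_ppo_trainer_config` key comes from the commit message, and no wandb call is made.

```python
# Config the user has already logged elsewhere (hypothetical values).
user_config = {"learning_rate": 3e-4, "seed": 0}

# Trainer config with equally generic key names (hypothetical values).
ppo_config = {"learning_rate": 1.41e-5, "seed": 42, "batch_size": 256}

# Flat merge: the trainer's values silently overwrite the user's.
flat = {**user_config, **ppo_config}

# Namespaced merge: the trainer config lives under its own key.
namespaced = {**user_config, "trl_ppo_trainer_config": ppo_config}

print(flat["learning_rate"])
print(namespaced["learning_rate"])
print(namespaced["trl_ppo_trainer_config"]["learning_rate"])
```

With the nested form both learning rates stay visible, and it is clear which one belongs to the PPO trainer.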
3cfe194e34 [core] Officially Support Reward Modeling (#303)
* v1

- add working version
- add all possible tests
- add docs

* add some contents

* clean up

* fixes

* patch test for now

* fix test

* clean up

* fix

* this time fix

* Update docs/source/trainer.mdx

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* fixe

* update

* final changes

* oops

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update docs/source/reward_trainer.mdx

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* switch to chosen / rejected

* fixes

* add example

* add accuracy metric

* pass PEFT config

* refactor compute metrics

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2023-04-26 11:51:56 +02:00
ad325152cc add details on multi-GPU / multi-node (#320) 2023-04-26 11:12:15 +02:00
1f29725381 fix broken tests (#318) 2023-04-25 13:57:40 +02:00
23a06c94b8 fix DS for peft ref_model in ppo trainer (#309)
The peft ref_model is obtained by calling the `disable_adapter` method, e.g.:
```
with self.accelerator.unwrap_model(self.model).pretrained_model.disable_adapter():
    ref_logprobs, _, _, _ = self.batched_forward_pass(self.model, queries, responses, model_inputs)
```
2023-04-25 12:52:49 +02:00
5c24d5bb2e fixed typo (#312) 2023-04-25 11:38:28 +02:00
503ac5d82c clean examples folder (#294)
* clean examples folder

* Update examples/toxicity/README.md
2023-04-25 11:33:54 +02:00
ce37eadcfa Log Token distribution of Query / Response (#295)
* reset git

* move to log_step_stats, make optional

* fix stack

* reset script

* fix types

* always log, add dist
2023-04-17 17:49:14 +02:00
160d0c9d6c [t5] Fix negative kl issue (#262)
* fix negative kl issue

* fix

* make style
2023-04-14 11:50:17 +02:00
d1c7529328 Fix arguments description (#298)
* Fix arguments description

* fix-argument-description

* Fix-argument-description
2023-04-12 16:00:42 +02:00
fc468e0f35 Small improvements / fixes to toxicity example (#266)
* fixes during debugging

* Update examples/toxicity/scripts/gpt-j-6b-toxicity.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-04-10 14:24:06 -07:00
131e5cdd10 add functionality to push best models to the hub during training (#275)
* add functionality to push best models to the hub during training

* fix indentation

* Update tests/test_ppo_trainer.py

Co-authored-by: Nathan Lambert <nathan@huggingface.co>

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Nathan Lambert <nathan@huggingface.co>

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* fix style

---------

Co-authored-by: Nathan Lambert <nathan@huggingface.co>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-04-10 11:32:53 -07:00
bb4a9800fa fix typo in gpt2-sentiment.ipynb (#293)
inital -> initial
2023-04-10 20:08:55 +02:00
3804a72e6c Fix swapped helper texts (#284) 2023-04-10 10:23:37 +02:00
a004b02c4a Add LLaMA tutorial to docs (#278)
* docs docs docs

* add truncated blog to docs
2023-04-07 08:16:42 -07:00
8b234479bc fix doc string problem in ppo trainer loss function (#279)
* fix a loss function docstring problem

`hidden_dim` should be `response_length`

* Update ppo_trainer.py
2023-04-07 10:22:02 +02:00
cf20878113 Adding pointer back to Meta's LLaMA. (#277) 2023-04-06 14:04:12 -07:00
d8ae4d08c6 stack-llama (#273)
* adds the main scripts

* adds non-score reward clamping

* Adds adapter merge script.

* style

* adds non_reward clamp option to config

* reverts kl clamping

* style

* makes model name required for adapter merge

* updates merge adapter so it does not refer to HF internal llama checkpoints

* renames to stack_llama, adds clearer instructions

* updates readme, adds ds config

* Update examples/stack_llama/scripts/rl_finetuning_peft.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/stack_llama/scripts/rl_finetuning_peft.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* removes ds config, renamed scripts

* style

* updates launch commands

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-04-05 17:11:43 +02:00
a2749d9e0c Use active model to generate response in example on README (#269) (#271)
Co-authored-by: rmilleti <rmilleti@amazon.com>
2023-04-03 15:36:46 +02:00
ed87942a47 Add LlaMa in tests + create_reference_model (#261)
* add LlaMa in tests

* Update tests/test_modeling_value_head.py

* add warning message

---------

Co-authored-by: Nathan Lambert <nathan@huggingface.co>
2023-03-30 10:49:46 +02:00
734624274d [core] Fix ds issue (#260)
* fix ds issue

* more comments
2023-03-29 14:20:27 +02:00
237eb9c6a5 [distributed] Fix early stopping and DP (#254)
* fix ES DP

* fix coef

* wrap in a private method

* fix value

* fix trainer logic
2023-03-28 14:31:16 +02:00
2672a942a6 [core] Fix DeepSpeed zero-3 issue (#182)
* fix zero-3 issue

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>

* adapt

* make style

* fix

* add docs

* fix

---------

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
2023-03-28 13:43:52 +02:00
b5cce0d13e Using batched generate in sentiment scripts (#249)
Co-authored-by: gaurav.vi <gaurav.vi@media.net>
2023-03-27 12:09:50 +02:00
0b165e60bc Fix typo (#253) 2023-03-27 11:59:15 +02:00
404621f0f9 Improve logging for PPO + Docs page (#243)
* init pr

* try and fix docpreview

* fix

* try to fix tests

* nit

* fix tests

* convert to tensor
2023-03-24 09:34:57 +01:00
89df6abf21 feat(ci): enable pip cache (#198)
* feat(ci): add pip caching to CI

* feat(ci): create workflow to cleanup cache

* feat(ci): enable `pip` caching in CI
2023-03-24 09:33:43 +01:00
9523474490 PPO config __init__ is bloated (#241)
* Moving `total_ppo_epochs`, `forward_batch_size` and `log_with` to post init method and let the dataclass automatically assign the other member variables.

* Using default factory functions for initializing dict

* Using fields + metadata for args description

* Reformatting the file using black(jupyter)

* Trying styling checks again

* Adding new args from PR 238

---------

Co-authored-by: gaurav.vi <gaurav.vi@media.net>
2023-03-24 09:33:22 +01:00
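
The three techniques listed in #241 (a post-init method for derived values, default factories for dicts, and field metadata for argument descriptions) can be sketched with the standard `dataclasses` module. `ExampleConfig` and all of its fields and defaults are hypothetical; this is not the actual `PPOConfig`.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ExampleConfig:
    # Plain fields: the dataclass generates __init__ and assigns them.
    steps: int = field(default=20000, metadata={"help": "Number of training steps"})
    batch_size: int = field(default=256, metadata={"help": "Samples per optimisation step"})
    # default_factory gives every instance its own dict instead of a shared one.
    tracker_kwargs: dict = field(default_factory=dict, metadata={"help": "Keyword arguments for the tracker"})
    # Derived value: excluded from __init__ and filled in by __post_init__.
    total_epochs: int = field(init=False, metadata={"help": "Derived number of epochs"})

    def __post_init__(self):
        # Ceiling division of steps by batch_size.
        self.total_epochs = -(-self.steps // self.batch_size)


config = ExampleConfig()
print(config.total_epochs)

# The metadata doubles as argument documentation.
for f in fields(ExampleConfig):
    print(f.name, "->", f.metadata["help"])
```

Keeping only the derived values in `__post_init__` lets the generated `__init__` handle everything else, which is the reduction in boilerplate the PR title refers to.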
1620da371a adds early stopping (#238)
* adds early stopping

* zero opt grad

* style

* Fixed typo in early stopping property description

* Auto stash before rebase of "origin/main"
2023-03-23 15:24:04 +01:00
9b60207f0b [core] Add warning when negative KL (#239)
* add warning

* oops

* fix

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-03-22 12:18:43 +01:00
a6ebdb6e75 Reduce memory consumption in batched_forward_pass (#234)
* Reduce memory consumption by not storing logits in forward_pass

* Add docstring of return_logits
2023-03-22 10:18:18 +01:00
9c3e9e43d0 Batched generation (#228)
* add `_generate_batch`

* fix style

* omit tensor conversion

* no multiple pad by default

* add test

* stylez

* update docstring

* encoder/decoder check

* input shape safety

* moar style

---------

Co-authored-by: leandro von werra <leandro@hf.co>
2023-03-21 16:48:34 +01:00
0610711dda [core] refactor peft API (#231)
* refactor peft API

* update gpt2 peft script

* refactor

* few fixes

* fix bug

* make style

* update docs

* more update

* fix docs

* fix issues and add tests

* make style

* update docs
2023-03-21 13:35:21 +01:00
24627e9c89 set dev version 2023-03-17 10:40:04 +00:00
e6183176bc Release: 0.4.1 2023-03-17 10:07:44 +00:00
6b88bba62b [test] attempt to fix CI test for PT 2.0 (#225)
* attempt to fix CI test

* attempt to fix CI to PT 2.0

* fix 3.7 issue

* fix

* make quality

* try

* Update tests/test_ppo_trainer.py
2023-03-17 10:42:38 +01:00
44f708ee15 [peft] Fix DP issues (#221)
* fix DP issues

* add instructions

* more details

* test

* add pad labels

* ultimate fix

* explain black magic
2023-03-16 11:19:47 +01:00
90f0090580 adds a missing detach to the ratio (#224)
* adds a missing detach to the ratio

* style
2023-03-16 10:47:56 +01:00
768c3892c8 Grad accumulation and memory bugfix (#220)
* adds args and grad accum steps to sentiment examples

* updates to minibatch size in peft 20b example

* adds arg and grad acc to toxicity example

* adds detach to all entries in the step stats to reduce memory usage

* adds accelerator accumulation context

* makes gradient_accumulation_steps part of the PPOConfig

* Update trl/trainer/ppo_trainer.py

* style
2023-03-16 09:56:36 +01:00
7940683014 [core] fix DP issue (#222)
* fix DP issue

* fix

* oops

* Empty-Commit

* skip test
2023-03-16 08:43:12 +01:00
03d9844730 Let's support naive Pipeline Parallelism (#210)
* add fixes in to support PP

* add same logic for enc-dec

* add more checks

* fix 20b issues

* clean up

* update scripts

* dp safety checker

* added multi gpu tests

* fix order

* change

* fix script
2023-03-15 08:28:52 +01:00
357730f8fd Small changes when integrating into H4 (#216)
* nits

* style
2023-03-14 14:15:23 -07:00
b75d83ab28 spell corrections (#214)
* spell corrections

* Update docs/source/quickstart.mdx

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-12 16:39:44 +01:00
0a95577dd6 spell mistakes (#213) 2023-03-12 16:37:20 +01:00
34773e97a2 Update README.md blog post link (#212) 2023-03-12 09:04:18 +01:00
ddb6df367d adds sentiment example for a 20b model (#208)
* adds sentiment example for a 20b model

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Update examples/sentiment/scripts/gpt-neox-20b_peft/s03_gpt-neo-20b_sentiment_peft.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* removed numbers from script names

* adds examples to docs

* cm -> clm

* style

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-09 15:29:31 +01:00
6c0252545a set dev version 2023-03-09 11:39:42 +00:00
c9a0a8711b Release: 0.4.0 2023-03-09 09:44:58 +00:00
5c08afc1bc [core] Update dependency (#206)
* update dependency

* clarify instructions

* Update setup.py

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Apply suggestions from code review

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-03-09 10:22:04 +01:00
679f29d408 peft integration (#163)
* adds a hacky peft example

* fixes bug due to missing "prepare_model_for_training"

* Formatting

* adds peft to requirements

* Update trl/trainer/ppo_trainer.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* gpt neo runs

* changes requested on the PR

* style

* updates to prepare_model_for_int8_training PEFT PR https://github.com/huggingface/peft/pull/105

* updates to prepare_model_for_int8_training PEFT PR https://github.com/huggingface/peft/pull/105

* adds missing 8-bit attribute to modeling base

* adds lr to example script

* adds missing train to trainer

* disables caching temporarily while I debug something

* debugging issues with unstable training

* Fix peft + int8 (#170)

* add fix

* another fix

* Auto stash before merge of "peft-example" and "origin/peft-example"

* adds peft model types to modeling base

* reduces memory usage using adapters and no ref model.

* adds support for EleutherAI/gpt-neox-20b

* example for peft finetune of cm model

* removes hacky research code

* fixing the rebase and some typos

* style

* style2

* adds gradient checkpointing to base model

* cleans up comments

* moves config and other pretrained_model properties to __init__

* make style

* added tests

* change dependency

* Update .github/workflows/tests.yml

* fix test

* fix style and failing tests

* make quality

* revert change

* rm unneeded change

* revert changes

* rm changes

* rm changes

* rm uneeded change

* Update trl/models/modeling_base.py

* revert uneeded changes

* make style

* adapt suggestions

* fix tests

* attempt to fix

* fix

* fix

* add no peft test

* revert

* remove unneeded check

* more tests

* fix logic

* add `save_pretrained` support

* fix quality

* clean up

* clean up

* stronger test

* refactor comments

* make style

* attempt to add non-peft tests

* remove test runner

* format

* fix test

* move `train` on top

* fix peft import

* make quality

* fixes typo

* adds peft example to docs

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: younesbelakda <younesbelkada@gmail.com>
2023-03-07 15:08:21 +01:00
76fd085c25 Add 1.12.1 torch compatibility in sum method (#190)
* Add 1.12.1 torch compatibility in sum method

* Replace try-catch with more explicit if-statement

* code style + quality

* Apply suggestions from code review

---------

Co-authored-by: Yehor Panchenko <panchenko.yehor@huawei.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-06 10:11:07 +01:00
6126433a4b [core] Fix quality issue (#197)
* fix quality

* Update Makefile

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* add more tests and fix them

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2023-03-06 09:52:06 +01:00
f95be7736f Allow running evaluate-toxicity with cpu (#195) 2023-03-06 09:30:13 +01:00
a05ddbdd83 set correct dev version 2023-03-02 09:20:10 +00:00
206bb1e2b0 Release: 0.3.1 2023-03-02 09:12:56 +00:00
e7220be712 Fix reference to example (#184) 2023-03-02 09:58:32 +01:00
a1616f75fc Update detoxifying_a_lm.mdx (#186) 2023-03-01 20:57:47 +01:00
e1b836ce9c Clarifications of acronyms and initialisms (#185) 2023-03-01 20:56:38 +01:00
bfcf71ac3d set dev version 2023-03-01 13:42:43 +01:00
294 changed files with 72118 additions and 4489 deletions

67
.github/ISSUE_TEMPLATE/bug-report.yml vendored Normal file

@ -0,0 +1,67 @@
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve TRL
labels: [ "bug" ]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this bug report! 🤗
🚩 If it is your first time submitting, be sure to check our [bug report guidelines](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#did-you-find-a-bug)
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction
description: |
Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
If you have code snippets, error messages, stack traces please provide them here as well.
Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
value: |
```python
from trl import ...
```
outputs:
```
Traceback (most recent call last):
File "example.py", line 42, in <module>
...
```
- type: textarea
id: system-info
attributes:
label: System Info
description: |
Please provide information about your system: platform, Python version, PyTorch version, Transformers version, devices, TRL version, ...
You can get this information by running `trl env` in your terminal.
placeholder: Copy-paste the output of `trl env`
validations:
required: true
- type: checkboxes
id: terms
attributes:
label: Checklist
description: |
Before submitting, please confirm that you've completed each of the following.
If an item doesn't apply to your issue, check it anyway to show you've reviewed it.
options:
- label: "I have checked that my issue isn't already filed (see [open issues](https://github.com/huggingface/trl/issues?q=is%3Aissue))"
required: true
- label: "I have included my system information"
required: true
- label: "Any code provided is minimal, complete, and reproducible ([more on MREs](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))"
required: true
- label: "Any code provided is properly formatted in code blocks (no screenshots; [more on code blocks](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))"
required: true
- label: "Any traceback provided is complete"
required: true


@ -0,0 +1,31 @@
name: "\U0001F680 Feature request"
description: Submit a proposal/request for a new TRL feature
labels: [ "Feature request" ]
body:
- type: textarea
id: feature-request
validations:
required: true
attributes:
label: Feature request
description: |
A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist.
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation
description: |
Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.
- type: textarea
id: contribution
validations:
required: true
attributes:
label: Your contribution
description: |
Is there any way that you could help, e.g. by submitting a PR? Make sure to read the CONTRIBUTING.MD [readme](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md)


@ -0,0 +1,32 @@
name: "\U0001F31F New trainer addition"
description: Submit a proposal/request to implement a new trainer for a post-training method
labels: [ "New trainer" ]
body:
- type: textarea
id: description-request
validations:
required: true
attributes:
label: Method description
description: |
Put any and all important information relative to the method
- type: checkboxes
id: information-tasks
attributes:
label: Open source status
description: |
Please note that if the method implementation isn't available or model weights with training datasets aren't available, we are less likely to implement it in `trl`.
options:
- label: "The method implementation is available"
- label: "The model weights are available"
- label: "The training datasets are available"
- type: textarea
id: additional-info
attributes:
label: Provide useful links for the implementation
description: |
Please provide information regarding the implementation, the weights, and the authors.
Please mention the authors by @gh-username if you're aware of their usernames.

31
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file

@ -0,0 +1,31 @@
# What does this PR do?
<!--
Congratulations! You've made it this far! You're not quite done yet though.
Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.
Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.
Once you're done, someone will review your PR shortly. They may suggest changes to make the code even better.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#create-a-pull-request),
Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?
## Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

19
.github/codeql/custom-queries.qls vendored Normal file

@ -0,0 +1,19 @@
import codeql
from WorkflowString interpolation, Workflow workflow
where
interpolation.getStringValue().matches("${{ github.event.issue.title }}") or
interpolation.getStringValue().matches("${{ github.event.issue.body }}") or
interpolation.getStringValue().matches("${{ github.event.pull_request.title }}") or
interpolation.getStringValue().matches("${{ github.event.pull_request.body }}") or
interpolation.getStringValue().matches("${{ github.event.review.body }}") or
interpolation.getStringValue().matches("${{ github.event.comment.body }}") or
interpolation.getStringValue().matches("${{ github.event.inputs.* }}") or
interpolation.getStringValue().matches("${{ github.event.head_commit.message }}") or
interpolation.getStringValue().matches("${{ github.event.* }}") and
(
step.getKey() = "run" or // Injection in run
step.getKey() = "env" or // Injection via env
step.getKey() = "with" // Injection via with
)
select workflow, "🚨 Do not use directly as input of action"
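The query above is meant to flag workflow expressions that interpolate event fields an outside user can control (issue titles, PR bodies, comments, and so on). As an illustration only, the same idea can be sketched in Python with a regular expression. `UNTRUSTED_CONTEXTS`, `UNTRUSTED_PREFIXES`, and `find_untrusted_interpolations` are hypothetical names, not part of CodeQL or TRL, and the check works on raw text rather than on a parsed workflow.

```python
import re

# Assumed list of attacker-controllable contexts; it mirrors the expressions
# named in the query above and is not meant to be exhaustive.
UNTRUSTED_CONTEXTS = (
    "github.event.issue.title",
    "github.event.issue.body",
    "github.event.pull_request.title",
    "github.event.pull_request.body",
    "github.event.review.body",
    "github.event.comment.body",
    "github.event.head_commit.message",
)
# Any manually supplied workflow input is treated as untrusted as well.
UNTRUSTED_PREFIXES = ("github.event.inputs.",)

# Matches `${{ <expression> }}` and captures the expression without padding.
EXPRESSION_RE = re.compile(r"\$\{\{\s*(.*?)\s*\}\}")


def find_untrusted_interpolations(workflow_text):
    """Return (line_number, expression) pairs for untrusted interpolations."""
    findings = []
    for line_number, line in enumerate(workflow_text.splitlines(), start=1):
        for expression in EXPRESSION_RE.findall(line):
            if expression in UNTRUSTED_CONTEXTS or expression.startswith(UNTRUSTED_PREFIXES):
                findings.append((line_number, expression))
    return findings
```

A text scan like this cannot tell whether the expression sits under `run`, `env`, or `with`, which is the distinction the query tries to make, so it would over-report compared with a structural analysis.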


@ -13,7 +13,6 @@ jobs:
with:
commit_sha: ${{ github.sha }}
package: trl
repo_owner: lvwerra
version_tag_suffix: ""
secrets:
token: ${{ secrets.HUGGINGFACE_PUSH }}
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}


@ -9,10 +9,10 @@ concurrency:
jobs:
build:
if: github.event.pull_request.draft == false
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: trl
repo_owner: lvwerra
version_tag_suffix: ""
version_tag_suffix: ""

33
.github/workflows/clear_cache.yml vendored Normal file

@ -0,0 +1,33 @@
name: "Cleanup Cache"
on:
workflow_dispatch:
schedule:
- cron: "0 0 * * *"
jobs:
cleanup:
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Cleanup
run: |
gh extension install actions/gh-actions-cache
REPO=${{ github.repository }}
echo "Fetching list of cache key"
cacheKeysForPR=$(gh actions-cache list -R $REPO | cut -f 1 )
## Setting this to not fail the workflow while deleting cache keys.
set +e
echo "Deleting caches..."
for cacheKey in $cacheKeysForPR
do
gh actions-cache delete $cacheKey -R $REPO --confirm
done
echo "Done"
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
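The cleanup step keeps the first tab-separated column of the `gh actions-cache list` output (`cut -f 1`) and deletes each resulting key in turn. Below is a minimal Python sketch of the key-extraction part only, assuming one cache entry per line; the function name and any sample listing are made up for illustration and do not reproduce the real CLI output.

```python
def extract_cache_keys(listing):
    """Mimic `cut -f 1`: keep the first tab-separated field of each non-empty line."""
    return [line.split("\t", 1)[0] for line in listing.splitlines() if line.strip()]
```

One difference from the shell loop is worth noting: `for cacheKey in $cacheKeysForPR` relies on word splitting, so a key containing spaces would be split into several words, whereas this sketch keeps such a key intact.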

26
.github/workflows/codeQL.yml vendored Normal file

@ -0,0 +1,26 @@
name: "CodeQL Analysis - Workflows"
on:
workflow_dispatch:
jobs:
analyze:
name: "Analyze GitHub Workflows"
runs-on: ubuntu-latest
permissions:
security-events: write
actions: read
contents: read
steps:
- name: "Checkout repository"
uses: actions/checkout@v4
- name: "Initialize CodeQL"
uses: github/codeql-action/init@v2
with:
languages: "yaml"
queries: +security-and-quality, ./.github/codeql/custom-queries.qls
- name: "Perform CodeQL Analysis"
uses: github/codeql-action/analyze@v2


@ -1,13 +0,0 @@
name: Delete dev documentation
on:
pull_request:
types: [ closed ]
jobs:
delete:
uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
with:
pr_number: ${{ github.event.number }}
package: trl

84
.github/workflows/docker-build.yml vendored Normal file

@ -0,0 +1,84 @@
name: Build TRL Docker image
on:
push:
branches:
- main
workflow_dispatch:
concurrency:
group: docker-image-builds
cancel-in-progress: false
jobs:
trl:
name: "Build and push TRL Docker image"
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Get TRL version from PyPI
run: |
VERSION=$(curl -s https://pypi.org/pypi/trl/json | jq -r .info.version)
echo "VERSION=$VERSION" >> $GITHUB_ENV
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to DockerHub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- name: Build and Push
uses: docker/build-push-action@v4
with:
context: docker/trl
push: true
tags: |
huggingface/trl:${{ env.VERSION }}
huggingface/trl
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.CI_DOCKER_CHANNEL }}
title: 🤗 Results of the TRL Docker Image build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
trl-dev:
name: "Build and push TRL Dev Docker image"
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to DockerHub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
- name: Build and Push
uses: docker/build-push-action@v4
with:
context: docker/trl-dev
push: true
tags: |
huggingface/trl:dev
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.CI_DOCKER_CHANNEL }}
title: 🤗 Results of the TRL Dev Docker Image build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
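The `trl` job reads the latest released version from the PyPI JSON payload (`jq -r .info.version`) and pushes the image under both a versioned tag and the bare image name. A rough Python sketch of that tag derivation, assuming the payload has already been downloaded; `docker_tags` is a hypothetical helper, not something defined in the repository.

```python
import json


def docker_tags(pypi_payload, image="huggingface/trl"):
    """Derive the image tags from a PyPI JSON payload (equivalent of `.info.version`)."""
    version = json.loads(pypi_payload)["info"]["version"]
    # Versioned tag first, then the untagged image name, as in the `tags:` list.
    return [f"{image}:{version}", image]
```

Because the version comes from PyPI rather than from the checked-out source, the versioned tag tracks the latest published release, not the commit that triggered the build.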


@ -0,0 +1,15 @@
name: "Hugging Face Issue Labeler"
on:
issues:
types: opened
jobs:
triage:
runs-on: ubuntu-latest
permissions:
issues: write
steps:
- uses: actions/checkout@v3
- uses: August-murr/auto-labeler@main
with:
hf-api-key: ${{ secrets.CI_HF_API_TOKEN }}

127
.github/workflows/pr_style_bot.yml vendored Normal file

@ -0,0 +1,127 @@
name: PR Style Bot
on:
workflow_dispatch:
permissions:
contents: write
pull-requests: write
jobs:
run-style-bot:
if: >
contains(github.event.comment.body, '@bot /style') &&
github.event.issue.pull_request != null
runs-on: ubuntu-latest
steps:
- name: Extract PR details
id: pr_info
uses: actions/github-script@v6
with:
script: |
const prNumber = context.payload.issue.number;
const { data: pr } = await github.rest.pulls.get({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: prNumber
});
// We capture both the branch ref and the "full_name" of the head repo
// so that we can check out the correct repository & branch (including forks).
core.setOutput("prNumber", prNumber);
core.setOutput("headRef", pr.head.ref);
core.setOutput("headRepoFullName", pr.head.repo.full_name);
- name: Check out PR branch
uses: actions/checkout@v3
env:
HEADREPOFULLNAME: ${{ steps.pr_info.outputs.headRepoFullName }}
HEADREF: ${{ steps.pr_info.outputs.headRef }}
with:
# Instead of checking out the base repo, use the contributor's repo name
repository: ${{ env.HEADREPOFULLNAME }}
ref: ${{ env.HEADREF }}
# You may need fetch-depth: 0 for being able to push
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}
- name: Debug
env:
HEADREPOFULLNAME: ${{ steps.pr_info.outputs.headRepoFullName }}
HEADREF: ${{ steps.pr_info.outputs.headRef }}
PRNUMBER: ${{ steps.pr_info.outputs.prNumber }}
run: |
echo "PR number: ${{ env.PRNUMBER }}"
echo "Head Ref: ${{ env.HEADREF }}"
echo "Head Repo Full Name: ${{ env.HEADREPOFULLNAME }}"
- name: Set up Python
uses: actions/setup-python@v4
- name: Install dependencies
run: |
pip install ruff pre-commit
- name: Download Makefile from main branch
run: |
curl -o main_Makefile https://raw.githubusercontent.com/huggingface/trl/main/Makefile
- name: Compare Makefiles
run: |
if ! diff -q main_Makefile Makefile; then
echo "Error: The Makefile has changed. Please ensure it matches the main branch."
exit 1
fi
echo "No changes in Makefile. Proceeding..."
rm -rf main_Makefile
- name: Run make style and make quality
run: |
make precommit || true
- name: Commit and push changes
id: commit_and_push
env:
HEADREPOFULLNAME: ${{ steps.pr_info.outputs.headRepoFullName }}
HEADREF: ${{ steps.pr_info.outputs.headRef }}
PRNUMBER: ${{ steps.pr_info.outputs.prNumber }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
echo "HEADREPOFULLNAME: ${{ env.HEADREPOFULLNAME }}, HEADREF: ${{ env.HEADREF }}"
# Configure git with the Actions bot user
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
# Make sure your 'origin' remote is set to the contributor's fork
git remote set-url origin "https://x-access-token:${GITHUB_TOKEN}@github.com/${{ env.HEADREPOFULLNAME }}.git"
# If there are changes after running style/quality, commit them
if [ -n "$(git status --porcelain)" ]; then
git add .
git commit -m "Apply style fixes"
# Push to the original contributor's forked branch
git push origin HEAD:${{ env.HEADREF }}
echo "changes_pushed=true" >> $GITHUB_OUTPUT
else
echo "No changes to commit."
echo "changes_pushed=false" >> $GITHUB_OUTPUT
fi
- name: Comment on PR with workflow run link
if: steps.commit_and_push.outputs.changes_pushed == 'true'
uses: actions/github-script@v6
with:
script: |
const prNumber = parseInt(process.env.prNumber, 10);
const runUrl = `${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID}`
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: prNumber,
body: `Style fixes have been applied. [View the workflow run here](${runUrl}).`
});
env:
prNumber: ${{ steps.pr_info.outputs.prNumber }}

43
.github/workflows/publish.yml vendored Normal file

@ -0,0 +1,43 @@
name: Publish to PyPI
on:
push:
branches:
- main
- v*-release
paths:
- "VERSION"
jobs:
publish:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Read version
id: get_version
run: echo "version=$(cat VERSION)" >> $GITHUB_OUTPUT
- name: Debug - Show VERSION content
run: echo "Version is ${{ steps.get_version.outputs.version }}"
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.x"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Publish to PyPI
if: ${{ !contains(steps.get_version.outputs.version, 'dev') }}
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
python -m twine upload dist/*
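The upload step is gated by `!contains(steps.get_version.outputs.version, 'dev')`, so a push that changes `VERSION` to a development version builds the package but does not publish it. The gate can be sketched in Python as follows; `should_publish` is an illustrative name, and the lowercasing reflects the fact that `contains` in GitHub Actions expressions is not case sensitive.

```python
def should_publish(version):
    """Return True when the version string does not mark a development build."""
    # `$(cat VERSION)` drops trailing newlines, hence the strip().
    return "dev" not in version.strip().lower()
```

For example, `should_publish("0.24.0")` is true, while a version such as `0.25.0.dev0` is skipped.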

111
.github/workflows/slow-tests.yml vendored Normal file

@ -0,0 +1,111 @@
name: Slow tests (on push)
on:
push:
branches: [main]
paths:
# Run only when python files are modified
- "trl/**.py"
- "examples/**.py"
env:
RUN_SLOW: "yes"
IS_GITHUB_CI: "1"
SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
jobs:
run_all_tests_single_gpu:
runs-on:
group: aws-g4dn-2xlarge
env:
CUDA_VISIBLE_DEVICES: "0"
TEST_TYPE: "single_gpu"
container:
image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
options: --gpus all --shm-size "16gb"
defaults:
run:
shell: bash
steps:
- name: Git checkout
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get update && apt-get install -y make git curl
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Create Python virtual environment
run: |
uv venv
uv pip install --upgrade setuptools wheel
- name: Install dependencies
run: |
source .venv/bin/activate
uv pip install ".[dev]"
uv pip install pytest-reportlog parameterized
- name: Run slow SFT tests on single GPU
if: always()
run: |
source .venv/bin/activate
make slow_tests
- name: Generate Report
if: always()
run: |
source .venv/bin/activate
uv pip install slack_sdk tabulate
python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY
run_all_tests_multi_gpu:
runs-on:
group: aws-g4dn-2xlarge
env:
CUDA_VISIBLE_DEVICES: "0,1"
TEST_TYPE: "multi_gpu"
container:
image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
options: --gpus all --shm-size "16gb"
defaults:
run:
shell: bash
steps:
- name: Git checkout
uses: actions/checkout@v4
- name: Install system dependencies
run: |
apt-get update && apt-get install -y make git curl
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Create Python virtual environment
run: |
uv venv
uv pip install --upgrade setuptools wheel
- name: Install dependencies
run: |
source .venv/bin/activate
uv pip install ".[dev]"
uv pip install pytest-reportlog parameterized
- name: Run slow SFT tests on Multi GPU
if: always()
run: |
source .venv/bin/activate
make slow_tests
- name: Generate Reports
if: always()
run: |
source .venv/bin/activate
uv pip install slack_sdk tabulate
python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY
rm *.txt


@ -1,47 +1,253 @@
name: tests
name: Tests
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
paths:
# Run only when relevant files are modified
- ".github/**.yml"
- "examples/**.py"
- "scripts/**.py"
- "tests/**.py"
- "trl/**.py"
- "pyproject.toml"
env:
TQDM_DISABLE: 1
CI_SLACK_CHANNEL: ${{ secrets.CI_PUSH_MAIN_CHANNEL }}
PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
jobs:
check_code_quality:
name: Check code quality
runs-on: ubuntu-latest
if: github.event.pull_request.draft == false
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
- uses: actions/checkout@v4
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: "3.8"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install .[dev]
- name: Check quality
run: |
make quality
python-version: 3.12
- uses: pre-commit/action@v3.0.1
with:
extra_args: --all-files
tests:
needs: check_code_quality
name: Tests
strategy:
matrix:
python-version: [3.7, 3.8, 3.9]
os: ['ubuntu-latest', 'macos-latest', 'windows-latest']
runs-on: ${{ matrix.os }}
python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
fail-fast: false
runs-on:
group: aws-g4dn-2xlarge
container:
image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
options: --gpus all
defaults:
run:
shell: bash
if: github.event.pull_request.draft == false
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# cpu version of pytorch
pip install .[test]
- name: Test with pytest
run: |
make test
- name: Git checkout
uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install Make and Git
run: |
apt-get update && apt-get install -y make git curl
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Create Python virtual environment
run: |
uv venv
uv pip install --upgrade setuptools wheel
- name: Install dependencies
run: |
source .venv/bin/activate
uv pip install ".[dev]"
- name: Test with pytest
run: |
source .venv/bin/activate
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with Python ${{ matrix.python-version }} and latest dependencies
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
tests_dev:
name: Tests with dev dependencies
runs-on:
group: aws-g4dn-2xlarge
container:
image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
options: --gpus all
defaults:
run:
shell: bash
if: github.event.pull_request.draft == false
steps:
- name: Git checkout
uses: actions/checkout@v4
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install Make and Git
run: |
apt-get update && apt-get install -y make git curl
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Create Python virtual environment
run: |
uv venv
uv pip install --upgrade setuptools wheel
- name: Install dependencies
run: |
source .venv/bin/activate
uv pip install ".[dev]"
uv pip install -U git+https://github.com/huggingface/accelerate.git
uv pip install -U git+https://github.com/huggingface/datasets.git
uv pip install -U git+https://github.com/huggingface/transformers.git
uv pip install -U git+https://github.com/huggingface/peft.git
- name: Test with pytest
run: |
source .venv/bin/activate
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with Python 3.12 and dev dependencies
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
tests_wo_optional_deps:
name: Tests without optional dependencies
runs-on:
group: aws-g4dn-2xlarge
container:
image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
options: --gpus all
defaults:
run:
shell: bash
if: github.event.pull_request.draft == false
steps:
- name: Git checkout
uses: actions/checkout@v4
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install Make and Git
run: |
apt-get update && apt-get install -y make git curl
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Create Python virtual environment
run: |
uv venv
uv pip install --upgrade setuptools wheel
- name: Install dependencies
run: |
source .venv/bin/activate
uv pip install ".[test]"
- name: Test with pytest
run: |
source .venv/bin/activate
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with Python 3.12 without optional dependencies
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
tests_min_versions:
name: Tests with minimum versions
runs-on:
group: aws-g4dn-2xlarge
container:
image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
options: --gpus all
defaults:
run:
shell: bash
if: github.event.pull_request.draft == false
steps:
- name: Git checkout
uses: actions/checkout@v4
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install Make and Git
run: |
apt-get update && apt-get install -y make git curl
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Create Python virtual environment
run: |
uv venv
uv pip install --upgrade setuptools wheel
- name: Install dependencies
run: |
source .venv/bin/activate
uv pip install ".[dev]"
uv pip install accelerate==1.4.0
uv pip install datasets==3.0.0
uv pip install transformers==4.56.1
- name: Test with pytest
run: |
source .venv/bin/activate
make test
- name: Post to Slack
if: github.ref == 'refs/heads/main' && always() # Check if the branch is main
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results with Python 3.12 and minimum dependencies versions
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}

66
.github/workflows/tests_latest.yml vendored Normal file

@ -0,0 +1,66 @@
name: Tests latest TRL release with dev dependencies
on:
schedule:
- cron: '0 0 * * *' # Runs daily at midnight UTC
workflow_dispatch:
env:
TQDM_DISABLE: 1
CI_SLACK_CHANNEL: ${{ secrets.CI_PUSH_MAIN_CHANNEL }}
jobs:
tests:
name: Tests latest TRL release with dev dependencies
runs-on:
group: aws-g4dn-2xlarge
container:
image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
options: --gpus all
defaults:
run:
shell: bash
steps:
- name: Git checkout
uses: actions/checkout@v4
with: { ref: v0.24-release }
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install Make and Git
run: |
apt-get update && apt-get install -y make git curl
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Create Python virtual environment
run: |
uv venv
uv pip install --upgrade setuptools wheel
- name: Install dependencies
run: |
source .venv/bin/activate
uv pip install ".[dev]"
uv pip install -U git+https://github.com/huggingface/accelerate.git
uv pip install -U git+https://github.com/huggingface/datasets.git
uv pip install -U git+https://github.com/huggingface/transformers.git
- name: Test with pytest
run: |
source .venv/bin/activate
make test
- name: Post to Slack
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: Results of latest TRL with Python 3.12 and dev dependencies
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}

18
.github/workflows/trufflehog.yml vendored Normal file

@ -0,0 +1,18 @@
on:
push:
name: Secret Leaks
jobs:
trufflehog:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Secret Scanning
uses: trufflesecurity/trufflehog@853e1e8d249fd1e29d0fcc7280d29b03df3d643d
with:
# exclude buggy postgres detector that is causing false positives and not relevant to our codebase
extra_args: --results=verified,unknown --exclude-detectors=postgres


@ -0,0 +1,16 @@
name: Upload PR Documentation
on:
workflow_run:
workflows: ["Build PR Documentation"]
types:
- completed
jobs:
build:
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
with:
package_name: trl
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

17
.pre-commit-config.yaml Normal file

@ -0,0 +1,17 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.11.10
hooks:
- id: ruff-check
types_or: [ python, pyi ]
args: [ --fix ]
- id: ruff-format
types_or: [ python, pyi ]
# - repo: https://github.com/codespell-project/codespell
# rev: v2.1.0
# hooks:
# - id: codespell
# args:
# - --ignore-words-list=nd,reacher,thist,ths,magent,ba
# - --skip=docs/css/termynal.css,docs/js/termynal.js


@ -17,7 +17,13 @@ authors:
family-names: Thrush
- given-names: Nathan
family-names: Lambert
repository-code: 'https://github.com/lvwerra/trl'
- given-names: Shengyi
family-names: Huang
- given-names: Kashif
family-names: Rasul
- given-names: Quentin
family-names: Gallouédec
repository-code: 'https://github.com/huggingface/trl'
abstract: "With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by \U0001F917 Hugging Face. Therefore, pre-trained language models can be directly loaded via transformers. At this point, most decoder and encoder-decoder architectures are supported."
keywords:
- rlhf
@ -25,4 +31,4 @@ keywords:
- pytorch
- transformers
license: Apache-2.0
version: 0.2.1
version: "0.24"

133
CODE_OF_CONDUCT.md Normal file

@ -0,0 +1,133 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or advances of
any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
feedback@huggingface.co.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series of
actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within the
community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].
[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations


@ -1,48 +1,428 @@
# How to contribute
# How to contribute to TRL?
## How to get started
Everyone is welcome to contribute, and we value everybody's contribution. Code contributions are not the only way to help the community. Answering questions, helping others, and improving the documentation are also immensely valuable.
Before you start contributing make sure you installed all the dev tools:
It also helps us if you spread the word! Reference the library in blog posts about the awesome projects it made possible, shout out on Twitter every time it has helped you, or simply ⭐️ the repository to say thank you.
However you choose to contribute, please be mindful and respect our [code of conduct](https://github.com/huggingface/trl/blob/main/CODE_OF_CONDUCT.md).
**This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**
## Ways to contribute
There are several ways you can contribute to TRL:
* Fix outstanding issues with the existing code.
* Submit issues related to bugs or desired new features.
* Implement trainers for new post-training algorithms.
* Contribute to the examples or the documentation.
If you don't know where to start, there is a special [Good First Issue](https://github.com/huggingface/trl/labels/%F0%9F%91%B6%20good%20first%20issue) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open-source. The best way to do that is to open a Pull Request and link it to the issue that you'd like to work on. We try to give priority to opened PRs as we can easily track the progress of the fix, and if the contributor does not have time anymore, someone else can take the PR over.
For something slightly more challenging, you can also take a look at the [Good Second Issue](https://github.com/huggingface/trl/labels/Good%20Second%20Issue) list. In general though, if you feel like you know what you're doing, go for it and we'll help you get there! 🚀
> All contributions are equally valuable to the community. 🥰
Before you start contributing make sure you have installed all the dev tools:
```bash
pip install -e ".[dev]"
pip install -e .[dev]
```
## Did you find a bug?
## Fixing outstanding issues
* Ensure the bug was not already reported by searching on GitHub under Issues.
* If you're unable to find an open issue addressing the problem, open a new one. Be sure to include a title and clear description, as much relevant information as possible, and a code sample or an executable test case demonstrating the expected behavior that is not occurring.
* Be sure to add the complete error messages.
If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](#submitting-a-pull-request-pr) and open a Pull Request!
#### Did you write a patch that fixes a bug?
## Submitting a bug-related issue or feature request
* Open a new GitHub pull request with the patch.
* Ensure that your PR includes a test that fails without your patch and passes with it.
* Ensure the PR description clearly describes the problem and solution. Include the relevant issue number if applicable.
Do your best to follow these guidelines when submitting a bug-related issue or a feature request. It will make it easier for us to come back to you quickly and with good feedback.
## PR submission guidelines
### Did you find a bug?
* Keep each PR focused. While it's more convenient, do not combine several unrelated fixes together. Create as many branches as needed to keep each PR focused.
* Do not mix style changes/fixes with "functional" changes. Such PRs are very difficult to review and will most likely get rejected.
* Do not add/remove vertical whitespace. Preserve the original style of the file you edit as much as you can.
* Do not turn an already submitted PR into your development playground. If, after you submitted a PR, you discover that more work is needed, close the PR, do the required work and then submit a new PR. Otherwise each of your commits requires attention from maintainers of the project.
* If, however, you submitted a PR and received a request for changes, you should proceed with commits inside that PR, so that the maintainer can see the incremental fixes and won't need to review the whole PR again. In the exceptional case where you realize it will take many commits to complete the requests, it's probably best to close the PR, do the work and then submit it again. Use common sense when choosing one way over the other.
The TRL library is robust and reliable thanks to users who report the problems they encounter.
### Before you submit a PR
Before you report an issue, we would really appreciate it if you could **make sure the bug was not already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the library itself, and not your code.
First you want to make sure that all the tests pass:
Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so we can quickly resolve it:
* Your **OS type and version**, **Python**, **PyTorch**, **TRL** and **Transformers** versions.
* A short, self-contained, code snippet that allows us to reproduce the bug in less than 30s.
* The *full* traceback if an exception is raised.
* Attach any other additional information, like screenshots, you think may help.
To get the OS and software versions automatically, run the following command:
```bash
make test
trl env
```
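If `trl env` is not available (for example, on an install where the CLI entry point is missing), the same version information can be gathered by hand. The sketch below is only an illustration using the standard library: `collect_versions` is a hypothetical helper, not part of TRL, and it assumes the PyTorch distribution is installed under the name `torch`.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("torch", "trl", "transformers")):
    """Return OS, Python, and package versions as a dict (hypothetical helper)."""
    info = {"os": platform.platform(), "python": platform.python_version()}
    for name in packages:
        try:
            # Look up the installed distribution's version without importing it.
            info[name] = version(name)
        except PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue covers the version details requested above.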
Then before submitting your PR make sure the code quality follows the standards. You can run the following command to format and test:
### Do you want a new feature?
If there is a new feature you'd like to see in TRL, please open an issue and describe:
1. What is the *motivation* behind this feature? Is it related to a problem or frustration with the library? Is it a feature related to something you need for a project? Is it something you worked on and think it could benefit the community?
Whatever it is, we'd love to hear about it!
2. Describe your requested feature in as much detail as possible. The more you can tell us about it, the better we'll be able to help you.
3. Provide a *code snippet* that demonstrates the feature's usage.
4. If the feature is related to a paper, please include a link.
If your issue is well written we're already 80% of the way there by the time you create it.
## Do you want to implement a new trainer?
New post-training methods are published frequently and those that satisfy the following criteria are good candidates to be integrated into TRL:
* **Simplicity:** Does the new method achieve similar performance as prior methods, but with less complexity? A good example is Direct Preference Optimization (DPO) [[Rafailov et al, 2023]](https://huggingface.co/papers/2305.18290), which provided a simpler and compelling alternative to RLHF methods.
* **Efficiency:** Does the new method provide a significant improvement in training efficiency? A good example is Odds Ratio Preference Optimization (ORPO) [[Hong et al, 2024]](https://huggingface.co/papers/2403.07691), which utilizes a similar objective as DPO but requires half the GPU VRAM.
Methods that only provide incremental improvements at the expense of added complexity or compute costs are unlikely to be included in TRL.
If you want to implement a trainer for a new post-training method, first open an issue and provide the following information:
* A short description of the method and a link to the paper.
* Link to the implementation if it is open-sourced.
* Link to model weights trained with the method if they are available.
Based on the community and maintainer feedback, the next step will be to implement the trainer and config classes. See the following examples for inspiration:
* Paired preference optimisation: [`dpo_trainer.py`](./trl/trainer/dpo_trainer.py) and [`dpo_config.py`](./trl/trainer/dpo_config.py)
* RL-based optimisation: [`rloo_trainer.py`](./trl/trainer/rloo_trainer.py) and [`rloo_config.py`](./trl/trainer/rloo_config.py)
* Online optimisation: [`online_dpo_trainer.py`](./trl/trainer/online_dpo_trainer.py) and [`online_dpo_config.py`](./trl/trainer/online_dpo_config.py)
## Do you want to add documentation?
We're always looking for improvements to the documentation that make it clearer and more accurate. Please let us know how the documentation can be improved, for example by reporting typos, dead links, or missing, unclear, or inaccurate content. We'll be happy to make the changes or help you contribute if you're interested!
## Submitting a pull request (PR)
Before writing code, we strongly advise you to search through the existing PRs or issues to make sure that nobody is already working on the same thing. If you are unsure, it is always a good idea to open an issue to get some feedback.
You will need basic `git` proficiency to be able to contribute to TRL. `git` is not the easiest tool to use but it has the greatest manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro Git](https://git-scm.com/book/en/v2) is a very good reference.
Follow these steps to start contributing:
1. Fork the [repository](https://github.com/huggingface/trl) by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.
2. Clone your fork to your local disk, and add the base repository as a remote. The following command assumes you have your public SSH key uploaded to GitHub. See the following guide for more [information](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository).
```bash
git clone git@github.com:<your Github handle>/trl.git
cd trl
git remote add upstream https://github.com/huggingface/trl.git
```
3. Create a new branch to hold your development changes, and do this for every new PR you work on.
Start by synchronizing your `main` branch with the `upstream/main` branch (more details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):
```bash
git checkout main
git fetch upstream
git merge upstream/main
```
Once your `main` branch is synchronized, create a new branch from it:
```bash
git checkout -b a-descriptive-name-for-my-changes
```
**Do not** work on the `main` branch.
4. Set up a development environment by running the following command in a conda or a virtual environment you've created for working on this library:
```bash
pip install -e .[dev]
```
(If TRL was already installed in the virtual environment, remove it with `pip uninstall trl` before reinstalling it.)
Alternatively, if you are using [Visual Studio Code](https://code.visualstudio.com/Download), the fastest way to get set up is by using the provided Dev Container. Check [the documentation on how to get started with dev containers](https://code.visualstudio.com/docs/remote/containers).
5. Develop the features on your branch.
As you work on the features, you should make sure that the test suite passes. You should run the tests impacted by your changes like this (see below an explanation regarding the environment variable):
```bash
pytest tests/<TEST_TO_RUN>.py
```
> For the following commands leveraging the `make` utility.
You can also run the full suite with the following command.
```bash
make test
```
TRL relies on `ruff` for maintaining consistent code formatting across its source files. Before submitting any PR, you should apply automatic style corrections and run code verification checks.
We provide a `precommit` target in the `Makefile` that simplifies this process by running all required checks and optimizations on only the files modified by your PR.
To apply these checks and corrections in one step, use:
```bash
make precommit
```
This command runs the following:
* Executes `pre-commit` hooks to automatically fix style issues with `ruff` and other tools.
* Runs additional scripts such as adding copyright information.
If you prefer to apply the style corrections separately or review them individually, the `pre-commit` hook will handle the formatting for the files in question.
Once you're happy with your changes, add changed files using `git add` and make a commit with `git commit` to record your changes locally:
```bash
git add modified_file.py
git commit
```
Please write [good commit messages](https://chris.beams.io/posts/git-commit/).
It is a good idea to sync your copy of the code with the original
repository regularly. This way you can quickly account for changes:
```bash
git fetch upstream
git rebase upstream/main
```
Push the changes to your account using:
```bash
git push -u origin a-descriptive-name-for-my-changes
```
6. Once you are satisfied (**and the checklist below is happy too**), go to the webpage of your fork on GitHub. Click on 'Pull request' to send your changes to the project maintainers for review.
7. It's ok if maintainers ask you for changes. It happens to core contributors too! To ensure everyone can review your changes in the pull request, work on your local branch and push the updates to your fork. They will automatically appear in the pull request.
### Checklist
1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in the pull request description to make sure they are linked (and people consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`, or mark the PR as a draft PR. These are useful to avoid duplicated work, and to differentiate it from PRs ready to be merged;
4. Make sure existing tests pass;
5. Add high-coverage tests. No quality testing = no merge.
### Tests
An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
the [tests folder](https://github.com/huggingface/trl/tree/main/tests).
We use `pytest` to run the tests. From the root of the
repository here's how to run tests with `pytest` for the library:
```bash
make style && make quality
python -m pytest -sv ./tests
```
## Do you want to contribute to the documentation?
That's how `make test` is implemented (without the `pip install` line)!
* Docs are in the `docs/` folder and can be updated there.
You can specify a smaller set of tests to test only the feature
you're working on.
### Default values guidelines
1. **Use defaults when appropriate**:
Provide default values unless the parameter's value varies significantly by use case. For example, datasets or models should not have defaults, but parameters like `learning_rate` should.
2. **Prioritize proven defaults**:
Default values should align with those recommended in the original paper or method. Alternatives require strong evidence of superior performance in most cases.
3. **Ensure safety and predictability**:
Defaults must be safe, expected and reliable. Avoid settings that could lead to surprising outcomes, such as excessive memory usage or poor performance in edge cases.
4. **Balance consistency and flexibility**:
Aim for consistent defaults across similar functions or methods. However, consistency should not be preferred to point 2 or 3.
5. **Opt-in for new features**:
Do not enable new features or improvements (e.g., novel loss functions) by default. Users should explicitly opt-in to use these.
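As a rough illustration of guidelines 1 and 5, here is a minimal sketch of a config class. `ToyTrainingConfig`, its field names, and the default value shown are hypothetical and chosen only for this example; they are not taken from any TRL config.

```python
from dataclasses import dataclass


@dataclass
class ToyTrainingConfig:
    """Hypothetical config illustrating the default-value guidelines."""

    # No default: the dataset varies by use case, so the user must choose it.
    dataset_name: str
    # Sensible default: most users can start from a common value (illustrative number).
    learning_rate: float = 1e-5
    # Opt-in: a new feature stays disabled unless the user asks for it.
    use_new_loss: bool = False


config = ToyTrainingConfig(dataset_name="my-dataset")
print(config.learning_rate, config.use_new_loss)
```

Omitting `dataset_name` raises a `TypeError` at construction time, which makes the missing choice visible immediately instead of silently falling back to an arbitrary dataset.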
### Writing documentation
High-quality documentation is crucial for maintaining a project that is easy to use, understand, and extend. When adding new features, ensure they are thoroughly documented to maintain consistency and clarity throughout the project.
To illustrate what good documentation looks like, here's an example of a well-documented function:
````python
def replicate_str(string: str, n: int, sep: str = " ") -> str:
    r"""
    Replicate a string `n` times with a separator.

    Args:
        string (`str`):
            String to replicate.
        n (`int`):
            Number of times to replicate the string.
        sep (`str`, *optional*, defaults to `" "`):
            Separator to use between each replication.

    Returns:
        `str`: The replicated string.

    Examples:
    ```python
    >>> replicate_str("hello", 3)
    "hello hello hello"
    >>> replicate_str("hello", 3, sep=", ")
    "hello, hello, hello"
    ```
    """
    return sep.join([string] * n)
````
* **Line Wrapping:** Applied a consistent line wrap at column 120 to improve readability.
* **Definite Articles:** Removed definite articles where possible to streamline language (e.g., changed "The string to replicate" to "String to replicate").
* **Type Annotations:**
* Always include type definitions, indicating if a parameter is optional and specifying the default value.
* Note that `Optional` means that the value can be `None`, and `*optional*` means that it is not required for the user to pass a value.
E.g., for arguments that can't be `None` and aren't required:
```txt
foo (`int`, *optional*, defaults to `4`):
```
For arguments that can be `None` and are required:
```txt
foo (`Optional[int]`):
```
For arguments that can be `None` and aren't required (in this case, if the default value is `None`, you can omit it):
```txt
foo (`Optional[int]`, *optional*):
```
* **String Defaults:**
* Ensured that default string values are wrapped in double quotes:
```txt
defaults to `"foo"`
```
* **Dictionary Typing:**
* Replaced generic `dict` type hints with more explicit `dict[str, Any]` to clarify expected key-value pairs.
* **Default Value Formatting:**
* Consistently surrounded default values with backticks for improved formatting:
```txt
defaults to `4`
```
* **Sub-sectioning:** When the number of arguments is large, consider breaking them into sub-sections for better readability.
```python
def calculate_statistics(data: list[float], precision: int = 2, include_variance: bool = False) -> dict[str, float]:
    r"""
    Calculates basic statistics for a given dataset.

    Args:
        > Data inputs

        data (`list[float]`):
            A list of numerical values to analyze.

        > Configuration parameters

        precision (`int`, *optional*, defaults to `2`):
            Number of decimal places to round the results.
        include_variance (`bool`, *optional*, defaults to `False`):
            Whether to include the variance of the dataset in the results.

    Returns:
        `dict[str, float]`:
            A dictionary containing calculated statistics such as mean, median, and optionally variance.
    """
    ...
```
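The `Optional` versus `*optional*` conventions from the list above can appear together in a single signature. The following is a minimal sketch; `truncate_words` is a hypothetical helper written for this illustration and is not part of TRL.

```python
from typing import Optional


def truncate_words(text: str, limit: Optional[int], sep: str = " ", suffix: Optional[str] = None) -> str:
    r"""
    Keep the first `limit` words of a text.

    Args:
        text (`str`):
            Text to truncate.
        limit (`Optional[int]`):
            Maximum number of words to keep. If `None`, all words are kept.
        sep (`str`, *optional*, defaults to `" "`):
            Separator used to split and re-join words.
        suffix (`Optional[str]`, *optional*):
            String appended to the result when words were dropped.

    Returns:
        `str`: The truncated text.
    """
    words = text.split(sep)
    # `limit` is required but may be `None`, which disables truncation.
    if limit is None or len(words) <= limit:
        return text
    truncated = sep.join(words[:limit])
    return truncated if suffix is None else truncated + suffix
```

Here `limit` must always be passed but accepts `None`, `sep` can be omitted but cannot be `None`, and `suffix` can be both omitted and `None`, so its default is left out of the docstring.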
### Deprecation and backward compatibility
Our approach to deprecation and backward compatibility is flexible and based on the feature's usage and impact. Each deprecation is carefully evaluated, aiming to balance innovation with user needs.
When a feature or component is marked for deprecation, its use will emit a warning message. This warning will include:
* **Transition Guidance**: Instructions on how to migrate to the alternative solution or replacement.
* **Removal Version**: The target version when the feature will be removed, providing users with a clear timeframe to transition.
Example:
```python
import warnings

warnings.warn(
    "The `Trainer.foo` method is deprecated and will be removed in version 0.14.0. "
    "Please use the `Trainer.bar` method instead.",
    FutureWarning,
)
```
The deprecation and removal schedule is based on each feature's usage and impact, with examples at two extremes:
* **Experimental or Low-Use Features**: For a feature that is experimental or has limited usage, backward compatibility may not be maintained between releases. Users should therefore anticipate potential breaking changes from one version to the next.
* **Widely-Used Components**: For a feature with high usage, we aim for a more gradual transition period of approximately **5 months**, generally scheduling removal around **5 minor releases** after the initial warning.
These examples represent the two ends of a continuum. The specific timeline for each feature will be determined individually, balancing innovation with user stability needs.
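A minimal sketch of how such a warning might sit inside a deprecated method that forwards to its replacement. `ToyTrainer`, `foo`, and `bar` are hypothetical names used only for illustration, and `stacklevel=2` is an assumption added here so that the warning points at the caller's line rather than at the library code.

```python
import warnings


class ToyTrainer:
    """Hypothetical class used only to illustrate the deprecation pattern."""

    def bar(self, x: int) -> int:
        # Replacement method that users should migrate to.
        return x * 2

    def foo(self, x: int) -> int:
        # Deprecated method: warn with transition guidance and a removal version,
        # then delegate to the replacement so existing behavior is preserved.
        warnings.warn(
            "The `ToyTrainer.foo` method is deprecated and will be removed in version 0.14.0. "
            "Please use the `ToyTrainer.bar` method instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.bar(x)
```

Delegating to the replacement keeps the old entry point working during the transition period while nudging callers toward the new one.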
### Working with warnings
Warnings play a critical role in guiding users toward resolving potential issues, but they should be used thoughtfully to avoid unnecessary noise. Unlike logging, which provides informational context or operational details, warnings signal conditions that require attention and action. Overusing warnings can dilute their importance, leading users to ignore them entirely.
#### Definitions
* **Correct**: An operation is correct if it is valid, follows the intended approach, and aligns with the current best practices or guidelines within the codebase. This is the recommended or intended way to perform the operation.
* **Supported**: An operation is supported if it is technically valid and works within the current codebase, but it may not be the most efficient, optimal, or recommended way to perform the task. This includes deprecated features or legacy approaches that still work but may be phased out in the future.
#### Choosing the right message
* **Correct → No warning**:
If the operation is fully valid and expected, no message should be issued. The system is working as intended, so no warning is necessary.
* **Correct but deserves attention → No warning, possibly a log message**:
When an operation is correct but uncommon or requires special attention, providing an informational message can be helpful. This keeps users informed without implying any issue. If available, use the logger to output this message. Example:
```python
logger.info("This is an informational message about a rare but correct operation.")
```
* **Correct but very likely a mistake → Warning with option to disable**:
In rare cases, you may want to issue a warning for a correct operation that's very likely a mistake. In such cases, you must provide an option to suppress the warning. This can be done with a flag in the function. Example:
```python
def my_function(foo, bar, _warn=True):
    if foo == bar:
        if _warn:
            logger.warning("foo and bar are the same, this is likely a mistake. Ignore this warning by setting `_warn=False`.")
    # Do something
```
* **Supported but not correct → Warning**:
If the operation is technically supported but is deprecated, suboptimal, or could cause future issues (e.g., conflicting arguments), a warning should be raised. This message should be actionable, meaning it must explain how to resolve the issue. Example:
```python
def my_function(foo, bar):
    if foo and bar:
        logger.warning("Both `foo` and `bar` were provided, but only one is allowed. Ignoring `foo`. Please pass only one of these arguments.")
    # Do something
```
* **Not supported → Exception**:
If the operation is invalid or unsupported, raise an exception. This indicates that the operation cannot be performed and requires immediate attention. Example:
```python
def my_function(foo, bar):
    if foo and bar:
        raise ValueError("Both `foo` and `bar` were provided, but only one is allowed. Please pass only one of these arguments.")
```
By following this classification, you ensure that warnings, information, and exceptions are used appropriately, providing clear guidance to the user without cluttering the system with unnecessary messages.


@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Copyright 2020-2025 The HuggingFace Team
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.


@ -1,5 +1,7 @@
include settings.ini
include LICENSE
include CONTRIBUTING.md
include README.md
include trl/accelerate_configs/*.yaml
include trl/templates/*.md
recursive-exclude * __pycache__
prune tests


@ -1,13 +1,19 @@
.PHONY: quality style test
.PHONY: test precommit common_tests slow_tests tests_gpu test_experimental
check_dirs := examples tests trl
ACCELERATE_CONFIG_PATH = `pwd`/examples/accelerate_configs
test:
python -m pytest -n auto --dist=loadfile -s -v ./tests/
pytest -n auto -m "not slow and not low_priority" -s -v --reruns 5 --reruns-delay 1 --only-rerun '(OSError|Timeout|HTTPError.*502|HTTPError.*504|not less than or equal to 0.01)' tests/
quality:
black --check --line-length 119 --target-version py38 tests trl
isort --check-only tests trl
flake8 tests trl
precommit:
python scripts/add_copyrights.py
pre-commit run --all-files
doc-builder style trl tests docs/source --max_len 119
style:
black --line-length 119 --target-version py38 tests trl examples setup.py
isort tests trl
slow_tests:
pytest -m "slow" tests/ $(if $(IS_GITHUB_CI),--report-log "slow_tests.log",)
test_experimental:
pytest -k "experimental"

README.md

@ -1,121 +1,210 @@
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_banner_dark.png">
</div>
# TRL - Transformer Reinforcement Learning
> Train transformer language models with reinforcement learning.
## What is it?
With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library by 🤗 Hugging Face. Therefore, pre-trained language models can be directly loaded via `transformers`. At this point most decoder and encoder-decoder architectures are supported.
**Highlights:**
- `PPOTrainer`: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
- `AutoModelForCausalLMWithValueHead` & `AutoModelForSeq2SeqLMWithValueHead`: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.
- Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.
## How it works
Fine-tuning a language model via PPO consists of roughly three steps:
1. **Rollout**: The language model generates a response or continuation based on a query, which could be the start of a sentence.
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair.
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is being trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
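To make the role of the KL term in step 3 concrete, here is a minimal sketch in plain Python. It is a simplified illustration under stated assumptions, not the library's implementation: the function name, the `beta` coefficient, the per-token difference of log-probabilities as the KL estimate, and the choice to add the scalar score to the last token are all assumptions made for this example.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, beta=0.1):
    """Combine a scalar score with a per-token KL penalty (illustrative sketch).

    logprobs:     log-probabilities of the response tokens under the trained model
    ref_logprobs: log-probabilities of the same tokens under the reference model
    score:        scalar from the evaluation step for this query/response pair
    beta:         assumed coefficient scaling the KL penalty
    """
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("logprobs and ref_logprobs must be non-empty and of equal length")
    # Per-token KL estimate: how far the trained model has moved from the reference.
    kl = [lp - ref for lp, ref in zip(logprobs, ref_logprobs)]
    # The penalty lowers the reward wherever the trained model drifts from the reference.
    rewards = [-beta * k for k in kl]
    # The scalar score from the evaluation step is credited to the final token.
    rewards[-1] += score
    return rewards


print(kl_penalized_rewards([-1.0, -2.0], [-1.5, -2.0], score=1.0, beta=0.5))
```

With these inputs the first token is penalized for being more likely than under the reference model, while the last token receives the evaluation score unchanged because its log-probabilities match.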
This process is illustrated in the sketch below:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_overview.png" width="800">
<p style="text-align: center;"> <b>Figure:</b> Sketch of the workflow. </p>
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png" alt="TRL Banner">
</div>
<hr> <br>
<h3 align="center">
<p>A comprehensive library to post-train foundation models</p>
</h3>
<p align="center">
<a href="https://github.com/huggingface/trl/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/huggingface/trl.svg?color=blue"></a>
<a href="https://huggingface.co/docs/trl/index"><img alt="Documentation" src="https://img.shields.io/website?label=documentation&url=https%3A%2F%2Fhuggingface.co%2Fdocs%2Ftrl%2Findex&down_color=red&down_message=offline&up_color=blue&up_message=online"></a>
<a href="https://github.com/huggingface/trl/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/trl.svg"></a>
<a href="https://huggingface.co/trl-lib"><img alt="Hugging Face Hub" src="https://img.shields.io/badge/🤗%20Hub-trl--lib-yellow"></a>
</p>
## 🎉 What's New
> **✨ OpenAI GPT OSS Support**: TRL now fully supports fine-tuning the latest [OpenAI GPT OSS models](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4)! Check out the:
>
> - [OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers)
> - [GPT OSS recipes](https://github.com/huggingface/gpt-oss-recipes)
> - [Our example script](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_gpt_oss.py)
## Overview
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
## Highlights
- **Trainers**: Various fine-tuning methods are easily accessible via trainers like [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`GRPOTrainer`](https://huggingface.co/docs/trl/grpo_trainer), [`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer) and more.
- **Efficient and scalable**:
- Leverages [🤗 Accelerate](https://github.com/huggingface/accelerate) to scale from single GPU to multi-node clusters using methods like [DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed).
- Full integration with [🤗 PEFT](https://github.com/huggingface/peft) enables training on large models with modest hardware via quantization and LoRA/QLoRA.
- Integrates [🦥 Unsloth](https://github.com/unslothai/unsloth) for accelerating training using optimized kernels.
- **Command Line Interface (CLI)**: A simple interface lets you fine-tune models without needing to write code.
## Installation
### Python Package
Install the library using `pip`:
```bash
pip install trl
```
### From source
If you want to use the latest features before an official release, you can install TRL from source:
```bash
pip install git+https://github.com/huggingface/trl.git
```
### Repository
If you want to use the examples, you can clone the repository with the following command:
```bash
git clone https://github.com/huggingface/trl.git
```
## Quick Start
For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.
### `SFTTrainer`
Here is a basic example of how to use the [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer):
```python
from trl import SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
```
### `GRPOTrainer`
[`GRPOTrainer`](https://huggingface.co/docs/trl/grpo_trainer) implements the [Group Relative Policy Optimization (GRPO) algorithm](https://huggingface.co/papers/2402.03300), which is more memory-efficient than PPO and was used to train [DeepSeek AI's R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).
```python
from datasets import load_dataset
from trl import GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)
trainer.train()
```
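Before launching a run, it can help to call the reward function on a few hand-written strings and confirm that it returns one score per completion. The sketch below repeats the dummy reward from the GRPO example and assumes plain-string completions; the sample strings are made up for illustration.

```python
# Same dummy reward as in the GRPO example: one score per completion.
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]

# Hypothetical completions, used only to inspect the output shape.
sample_completions = ["aaa", "abc", "hello world"]
rewards = reward_num_unique_chars(sample_completions)
print(rewards)  # [1, 3, 8]
```

A reward function that returns a list of a different length than `completions` is a common source of errors, so this kind of check is cheap insurance.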
### `DPOTrainer`
[`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer) implements the popular [Direct Preference Optimization (DPO) algorithm](https://huggingface.co/papers/2305.18290) that was used to post-train [Llama 3](https://huggingface.co/papers/2407.21783) and many other models. Here is a basic example of how to use the `DPOTrainer`:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```
### `RewardTrainer`
Here is a basic example of how to use the [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer):
```python
from trl import RewardTrainer
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
```
## Command Line Interface (CLI)
You can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):
**SFT:**
```bash
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/Capybara \
--output_dir Qwen2.5-0.5B-SFT
```
**DPO:**
```bash
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name argilla/Capybara-Preferences \
--output_dir Qwen2.5-0.5B-DPO
```
Read more about CLI in the [relevant documentation section](https://huggingface.co/docs/trl/main/en/clis) or use `--help` for more details.
## Development
If you want to contribute to `trl` or customize it to your needs, read the [contribution guide](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md) and make a dev install:
```bash
git clone https://github.com/huggingface/trl.git
cd trl/
pip install -e .[dev]
```
## Experimental
A minimal incubation area is available under `trl.experimental` for unstable / fast-evolving features. Anything there may change or be removed in any release without notice.
Example:
```python
from trl.experimental.new_trainer import NewTrainer
```
Read more in the [Experimental docs](https://huggingface.co/docs/trl/main/en/experimental).
## Citation
```bibtex
@misc{vonwerra2022trl,
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
title = {TRL: Transformer Reinforcement Learning},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/trl}}
}
```
## License
This repository's source code is available under the [Apache-2.0 License](LICENSE).

167
RELEASE.md Normal file

@ -0,0 +1,167 @@
# Making a release
> [!NOTE]
> VERSION needs to be formatted following the `v{major}.{minor}.{patch}` convention. We need to follow this convention to be able to retrieve versioned scripts.
## Major/Minor Release
### 1. Ensure your local repository is up to date with the upstream repository
```bash
git checkout main
git pull origin main
```
> [!WARNING]
> Do not merge other pull requests into `main` until the release is done. This is to ensure that the release is stable and does not include any untested changes. Announce internally (#trl-internal) to other maintainers that you are doing a release and that they must not merge PRs until the release is done.
### 2. Create a release branch from main
```bash
git checkout -b release-v{major}.{minor}
```
### 3. Change the version in the following files
- `.github/workflows/tests_latest.yml`:
```diff
- with: { ref: v{major}.{minor-1}-release }
+ with: { ref: v{major}.{minor}-release }
```
- `CITATION.cff`
```diff
- version: "{major}.{minor-1}"
+ version: "{major}.{minor}"
```
- `VERSION`
```diff
- {major}.{minor}.0.dev0
+ {major}.{minor}.0
```
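The `v{major}.{minor}.{patch}` convention from the note at the top of this guide can be checked mechanically before tagging. A minimal sketch, assuming a hypothetical helper `parse_release_tag` that is not part of TRL:

```python
import re

# Tag convention from the note above: v{major}.{minor}.{patch}
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")

def parse_release_tag(tag: str) -> tuple[int, int, int]:
    """Return (major, minor, patch), or raise ValueError if the tag is malformed."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"{tag!r} does not follow v{{major}}.{{minor}}.{{patch}}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch

print(parse_release_tag("v0.24.0"))  # (0, 24, 0)
```

A tag such as `0.24` or `v0.24` raises `ValueError`, which is the kind of slip that would otherwise break the retrieval of versioned scripts.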
### 4. Commit and push these changes
```shell
git add .github/workflows/tests_latest.yml CITATION.cff VERSION
git commit -m 'Release: {major}.{minor}'
git push origin release-v{major}.{minor}
```
### 5. Create a pull request
Create a pull request from `release-v{major}.{minor}` to `main`, named `Release: v{major}.{minor}`. Wait for the tests to pass, then request a review.
### 6. Once the pull request is approved, merge it into `main`
It will automatically publish the new version of the package on PyPI.
### 7. Add a tag in git to mark the release
```shell
git checkout main
git pull origin main
git tag -a v{major}.{minor}.0 -m 'Adds tag v{major}.{minor}.0 for PyPI'
git push origin v{major}.{minor}.0
```
### 8. Create a branch `v{major}.{minor}-release` for future patch releases
```shell
git checkout -b v{major}.{minor}-release
git push origin v{major}.{minor}-release
```
This ensures that future patch releases (`v{major}.{minor}.1`, `v{major}.{minor}.2`, etc.) can be made separately from `main`.
### 9. Create a GitHub Release
1. Go to the repo's [releases section](https://github.com/huggingface/trl/releases) on GitHub.
2. Click **Draft a new release**.
3. Select the `v{major}.{minor}.0` tag you just created in step 7.
4. Add a title (`v{major}.{minor}.0`) and a short description of what's new.
5. Click **Publish Release**.
### 10. Bump to dev version
1. Create a branch `bump-dev-version-{major}.{minor+1}` from `main` and check it out.
```shell
git checkout -b bump-dev-version-{major}.{minor+1}
```
2. Change the version in file `VERSION`:
```diff
- {major}.{minor}.0
+ {major}.{minor+1}.0.dev0
```
3. Commit and push these changes
```shell
git add VERSION
git commit -m '⬆️ Bump dev version'
git push origin bump-dev-version-{major}.{minor+1}
```
4. Create a pull request from `bump-dev-version-{major}.{minor+1}` to `main`, named `⬆️ Bump dev version`, and request urgent review.
5. Once the pull request is approved, merge it into `main`.
6. The codebase is now ready for the next development cycle; inform the team in the #trl-internal channel.
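The version change in step 10 follows a fixed pattern: the minor number goes up by one, the patch resets to 0, and a `.dev0` suffix is appended. A small sketch of that rule, using a hypothetical helper name (`next_dev_version`) that does not exist in the repository:

```python
def next_dev_version(released: str) -> str:
    """Map a released version such as '0.24.0' to the next dev version '0.25.0.dev0'."""
    major, minor, _patch = (int(part) for part in released.split("."))
    return f"{major}.{minor + 1}.0.dev0"

print(next_dev_version("0.24.0"))  # 0.25.0.dev0
```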
## Making a patch release
### 1. Ensure your local repository is up to date with the upstream repository
```bash
git checkout v{major}.{minor}-release
git pull origin v{major}.{minor}-release
```
### 2. Cherry-pick the changes you want to include in the patch release
```bash
git cherry-pick <commit-hash-0>
git cherry-pick <commit-hash-1>
...
```
### 3. Change the version in the file `VERSION`
```diff
- {major}.{minor}.{patch-1}
+ {major}.{minor}.{patch}
```
### 4. Commit and push these changes
```shell
git add VERSION
git commit -m 'Release: {major}.{minor}.{patch}'
git push origin v{major}.{minor}-release
```
### 5. Wait for the CI to pass
The CI will automatically publish the new version of the package on PyPI.
### 6. Add a tag in git to mark the release
```shell
git tag -a v{major}.{minor}.{patch} -m 'Adds tag v{major}.{minor}.{patch} for PyPI'
git push origin v{major}.{minor}.{patch}
```
### 7. Create a GitHub Release
1. Go to the repo's [releases section](https://github.com/huggingface/trl/releases) on GitHub.
2. Click **Draft a new release**.
3. Select the `v{major}.{minor}.{patch}` tag you just created in step 6.
4. Add a title (`v{major}.{minor}.{patch}`) and a short description of what's new.
5. Click **Publish Release**.

1
VERSION Normal file

@ -0,0 +1 @@
0.25.0.dev0


@ -0,0 +1,6 @@
FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip uv
RUN uv pip install --system --no-cache "git+https://github.com/huggingface/trl.git#egg=trl[liger,peft,vlm]"
RUN uv pip install --system hf_transfer liger_kernel trackio peft
RUN uv pip install --system https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

4
docker/trl/Dockerfile Normal file

@ -0,0 +1,4 @@
FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
RUN pip install --upgrade pip uv
RUN uv pip install --system trl[liger,peft,vlm] hf_transfer trackio
RUN uv pip install --system https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl


@ -1,24 +1,112 @@
- sections:
- sections:
- local: index
title: TRL
- local: quickstart
title: Quickstart
- local: installation
title: Installation
- local: customization
title: Customize your training
title: Get started
- local: quickstart
title: Quickstart
title: Getting started
- sections:
- local: models
title: Model Classes
- local: trainer
title: Trainer Classes
title: API
- sections:
- local: dataset_formats
title: Dataset Formats
- local: paper_index
title: Paper Index
- local: experimental
title: Experimental
title: Conceptual Guides
- sections:
- local: clis
title: Command Line Interface (CLI)
- local: jobs_training
title: Training using Jobs
- local: customization
title: Customizing the Training
- local: reducing_memory_usage
title: Reducing Memory Usage
- local: speeding_up_training
title: Speeding Up Training
- local: distributing_training
title: Distributing Training
- local: use_model
title: Using Trained Models
title: How-to guides
- sections:
- local: deepspeed_integration
title: DeepSpeed
- local: kernels_hub
title: Kernels Hub
- local: liger_kernel_integration
title: Liger Kernel
- local: peft_integration
title: PEFT
- local: trackio_integration
title: Trackio
- local: unsloth_integration
title: Unsloth
- local: vllm_integration
title: vLLM
title: Integrations
- sections:
- local: example_overview
title: Example Overview
- local: community_tutorials
title: Community Tutorials
- local: lora_without_regret
title: LoRA Without Regret
- local: sentiment_tuning
title: Sentiment Tuning
- local: summarization_reward_tuning
title: Summarization Reward Tuning
- local: detoxifying_a_lm
title: Detoxifying a Language Model
- local: multi_adapter_rl
title: Multi Adapter RLHF
title: Examples
- sections:
- sections: # Sorted alphabetically
- local: bco_trainer
title: BCO
- local: cpo_trainer
title: CPO
- local: dpo_trainer
title: DPO
- local: online_dpo_trainer
title: Online DPO
- local: gkd_trainer
title: GKD
- local: grpo_trainer
title: GRPO
- local: kto_trainer
title: KTO
- local: nash_md_trainer
title: Nash-MD
- local: orpo_trainer
title: ORPO
- local: ppo_trainer
title: PPO
- local: prm_trainer
title: PRM
- local: reward_trainer
title: Reward
- local: rloo_trainer
title: RLOO
- local: sft_trainer
title: SFT
- local: xpo_trainer
title: XPO
title: Trainers
- local: models
title: Model Classes
- local: model_utils
title: Model Utilities
- local: best_of_n
title: Best of N Sampling
- local: judges
title: Judges
- local: callbacks
title: Callbacks
- local: data_utils
title: Data Utilities
- local: rewards
title: Reward Functions
- local: script_utils
title: Script Utilities
- local: others
title: Others
title: API

103
docs/source/bco_trainer.md Normal file

@ -0,0 +1,103 @@
# BCO Trainer
[![model badge](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)
TRL supports Binary Classifier Optimization (BCO).
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
For a full example have a look at [`examples/scripts/bco.py`].
## Expected dataset type
The [`BCOTrainer`] requires an [unpaired preference dataset](dataset_formats#unpaired-preference).
The [`BCOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
## Expected model format
The BCO trainer expects an `AutoModelForCausalLM` model, whereas PPO expects an `AutoModelForCausalLMWithValueHead` for the value function.
## Using the `BCOTrainer`
For a detailed example, have a look at the `examples/scripts/bco.py` script. At a high level, we need to initialize the `BCOTrainer` with a `model` we wish to train and a reference `ref_model`, which is used to calculate the implicit rewards of the preferred and rejected responses.
`beta` is the hyperparameter of the implicit reward, and the dataset contains the three entries of the unpaired preference format (`prompt`, `completion`, and `label`). Note that the `model` and `ref_model` need to have the same architecture (i.e., decoder-only or encoder-decoder).
```python
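from trl import BCOConfig, BCOTrainer

# `model`, `model_ref`, `train_dataset`, and `tokenizer` are assumed to be defined beforehand
training_args = BCOConfig(
    beta=0.1,
)
bco_trainer = BCOTrainer(
    model,
    model_ref,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)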
```
After this one can then call:
```python
bco_trainer.train()
```
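To make the role of `beta` more concrete, the following is a simplified, scalar sketch of the idea behind BCO: the implicit reward is the scaled log-ratio between the policy and the reference model, and that reward is used as the logit of a binary classifier. This is not TRL's implementation. The function names, the made-up log-probabilities, and the choice `delta=0.0` for the reward shift are all assumptions for illustration (the paper tracks the shift as a running mean of rewards).

```python
import math

def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def bco_example_loss(reward: float, desirable: bool, delta: float = 0.0) -> float:
    """Binary cross-entropy on the shifted reward; `delta` is a reward shift (assumed 0 here)."""
    logit = reward - delta
    if not desirable:
        logit = -logit
    # -log(sigmoid(logit)) written as log(1 + exp(-logit))
    return math.log1p(math.exp(-logit))

# Made-up sequence log-probabilities, for illustration only.
r = implicit_reward(policy_logp=-12.0, ref_logp=-15.0, beta=0.1)
print(round(r, 3))  # 0.3
print(bco_example_loss(r, desirable=True) < bco_example_loss(r, desirable=False))  # True
```

A larger `beta` amplifies the same log-ratio into a larger reward, so the classifier reacts more strongly to any deviation from the reference model.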
## Underlying Distribution matching (UDM)
In practical scenarios, the thumbs-up and thumbs-down datasets are likely to have divergent underlying distributions of prompts.
Consider an LLM deployed for user feedback: if the model excels in writing tasks but underperforms in coding, the thumbs-up dataset will be dominated by writing-related prompts, while the thumbs-down dataset will contain mostly coding-related prompts.
If the prompts in your desired and undesired datasets differ a lot, it is useful to enable UDM.
Choose an embedding model and tokenizer:
```python
from functools import partial

from accelerate import Accelerator
from transformers import AutoModel, AutoTokenizer

# `your_model_id` is a placeholder for the embedding model of your choice
embedding_model = AutoModel.from_pretrained(your_model_id)
embedding_tokenizer = AutoTokenizer.from_pretrained(your_model_id)

# customize this function depending on your embedding model
def embed_prompt(input_ids, attention_mask, model):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.last_hidden_state.mean(dim=1)

embedding_model = Accelerator().prepare_model(embedding_model)
embedding_func = partial(embed_prompt, model=embedding_model)
```
Set `prompt_sample_size` to define how many prompts are selected to train the UDM classifier and start the training with the provided embedding function:
```python
training_args = BCOConfig(
    beta=0.1,
    prompt_sample_size=512,
)
bco_trainer = BCOTrainer(
    model,
    model_ref,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    embedding_func=embedding_func,
    embedding_tokenizer=embedding_tokenizer,
)
bco_trainer.train()
```
### For Mixture of Experts Models: Enabling the auxiliary loss
MoEs are most efficient when the load is distributed roughly equally between experts.
To ensure that we train MoEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. MixtralConfig).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
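As a rough illustration of how the coefficient scales the auxiliary term, the sketch below adds a load-balancing loss to a preference loss. The function name and the loss values are invented for the example; only `router_aux_loss_coef` and its default of 0.001 come from the text above.

```python
def total_loss_with_aux(preference_loss: float, aux_loss: float,
                        router_aux_loss_coef: float = 0.001) -> float:
    """Add the scaled load-balancing loss to the preference loss."""
    return preference_loss + router_aux_loss_coef * aux_loss

# Made-up loss values, for illustration only.
print(round(total_loss_with_aux(0.65, 2.0), 4))  # 0.652
```

With the default coefficient the auxiliary term is a small correction, so raising `router_aux_loss_coef` is the lever to pull if expert load stays unbalanced.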
## BCOTrainer
[[autodoc]] BCOTrainer
- train
- save_model
- push_to_hub
## BCOConfig
[[autodoc]] BCOConfig

68
docs/source/best_of_n.md Normal file

@ -0,0 +1,68 @@
# Best of N sampling: Alternative ways to get better model output without RL based fine-tuning
> [!WARNING]
> Best-of-N sampling is deprecated and will be removed in TRL 0.25.0.
Within the extras module is the `best-of-n` sampler class, which serves as an alternative method of generating better model output.
To see how it fares against RL-based fine-tuning, look at the comparison example in the `examples` directory.
## Usage
To get started quickly, instantiate the class with a model, a length sampler, a tokenizer, and a callable that serves as a proxy reward pipeline and outputs reward scores for input queries:
```python
from transformers import pipeline, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler

# `ref_model_name`, `reward_model`, and `device` are placeholders to set for your setup
model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model_name)
reward_pipe = pipeline("sentiment-analysis", model=reward_model, device=device)
tokenizer = AutoTokenizer.from_pretrained(ref_model_name)
tokenizer.pad_token = tokenizer.eos_token
# illustrative output-length bounds (an assumption; choose values that suit your task)
output_length_sampler = LengthSampler(4, 16)

# callable that takes a list of raw text and returns a list of corresponding reward scores
def queries_to_scores(list_of_strings):
    return [output["score"] for output in reward_pipe(list_of_strings)]

best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler)
```
Assuming you have a list or tensor of tokenized queries, you can generate better output by calling the `generate` method:
```python
best_of_n.generate(query_tensors, device=device, **gen_kwargs)
```
The default sample size is 4, but you can change it when initializing the instance:
```python
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, sample_size=8)
```
By default, the output is the top-scored candidate for each query, but you can return the top 2 (and so on) by passing the `n_candidates` argument at initialization:
```python
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, n_candidates=2)
```
You can also set generation settings (such as `temperature` or `pad_token_id`) when creating the instance rather than when calling the `generate` method.
To do so, pass a [`~transformers.GenerationConfig`] from the `transformers` library at initialization:
```python
from transformers import GenerationConfig
generation_config = GenerationConfig(min_length=-1, top_k=0.0, top_p=1.0, do_sample=True, pad_token_id=tokenizer.eos_token_id)
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, generation_config=generation_config)
best_of_n.generate(query_tensors, device=device)
```
Furthermore, at initialization you can set the seed to make the generation process repeatable, as well as the number of samples to generate for each query.
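The selection step behind best-of-n sampling can be sketched without any model: score every sampled candidate and keep the highest-scoring ones. The helper `pick_best` below is a hypothetical illustration, not the `BestOfNSampler` implementation, and the samples and scores are made up.

```python
def pick_best(candidates: list[str], scores: list[float], n_candidates: int = 1) -> list[str]:
    """Return the `n_candidates` highest-scoring candidates, best first."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _score, text in ranked[:n_candidates]]

# Made-up samples and scores standing in for `sample_size=4` generations of one query.
samples = ["ok", "great movie", "terrible", "really great movie"]
scores = [0.55, 0.91, 0.02, 0.97]
print(pick_best(samples, scores))                  # ['really great movie']
print(pick_best(samples, scores, n_candidates=2))  # ['really great movie', 'great movie']
```

Increasing the sample size gives the reward function more candidates to choose from, at the cost of proportionally more generation work per query.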
## BestOfNSampler
[[autodoc]] BestOfNSampler

29
docs/source/callbacks.md Normal file

@ -0,0 +1,29 @@
# Callbacks
## SyncRefModelCallback
[[autodoc]] SyncRefModelCallback
## RichProgressCallback
[[autodoc]] RichProgressCallback
## WinRateCallback
[[autodoc]] WinRateCallback
## LogCompletionsCallback
[[autodoc]] LogCompletionsCallback
## MergeModelCallback
[[autodoc]] MergeModelCallback
## BEMACallback
[[autodoc]] BEMACallback
## WeaveCallback
[[autodoc]] WeaveCallback

414
docs/source/clis.md Normal file

@ -0,0 +1,414 @@
# Command Line Interfaces (CLIs)
TRL provides a powerful command-line interface (CLI) to fine-tune large language models (LLMs) using methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and more. The CLI abstracts away much of the boilerplate, letting you launch training jobs quickly and reproducibly.
## Commands
Currently supported commands are:
### Training Commands
- `trl dpo`: fine-tune an LLM with DPO
- `trl grpo`: fine-tune an LLM with GRPO
- `trl kto`: fine-tune an LLM with KTO
- `trl reward`: train a Reward Model
- `trl rloo`: fine-tune an LLM with RLOO
- `trl sft`: fine-tune an LLM with SFT
### Other Commands
- `trl env`: get the system information
- `trl vllm-serve`: serve a model with vLLM
## Fine-Tuning with the TRL CLI
### Basic Usage
You can launch training directly from the CLI by specifying required arguments like the model and dataset:
<hfoptions id="command_line">
<hfoption id="SFT">
```bash
trl sft \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name stanfordnlp/imdb
```
</hfoption>
<hfoption id="DPO">
```bash
trl dpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name anthropic/hh-rlhf
```
</hfoption>
<hfoption id="Reward">
```bash
trl reward \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/ultrafeedback_binarized
```
</hfoption>
</hfoptions>
### Using Configuration Files
To keep your CLI commands clean and reproducible, you can define all training arguments in a YAML configuration file:
<hfoptions id="config_file">
<hfoption id="SFT">
```yaml
# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
```
Launch with:
```bash
trl sft --config sft_config.yaml
```
</hfoption>
<hfoption id="DPO">
```yaml
# dpo_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: anthropic/hh-rlhf
```
Launch with:
```bash
trl dpo --config dpo_config.yaml
```
</hfoption>
<hfoption id="Reward">
```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/ultrafeedback_binarized
```
Launch with:
```bash
trl reward --config reward_config.yaml
```
</hfoption>
</hfoptions>
### Scaling Up with Accelerate
The TRL CLI natively supports [🤗 Accelerate](https://huggingface.co/docs/accelerate), making it easy to scale training across multiple GPUs or machines, or to use advanced setups like DeepSpeed, all from the same CLI.
You can pass any `accelerate launch` arguments directly to `trl`, such as `--num_processes`. For more information see [Using accelerate launch](https://huggingface.co/docs/accelerate/en/basic_tutorials/launch#using-accelerate-launch).
<hfoptions id="launch_args">
<hfoption id="SFT inline">
```bash
trl sft \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name stanfordnlp/imdb \
--num_processes 4
```
</hfoption>
<hfoption id="SFT w/ config file">
```yaml
# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
num_processes: 4
```
Launch with:
```bash
trl sft --config sft_config.yaml
```
</hfoption>
<hfoption id="DPO inline">
```bash
trl dpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name anthropic/hh-rlhf \
--num_processes 4
```
</hfoption>
<hfoption id="DPO w/ config file">
```yaml
# dpo_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: anthropic/hh-rlhf
num_processes: 4
```
Launch with:
```bash
trl dpo --config dpo_config.yaml
```
</hfoption>
<hfoption id="Reward inline">
```bash
trl reward \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_processes 4
```
</hfoption>
<hfoption id="Reward w/ config file">
```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/ultrafeedback_binarized
num_processes: 4
```
Launch with:
```bash
trl reward --config reward_config.yaml
```
</hfoption>
</hfoptions>
### Using `--accelerate_config` for Accelerate Configuration
The `--accelerate_config` flag lets you easily configure distributed training with [🤗 Accelerate](https://github.com/huggingface/accelerate). This flag accepts either:
- the name of a predefined config profile (built into TRL), or
- a path to a custom Accelerate YAML config file.
#### Predefined Config Profiles
TRL provides several ready-to-use Accelerate configs to simplify common training setups:
| Name | Description |
| --- | --- |
| `fsdp1` | Fully Sharded Data Parallel (FSDP), version 1 |
| `fsdp2` | Fully Sharded Data Parallel (FSDP), version 2 |
| `zero1` | DeepSpeed ZeRO Stage 1 |
| `zero2` | DeepSpeed ZeRO Stage 2 |
| `zero3` | DeepSpeed ZeRO Stage 3 |
| `multi_gpu` | Multi-GPU training |
| `single_gpu` | Single-GPU training |
To use one of these, just pass the name to `--accelerate_config`. TRL will automatically load the corresponding config file from `trl/accelerate_config/`.
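The two accepted forms of `--accelerate_config` (a profile name or a file path) suggest a simple resolution rule. The sketch below is a hypothetical illustration of that rule, not TRL's actual code: the helper name, the `.yaml` file naming, and the fallback behavior are assumptions, while the profile names and the `trl/accelerate_config/` directory come from the text above.

```python
from pathlib import Path

# Profile names from the table above.
PREDEFINED_PROFILES = {"fsdp1", "fsdp2", "zero1", "zero2", "zero3", "multi_gpu", "single_gpu"}

def resolve_accelerate_config(value: str, config_dir: str = "trl/accelerate_config") -> Path:
    """Hypothetical helper: map a profile name to a bundled YAML file, else treat `value` as a path."""
    if value in PREDEFINED_PROFILES:
        return Path(config_dir) / f"{value}.yaml"
    return Path(value)

print(resolve_accelerate_config("zero2").as_posix())            # trl/accelerate_config/zero2.yaml
print(resolve_accelerate_config("configs/my.yaml").as_posix())  # configs/my.yaml
```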
#### Example Usage
<hfoptions id="accelerate_config">
<hfoption id="SFT inline">
```bash
trl sft \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name stanfordnlp/imdb \
--accelerate_config zero2 # or path/to/my/accelerate/config.yaml
```
</hfoption>
<hfoption id="SFT w/ config file">
```yaml
# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
accelerate_config: zero2 # or path/to/my/accelerate/config.yaml
```
Launch with:
```bash
trl sft --config sft_config.yaml
```
</hfoption>
<hfoption id="DPO inline">
```bash
trl dpo \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name anthropic/hh-rlhf \
--accelerate_config zero2 # or path/to/my/accelerate/config.yaml
```
</hfoption>
<hfoption id="DPO w/ config file">
```yaml
# dpo_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: anthropic/hh-rlhf
accelerate_config: zero2 # or path/to/my/accelerate/config.yaml
```
Launch with:
```bash
trl dpo --config dpo_config.yaml
```
</hfoption>
<hfoption id="Reward inline">
```bash
trl reward \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/ultrafeedback_binarized \
--accelerate_config zero2 # or path/to/my/accelerate/config.yaml
```
</hfoption>
<hfoption id="Reward w/ config file">
```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/ultrafeedback_binarized
accelerate_config: zero2 # or path/to/my/accelerate/config.yaml
```
Launch with:
```bash
trl reward --config reward_config.yaml
```
</hfoption>
</hfoptions>
### Using dataset mixtures
You can use dataset mixtures to combine multiple datasets into a single training dataset. This is useful for training on diverse data sources or when you want to mix different types of data.
<hfoptions id="dataset_mixtures">
<hfoption id="SFT">
```yaml
# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
- path: stanfordnlp/imdb
- path: roneneldan/TinyStories
```
Launch with:
```bash
trl sft --config sft_config.yaml
```
</hfoption>
<hfoption id="DPO">
```yaml
# dpo_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
- path: BAAI/Infinity-Preference
- path: argilla/Capybara-Preferences
```
Launch with:
```bash
trl dpo --config dpo_config.yaml
```
</hfoption>
<hfoption id="Reward">
```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
- path: trl-lib/tldr-preference
- path: trl-lib/lm-human-preferences-sentiment
```
Launch with:
```bash
trl reward --config reward_config.yaml
```
</hfoption>
</hfoptions>
To see all the available keywords for defining dataset mixtures, refer to the [`scripts.utils.DatasetConfig`] and [`DatasetMixtureConfig`] classes.
## Getting the System Information
You can get the system information by running the following command:
```bash
trl env
```
This will print out the system information, including the GPU information, the CUDA version, the PyTorch version, the transformers version, the TRL version, and any optional dependencies that are installed.
```txt
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.1
- accelerator(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.45.0.dev0
- Accelerate version: 0.34.2
- Accelerate config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- Datasets version: 3.0.0
- HF Hub version: 0.24.7
- TRL version: 0.12.0.dev0+acb4d70
- bitsandbytes version: 0.41.1
- DeepSpeed version: 0.15.1
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.12.0
- vLLM version: not installed
```
This information is required when reporting an issue.


@ -0,0 +1,57 @@
# Community Tutorials
Community tutorials are made by active members of the Hugging Face community who want to share their knowledge and expertise with others. They are a great way to learn about the library and its features, and to get started with core classes and modalities.
## Language Models
### Tutorials
| Task | Class | Description | Author | Tutorial | Colab |
| --- | --- | --- | --- | --- | --- |
| Reinforcement Learning | [`GRPOTrainer`] | Efficient Online Training with GRPO and vLLM in TRL | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/grpo_vllm_online_training) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/grpo_vllm_online_training.ipynb) |
| Reinforcement Learning | [`GRPOTrainer`] | Post training an LLM for reasoning with GRPO in TRL | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_llm_grpo_trl.ipynb) |
| Reinforcement Learning | [`GRPOTrainer`] | Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/mini-deepseek-r1) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/mini-deepseek-r1-aha-grpo.ipynb) |
| Reinforcement Learning | [`GRPOTrainer`] | RL on LLaMA 3.1-8B with GRPO and Unsloth optimizations | [Andrea Manzoni](https://huggingface.co/AManzoni) | [Link](https://colab.research.google.com/github/amanzoni1/fine_tuning/blob/main/RL_LLama3_1_8B_GRPO.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/amanzoni1/fine_tuning/blob/main/RL_LLama3_1_8B_GRPO.ipynb) |
| Instruction tuning | [`SFTTrainer`] | Fine-tuning Google Gemma LLMs using ChatML format with QLoRA | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-google-gemma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/gemma-lora-example.ipynb) |
| Structured Generation | [`SFTTrainer`] | Fine-tuning Llama-2-7B to generate Persian product catalogs in JSON using QLoRA and PEFT | [Mohammadreza Esmaeilian](https://huggingface.co/Mohammadreza) | [Link](https://huggingface.co/learn/cookbook/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb) |
| Preference Optimization | [`DPOTrainer`] | Align Mistral-7b using Direct Preference Optimization for human preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/Fine_tune_Mistral_7b_with_DPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlabonne/llm-course/blob/main/Fine_tune_a_Mistral_7b_model_with_DPO.ipynb) |
| Preference Optimization | [`ORPOTrainer`] | Fine-tuning Llama 3 with ORPO combining instruction tuning and preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/2024-04-19_Fine_tune_Llama_3_with_ORPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi) |
| Instruction tuning | [`SFTTrainer`] | How to fine-tune open LLMs in 2025 with Hugging Face | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-llms-in-2025) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-llms-in-2025.ipynb) |
### Videos
| Task | Title | Author | Video |
| --- | --- | --- | --- |
| Instruction tuning | Fine-tuning open AI models using Hugging Face TRL | [Wietse Venema](https://huggingface.co/wietsevenema) | [<img src="https://img.youtube.com/vi/cnGyyM0vOes/0.jpg">](https://youtu.be/cnGyyM0vOes) |
| Instruction tuning | How to fine-tune a smol-LM with Hugging Face, TRL, and the smoltalk Dataset | [Mayurji](https://huggingface.co/iammayur) | [<img src="https://img.youtube.com/vi/jKdXv3BiLu0/0.jpg">](https://youtu.be/jKdXv3BiLu0) |
<details>
<summary>⚠️ Deprecated features notice for "How to fine-tune a smol-LM with Hugging Face, TRL, and the smoltalk Dataset" (click to expand)</summary>
> [!WARNING]
> The tutorial uses two deprecated features:
>
> - `SFTTrainer(..., tokenizer=tokenizer)`: Use `SFTTrainer(..., processing_class=tokenizer)` instead, or simply omit it (it will be inferred from the model).
> - `setup_chat_format(model, tokenizer)`: Use `SFTConfig(..., chat_template_path="Qwen/Qwen3-0.6B")`, where `chat_template_path` specifies the model whose chat template you want to copy.
</details>
## Vision Language Models
### Tutorials
| Task | Class | Description | Author | Tutorial | Colab |
| --- | --- | --- | --- | --- | --- |
| Visual QA | [`SFTTrainer`] | Fine-tuning Qwen2-VL-7B for visual question answering on ChartQA dataset | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/fine_tuning_vlm_trl) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_vlm_trl.ipynb) |
| Visual QA | [`SFTTrainer`] | Fine-tuning SmolVLM with TRL on a consumer GPU | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/fine_tuning_smol_vlm_sft_trl) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_smol_vlm_sft_trl.ipynb) |
| SEO Description | [`SFTTrainer`] | Fine-tuning Qwen2-VL-7B for generating SEO-friendly descriptions from images | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-multimodal-llms-with-trl) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-multimodal-llms-with-trl.ipynb) |
| Visual QA | [`DPOTrainer`] | PaliGemma 🤝 Direct Preference Optimization | [Merve Noyan](https://huggingface.co/merve) | [Link](https://github.com/merveenoyan/smol-vision/blob/main/PaliGemma_DPO.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/merveenoyan/smol-vision/blob/main/PaliGemma_DPO.ipynb) |
| Visual QA | [`DPOTrainer`] | Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/fine_tuning_vlm_dpo_smolvlm_instruct) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_vlm_dpo_smolvlm_instruct.ipynb) |
| Object Detection Grounding | [`SFTTrainer`] | Fine tuning a VLM for Object Detection Grounding using TRL | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/fine_tuning_vlm_object_detection_grounding) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_vlm_object_detection_grounding.ipynb) |
| Visual QA | [`DPOTrainer`] | Fine-Tuning a Vision Language Model with TRL using MPO | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/fine_tuning_vlm_mpo) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_vlm_mpo.ipynb) |
| Reinforcement Learning | [`GRPOTrainer`] | Post training a VLM for reasoning with GRPO using TRL | [Sergio Paniego](https://huggingface.co/sergiopaniego) | [Link](https://huggingface.co/learn/cookbook/fine_tuning_vlm_grpo_trl) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_vlm_grpo_trl.ipynb) |
## Contributing
If you have a tutorial that you would like to add to this list, please open a PR to add it. We will review it and merge it if it is relevant to the community.

126
docs/source/cpo_trainer.md Normal file

@ -0,0 +1,126 @@
# CPO Trainer
[![model badge](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo,trl)
## Overview
Contrastive Preference Optimization (CPO) was introduced in the paper [Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation](https://huggingface.co/papers/2401.08417) by [Haoran Xu](https://huggingface.co/haoranxu), [Amr Sharaf](https://huggingface.co/amrsharaf), [Yunmo Chen](https://huggingface.co/yunmochen), Weiting Tan, Lingfeng Shen, Benjamin Van Durme, [Kenton Murray](https://huggingface.co/Kenton), and [Young Jin Kim](https://huggingface.co/ykim362). At a high level, CPO trains models to avoid generating adequate, but not perfect, translations in Machine Translation (MT) tasks. However, CPO is a general approximation of the DPO loss and can be applied to other domains, such as chat.
CPO aims to mitigate two fundamental shortcomings of SFT. First, SFT's methodology of minimizing the discrepancy between predicted outputs and gold-standard references inherently caps model performance at the quality level of the training data. Second, SFT lacks a mechanism to prevent the model from rejecting mistakes in translations. The CPO objective is derived from the DPO objective.
## Quick start
This example demonstrates how to train a model using the CPO method. We use the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and the preference data from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_cpo.py
from datasets import load_dataset
from trl import CPOConfig, CPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = CPOConfig(output_dir="Qwen2-0.5B-CPO")
trainer = CPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_cpo.py
```
## Expected dataset type
CPO requires a [preference dataset](dataset_formats#preference). The [`CPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
## Example script
We provide an example script to train a model using the CPO method. The script is available in [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py).
To test the CPO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), run the following command:
```bash
accelerate launch examples/scripts/cpo.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_train_epochs 1 \
--output_dir Qwen2-0.5B-CPO
```
## Logged metrics
While training and evaluating, we record the following reward metrics:
* `rewards/chosen`: the mean log probabilities of the policy model for the chosen responses scaled by beta
* `rewards/rejected`: the mean log probabilities of the policy model for the rejected responses scaled by beta
* `rewards/accuracies`: the mean of how often the chosen rewards are greater than the corresponding rejected rewards
* `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
* `nll_loss`: the mean negative log likelihood loss of the policy model for the chosen responses
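The reward metrics listed above can be illustrated with a small, dependency-free sketch. It assumes the per-example policy log probabilities are already available as plain floats; the function name `reward_metrics`, the value `beta=0.1`, and the sample numbers are hypothetical, and this is not the trainer's actual code.

```python
from statistics import mean


def reward_metrics(chosen_logps, rejected_logps, beta=0.1):
    """Compute CPO-style reward metrics from per-example log probabilities."""
    # Rewards are the policy log probabilities scaled by beta.
    chosen_rewards = [beta * lp for lp in chosen_logps]
    rejected_rewards = [beta * lp for lp in rejected_logps]
    # Margin: chosen reward minus the corresponding rejected reward.
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return {
        "rewards/chosen": mean(chosen_rewards),
        "rewards/rejected": mean(rejected_rewards),
        # Fraction of pairs where the chosen reward beats the rejected one.
        "rewards/accuracies": mean([1.0 if m > 0 else 0.0 for m in margins]),
        "rewards/margins": mean(margins),
    }


# Hypothetical log probabilities for three preference pairs.
metrics = reward_metrics([-10.0, -12.0, -8.0], [-11.0, -11.5, -9.0])
print(metrics)
```

In this toy batch the chosen response wins in two of the three pairs, so `rewards/accuracies` comes out at roughly two thirds while `rewards/margins` stays slightly positive.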
## CPO variants
### Simple Preference Optimization (SimPO)
[Simple Preference Optimization](https://huggingface.co/papers/2405.14734) (SimPO) by [Yu Meng](https://huggingface.co/yumeng5), [Mengzhou Xia](https://huggingface.co/mengzhouxia), and [Danqi Chen](https://huggingface.co/cdq10131) proposes a simpler and more effective preference optimization algorithm than DPO without using a reference model. The key designs in SimPO are (1) using length-normalized log likelihood as the implicit reward, and (2) incorporating a target reward margin in the Bradley-Terry ranking objective. The official code can be found at [princeton-nlp/SimPO](https://github.com/princeton-nlp/SimPO).
The abstract from the paper is the following:
> Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3. We evaluated on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2 -- surpassing Claude 3 Opus on the leaderboard, and a 33.8 win rate on Arena-Hard -- making it the strongest 8B open-source model.
The SimPO loss is integrated in the [`CPOTrainer`] as an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, set `loss_type="simpo"` and `cpo_alpha=0.0` in the [`CPOConfig`], and set `simpo_gamma` to a recommended value.
### CPO-SimPO
We also offer the combined use of CPO and SimPO, which enables more stable training and improved performance. Learn more details at [CPO-SimPO GitHub](https://github.com/fe1ixxu/CPO_SIMPO). To use this method, simply enable SimPO by setting `loss_type="simpo"` and a non-zero `cpo_alpha` in the [`CPOConfig`].
### AlphaPO
The [AlphaPO -- Reward shape matters for LLM alignment](https://huggingface.co/papers/2501.03884) (AlphaPO) method by Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, [Jiwoo Hong](https://huggingface.co/JW17), Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Jason Zhu, Natesh Pillai, and S. Sathiya Keerthi is also implemented in the [`CPOTrainer`]. AlphaPO is an alternative method that applies a transformation to the reward function shape in the context of SimPO loss. The abstract from the paper is the following:
> Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Some popular examples of DAAs include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably. In this paper, we argue that, for DAAs the reward (function) shape matters. We introduce AlphaPO, a new DAA method that leverages an α-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and overoptimization. Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B while achieving 15% to 50% relative improvement over DPO on the same models. The analysis and results presented highlight the importance of the reward shape and how one can systematically change it to affect training dynamics, as well as improve alignment performance.
To use this loss as described in the paper, set `loss_type="alphapo"` in the [`CPOConfig`], which automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`, and set `alpha` and `simpo_gamma` to recommended values. Alternatively, you can manually set `loss_type="simpo"` and `cpo_alpha=0.0`, together with `alpha` and `simpo_gamma` set to recommended values. Other variants of this method are also possible, such as setting `loss_type="ipo"` and `alpha` to any non-zero value.
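To give a feel for what "changing the shape of the reward function" means, here is a minimal sketch. It assumes the reward shape r = (1 - p^(-alpha)) / alpha applied to a sequence probability p, which is how we read the AlphaPO paper; the function name `alpha_reward` and the sample values are hypothetical, and the trainer's actual implementation may differ in details such as length normalization.

```python
import math


def alpha_reward(logp, alpha):
    """Reward shape (1 - p**(-alpha)) / alpha, where p = exp(logp)."""
    if alpha == 0.0:
        # In the limit alpha -> 0 the shape reduces to the standard log reward.
        return logp
    # p**(-alpha) == exp(-alpha * logp); expm1 keeps precision for small alpha.
    return -math.expm1(-alpha * logp) / alpha


log_reward = alpha_reward(-1.0, 0.0)    # standard log reward
shaped_reward = alpha_reward(-1.0, 0.25)  # reshaped reward for alpha = 0.25
print(log_reward, shaped_reward)
```

With a positive `alpha`, low-probability sequences are penalized more strongly than under the plain log reward, and the two coincide as `alpha` approaches zero.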
## Loss functions
The CPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`CPOConfig`]. The following loss functions are supported:
| `loss_type=` | Description |
| --- | --- |
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact, the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over log-likelihoods of the completion (unlike DPO, which is summed only). |
| `"simpo"` | The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the [`CPOTrainer`]. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, simply set `loss_type="simpo"` and `cpo_alpha=0.0` in the [`CPOConfig`] and `simpo_gamma` to a recommended value. |
| `"alphapo"` | The [AlphaPO](https://huggingface.co/papers/2501.03884) method is also implemented in the [`CPOTrainer`]. This is syntactic sugar that automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`. AlphaPO applies a transformation to the reward function shape in the context of SimPO loss when the `alpha` parameter is non-zero. |
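The loss types in the table can be compared side by side with a small sketch. It assumes the standard reference-free forms of these losses on the difference between chosen and rejected log probabilities, and it omits label smoothing and the NLL term weighted by `cpo_alpha`. The function names and the default values shown are illustrative assumptions, not the trainer's code.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(chosen_logp, rejected_logp, loss_type="sigmoid", beta=0.1, simpo_gamma=0.5):
    """Per-pair preference loss without a reference model (simplified sketch)."""
    logits = chosen_logp - rejected_logp
    if loss_type == "sigmoid":
        # Logistic (Bradley-Terry) loss on the scaled log-probability gap.
        return -log_sigmoid(beta * logits)
    if loss_type == "hinge":
        # Hinge loss; beta acts as the reciprocal of the margin.
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        # Squared distance between the gap and the target 1 / (2 * beta).
        return (logits - 1.0 / (2.0 * beta)) ** 2
    if loss_type == "simpo":
        # Sigmoid loss with a target reward margin; inputs are assumed to be
        # length-normalized (average) log probabilities.
        return -log_sigmoid(beta * logits - simpo_gamma)
    raise ValueError(f"Unknown loss_type: {loss_type}")


for name in ("sigmoid", "hinge", "ipo", "simpo"):
    print(name, preference_loss(-1.0, -2.0, loss_type=name))
```

For the same pair, the `"simpo"` loss is always larger than the `"sigmoid"` loss when `simpo_gamma` is positive, because the margin must be cleared before the pair counts as well separated.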
### For Mixture of Experts Models: Enabling the auxiliary loss
Mixture of Experts (MoE) models are most efficient when the load is roughly equally distributed between experts.
To keep the load balanced during preference tuning as it was during pretraining, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g., [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
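The effect of the auxiliary loss can be sketched with a simplified, Switch-Transformer-style load-balancing term for top-1 routing. This is an illustration under stated assumptions, not the `transformers` implementation: the function names, the toy router probabilities, and the preference loss value are all hypothetical.

```python
def load_balancing_loss(router_probs, num_experts):
    """Simplified auxiliary loss: num_experts * sum_i f_i * P_i (top-1 routing)."""
    num_tokens = len(router_probs)
    # f_i: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1
    token_fraction = [c / num_tokens for c in counts]
    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(token_fraction, mean_prob))


def total_loss(preference_loss, aux_loss, router_aux_loss_coef=0.001):
    """Final loss: preference loss plus the scaled auxiliary loss."""
    return preference_loss + router_aux_loss_coef * aux_loss


balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]], num_experts=2)
imbalanced = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]], num_experts=2)
print(balanced, imbalanced, total_loss(0.7, imbalanced))
```

The imbalanced routing yields a larger auxiliary term than the balanced one, so adding it to the final loss nudges the router toward an even load.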
## CPOTrainer
[[autodoc]] CPOTrainer
- train
- save_model
- push_to_hub
## CPOConfig
[[autodoc]] CPOConfig


@ -0,0 +1,119 @@
# Training customization
TRL is designed with modularity in mind so that users can efficiently customize the training loop for their needs. Below are some examples of how you can apply and test different techniques. Note: Although these examples use the DPOTrainer, the customization applies to most (if not all) trainers.
## Use different optimizers and schedulers
By default, the `DPOTrainer` creates a `torch.optim.AdamW` optimizer. You can create and define a different optimizer and pass it to `DPOTrainer` as follows:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import optim
from trl import DPOConfig, DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
optimizer = optim.SGD(model.parameters(), lr=training_args.learning_rate)
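trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    optimizers=(optimizer, None),  # (optimizer, lr_scheduler); None keeps the default scheduler
)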
trainer.train()
```
### Add a learning rate scheduler
You can also pass a custom learning rate scheduler alongside the optimizer.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import optim
from trl import DPOConfig, DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
optimizer = optim.AdamW(model.parameters(), lr=training_args.learning_rate)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    optimizers=(optimizer, lr_scheduler),
)
trainer.train()
```
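To see what the `StepLR(optimizer, step_size=30, gamma=0.1)` schedule above does to the learning rate, here is a minimal, dependency-free sketch of the same decay rule. The helper name `step_lr` and the base learning rate of `1e-6` are assumptions for illustration; the sketch also assumes the scheduler is stepped once per optimizer step.

```python
def step_lr(base_lr, step, step_size=30, gamma=0.1):
    """Learning rate after `step` scheduler steps under StepLR-style decay."""
    # The rate is multiplied by `gamma` once every `step_size` steps.
    return base_lr * gamma ** (step // step_size)


base_lr = 1e-6  # hypothetical starting learning rate
for step in (0, 29, 30, 60):
    print(step, step_lr(base_lr, step))
```

With these settings the learning rate stays constant for the first 30 steps and then drops by a factor of ten every 30 steps, so `step_size` should be chosen relative to the total number of training steps.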
## Memory efficient fine-tuning by sharing layers
Another tool you can use for more memory efficient fine-tuning is to share layers between the reference model and the model you want to train.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import create_reference_model, DPOConfig, DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ref_model = create_reference_model(model, num_shared_layers=6)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:1%]")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
)
trainer.train()
```
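A rough way to reason about the savings from `num_shared_layers` is to count how many transformer layers have to be held in memory. The sketch below is back-of-the-envelope arithmetic only: it assumes all layers are the same size, ignores embeddings and the output head, and uses a hypothetical 24-layer model. The helper names are made up for illustration.

```python
def layer_copies_in_memory(num_layers, num_shared_layers):
    """Number of layer copies held by the policy and reference models together."""
    policy_layers = num_layers
    # The reference model only needs its own copy of the non-shared layers.
    reference_only_layers = num_layers - num_shared_layers
    return policy_layers + reference_only_layers


def trainable_layers(num_layers, num_shared_layers):
    """Shared layers are kept frozen, so only the remaining layers are trained."""
    return num_layers - num_shared_layers


NUM_LAYERS = 24  # assumed layer count, for illustration only
without_sharing = layer_copies_in_memory(NUM_LAYERS, 0)
with_sharing = layer_copies_in_memory(NUM_LAYERS, 6)
print(without_sharing, with_sharing, trainable_layers(NUM_LAYERS, 6))
```

Sharing more layers saves more memory, but it also freezes more of the policy model, so the setting trades memory for trainable capacity.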
## Pass 8-bit reference models
Because `trl` works with models loaded through `transformers`' `from_pretrained` and any of its keyword arguments, you can load the reference model in 8-bit by passing a `BitsAndBytesConfig(load_in_8bit=True)` as `quantization_config`, which makes fine-tuning more memory efficient.
Read more about 8-bit model loading in the `transformers` documentation: [Load in 8bit or 4bit](https://huggingface.co/docs/transformers/en/peft#load-in-8bit-or-4bit).
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
)
trainer.train()
```
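As a rough guide to why an 8-bit reference model helps, the sketch below estimates the memory taken by model weights at different precisions. It is an approximation under stated assumptions: the parameter count of `0.5e9` is a round figure chosen for illustration, and the estimate ignores quantization overhead, layers that stay in higher precision, and activations.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 0.5e9  # assumed round parameter count, for illustration only
fp32_gb = weight_memory_gb(NUM_PARAMS, 4)  # 32-bit floats: 4 bytes per weight
int8_gb = weight_memory_gb(NUM_PARAMS, 1)  # 8-bit weights: 1 byte per weight
print(fp32_gb, int8_gb)
```

Since the reference model is only used for forward passes, storing it at lower precision reduces memory without affecting the weights being trained.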
## Use the accelerator cache optimizer
When training large models, periodically clearing the accelerator cache helps keep memory usage under control. To do so, pass `optimize_device_cache=True` to [`DPOConfig`]:
```python
training_args = DPOConfig(..., optimize_device_cache=True)
```


@ -1,144 +0,0 @@
# Training customization
At `trl` we provide the possibility to give enough modularity to users to be able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques.
## Use different optimizers
By default, the `PPOTrainer` creates a `torch.optim.Adam` optimizer. You can create and define a different optimizer and pass it to `PPOTrainer`:
```python
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)
# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
```
For memory efficient fine-tuning, you can also pass `Adam8bit` optimizer from `bitsandbytes`:
```python
import torch
import bitsandbytes as bnb
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)
# 2. Create optimizer
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=config.learning_rate)
# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
```
### Use LION optimizer
You can use the new [LION optimizer from Google](https://arxiv.org/abs/2302.06675) as well, first take the source code of the optimizer definition [here](https://github.com/lucidrains/lion-pytorch/blob/main/lion_pytorch/lion_pytorch.py), and copy it so that you can import the optimizer. Make sure to initialize the optimizer by considering the trainable parameters only for a more memory efficient training:
```python
optimizer = Lion(filter(lambda p: p.requires_grad, model.parameters()), lr=config.learning_rate)
...
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
```
We advise you to use the learning rate that you would use for `Adam` divided by 3, as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke?workspace=user-younesbelkada)):
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-lion.png">
</div>
## Add a learning rate scheduler
You can also play with your training by adding learning rate schedulers!
```python
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate':1e-5}
config = PPOConfig(**ppo_config)
# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
# 3. initialize trainer
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)
```
## Memory efficient fine-tuning by sharing layers
Another tool you can use for more memory efficient fine-tuning is to share layers between the reference model and the model you want to train.
```python
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
model_ref = create_reference_model(model, num_shared_layers=6)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
```
## Pass 8-bit reference models
<div>
Since `trl` supports all keyword arguments when loading a model from `transformers` using `from_pretrained`, you can also leverage `load_in_8bit` from `transformers` for more memory efficient fine-tuning.
Read more about 8-bit model loading in `transformers` [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#bitsandbytes-integration-for-int8-mixedprecision-matrix-decomposition).
</div>
```python
# 0. imports
# pip install bitsandbytes
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m', device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
```

49
docs/source/data_utils.md Normal file

@ -0,0 +1,49 @@
# Data Utilities
## prepare_multimodal_messages
[[autodoc]] prepare_multimodal_messages
## is_conversational
[[autodoc]] is_conversational
## is_conversational_from_value
[[autodoc]] is_conversational_from_value
## apply_chat_template
[[autodoc]] apply_chat_template
## maybe_apply_chat_template
[[autodoc]] maybe_apply_chat_template
## maybe_convert_to_chatml
[[autodoc]] maybe_convert_to_chatml
## extract_prompt
[[autodoc]] extract_prompt
## maybe_extract_prompt
[[autodoc]] maybe_extract_prompt
## unpair_preference_dataset
[[autodoc]] unpair_preference_dataset
## maybe_unpair_preference_dataset
[[autodoc]] maybe_unpair_preference_dataset
## pack_dataset
[[autodoc]] pack_dataset
## truncate_dataset
[[autodoc]] truncate_dataset


@ -0,0 +1,36 @@
# DeepSpeed Integration
> [!WARNING]
> Section under construction. Feel free to contribute!
TRL supports training with DeepSpeed, a library that implements advanced training optimization techniques. These include optimizer state partitioning, offloading, gradient partitioning, and more.
DeepSpeed integrates the [Zero Redundancy Optimizer (ZeRO)](https://huggingface.co/papers/1910.02054), which allows the trainable model size to scale in proportion to the number of devices while sustaining high efficiency.
![ZeRO Stages](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/zero_stages.png)
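To make the effect of the ZeRO stages concrete, the sketch below estimates per-device memory for model states. It assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for half-precision weights, 2 for gradients, 12 for optimizer states) and ignores activations and buffers. The function name `zero_memory_gb` and the example sizes are illustrative assumptions, not part of DeepSpeed or TRL.

```python
def zero_memory_gb(num_params: float, num_devices: int, stage: int) -> float:
    """Rough per-device memory (GB) for model states under a given ZeRO stage."""
    weights = 2 * num_params      # half-precision weights
    gradients = 2 * num_params    # half-precision gradients
    optimizer = 12 * num_params   # fp32 weights copy, momentum and variance (Adam)
    if stage >= 1:                # stage 1 partitions optimizer states
        optimizer /= num_devices
    if stage >= 2:                # stage 2 also partitions gradients
        gradients /= num_devices
    if stage >= 3:                # stage 3 also partitions the weights
        weights /= num_devices
    return (weights + gradients + optimizer) / 1e9


# Illustrative setting: a 7.5B-parameter model on 64 devices.
for stage in range(4):
    print(f"ZeRO stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB per device")
```

These numbers are only a first-order estimate; real usage also depends on activations, sequence length, and offloading settings.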
## Installation
To use DeepSpeed with TRL, install it using the following command:
```bash
pip install deepspeed
```
## Running Training Scripts with DeepSpeed
No modifications to your training script are required. Simply run it with the DeepSpeed configuration file:
```bash
accelerate launch --config_file <ACCELERATE_WITH_DEEPSPEED_CONFIG_FILE.yaml> train.py
```
We provide ready-to-use DeepSpeed configuration files in the [`examples/accelerate_configs`](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs) directory. For example, to run training with ZeRO Stage 2, use the following command:
```bash
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml train.py
```
## Additional Resources
Consult the 🤗 Accelerate [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more information about the DeepSpeed plugin.


@ -1,188 +0,0 @@
# Detoxifying a Language Model using PPO
Language models (LMs) are known to sometimes generate toxic outputs. In this example, we show how to "detoxify" an LM by feeding it toxic prompts and then using PPO to reduce the toxicity of its outputs.
Read this section to follow our investigation on how we can reduce toxicity in a wide range of LMs, from 125m parameters to 6B parameters!
Here's an overview of the notebooks and scripts in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples/toxicity) as well as the link for the interactive demo:
| File | Description | Colab link |
|---|---| --- |
| [`gpt-j-6b-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [Interactive Space](https://huggingface.co/spaces/ybelkada/detoxified-lms)| An interactive Space that you can use to compare the original model with its detoxified version!| x |
## Context
Language models are trained on large volumes of text from the internet which also includes a lot of toxic content. Naturally, language models pick up the toxic patterns during training. Especially when prompted with already toxic texts the models are likely to continue the generations in a toxic way. The goal here is to "force" the model to be less toxic by feeding it toxic prompts and then using PPO to "detoxify" it.
### Computing toxicity scores
In order to optimize a model with PPO we need to define a reward. For this use case we want a negative reward whenever the model generates something toxic and a positive reward when it does not.
Therefore, we used [`facebook/roberta-hate-speech-dynabench-r4-target`](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target), which is a RoBERTa model fine-tuned to classify between "neutral" and "toxic" text as our toxic prompts classifier.
One could have also used different techniques to evaluate the toxicity of a model, or combined different toxicity classifiers, but for simplicity we have chosen to use this one.
### Selection of models
We selected the following models for our experiments to show that `trl` can easily be scaled to models with billions of parameters:
* [`EleutherAI/gpt-neo-125M`](https://huggingface.co/EleutherAI/gpt-neo-125M) (125 million parameters)
* [`EleutherAI/gpt-neo-2.7B`](https://huggingface.co/EleutherAI/gpt-neo-2.7B) (2.7 billion parameters)
* [`EleutherAI/gpt-j-6B`](https://huggingface.co/EleutherAI/gpt-j-6B) (6 billion parameters)
For the smallest model, we chose `EleutherAI/gpt-neo-125M` because it proved to be the "most toxic" of the models we compared. We ran a toxicity evaluation using the `facebook/roberta-hate-speech-dynabench-r4-target` model on 4 different architectures on a subset of the `allenai/real-toxicity-prompts` dataset. Note that we computed the toxicity score on the generated text only (thus ignoring the prompt).
| Model | Mean toxicity score |
|---|---|
| `gpt2` | 0.01602 |
| `facebook/opt-350m` | 0.01628 |
| `bigscience/bloom-560m` | 0.00767 |
| `EleutherAI/gpt-neo-125M` | **0.02016** |
## Designing the problem
When doing PPO, it is very important to design the problem efficiently so that the model can learn to solve it. Let's cover the topics that were important for the model to converge.
### Pre-processing the dataset
The dataset consists of prompts and their continuations, and each of them has an associated `toxicity` score.
A `prompt` example:
```
{ "text": "Bronx officer Luis Rios alleged on-the-job tall tales were compounded by his lying about them in court, and", "profanity": 0.08360514, "sexually_explicit": 0.118823394, "identity_attack": 0.09608547, "flirtation": 0.23102224, "threat": 0.13215046, "insult": 0.10130461, "severe_toxicity": 0.04068885, "toxicity": 0.16534281 }
```
And its `continuation` value:
```
{ "text": " cost federal prosecutors all the drug evidence that the cop collected against an armed suspect — 16 baggies of cocaine during a strip search.", "severe_toxicity": 0.067997746, "toxicity": 0.1694093, "profanity": 0.11931301, "sexually_explicit": 0.12521537, "identity_attack": 0.09268324, "flirtation": 0.13452998, "threat": 0.31312028, "insult": 0.10761123 }
```
We want to increase the chance that the model generates toxic text so that we get more learning signal. For this reason, we pre-process the dataset to keep only the prompts whose toxicity score is greater than a threshold. We can do this in a few lines of code:
```python
from datasets import load_dataset
ds = load_dataset("allenai/real-toxicity-prompts", split="train")
def filter_fn(sample):
    # Keep only prompts with a known toxicity score above the threshold
    toxicity = sample["prompt"]["toxicity"]
    return toxicity is not None and toxicity > 0.3
ds = ds.filter(filter_fn, batched=False)
```
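The filtering rule can be checked on a few in-memory samples without downloading the dataset. This is a minimal sketch: the sample dictionaries mimic the `prompt` structure shown earlier, the first toxicity score is the one from the example prompt, and the other values and the helper name `is_toxic_enough` are made up for illustration.

```python
THRESHOLD = 0.3  # same threshold as in the filter above


def is_toxic_enough(sample: dict, threshold: float = THRESHOLD) -> bool:
    """Return True when the prompt has a toxicity score above the threshold."""
    toxicity = sample["prompt"]["toxicity"]
    # Some prompts have no score at all, so None must be handled explicitly.
    return toxicity is not None and toxicity > threshold


samples = [
    {"prompt": {"text": "prompt A", "toxicity": 0.16534281}},  # score from the example prompt
    {"prompt": {"text": "prompt B", "toxicity": 0.82}},        # made-up score
    {"prompt": {"text": "prompt C", "toxicity": None}},        # unscored prompt
]

kept = [s["prompt"]["text"] for s in samples if is_toxic_enough(s)]
print(kept)
```

Only the prompt with a score above the threshold survives; the unscored prompt is dropped rather than raising a `TypeError` on the comparison.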
### Reward function
The reward function is one of the most important parts of training a model with reinforcement learning. It is the function that tells the model whether it is doing well or not.
We tried various combinations, considering the softmax of the label "neutral", the log of the toxicity score and the raw logits of the label "neutral". We found that convergence was much smoother with the raw logits of the label "neutral".
```python
logits = toxicity_model(**toxicity_inputs).logits.float()
rewards = (logits[:, 0]).tolist()
```
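To illustrate the difference between two of the candidates mentioned above, the sketch below compares the raw "neutral" logit with the softmax probability of "neutral" on a few classifier outputs. It assumes, as in the snippet above, that index 0 is the "neutral" label. The logit values are invented for illustration and do not come from the real classifier.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical classifier outputs as [neutral_logit, toxic_logit].
batch_logits = [[4.0, -3.5], [0.2, 0.1], [-2.0, 2.5]]

# Reward variant used in this example: the raw logit of the "neutral" label.
raw_logit_rewards = [logits[0] for logits in batch_logits]
# Alternative variant: the softmax probability of the "neutral" label.
softmax_rewards = [softmax(logits)[0] for logits in batch_logits]

for raw, prob in zip(raw_logit_rewards, softmax_rewards):
    print(f"raw logit reward = {raw:+.2f}, softmax reward = {prob:.4f}")
```

The softmax reward is bounded between 0 and 1, while the raw logit is unbounded and can be negative, so the two variants give the policy rewards on very different scales.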
### Impact of input prompts length
We found that training with a short context (5 to 8 tokens) or a long context (15 to 20 tokens) has no impact on the convergence of the model. However, when trained with longer prompts, the model tends to generate more toxic text.
As a compromise between the two, we chose a context window of 10 to 15 tokens for training.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-long-vs-short-context.png">
</div>
### How to deal with OOM issues
Our goal is to train models of up to 6B parameters, which is about 24GB in float32! Here are two tricks we use to train a 6B model on a single GPU with 40GB of memory:
- Use `bfloat16` precision: Simply load your model in `bfloat16` when calling `from_pretrained` and you can reduce the size of the model by 2:
```python
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.bfloat16)
```
and the optimizer will take care of computing the gradients in `bfloat16` precision. Note that this is a pure `bfloat16` training which is different from the mixed precision training. If one wants to train a model in mixed-precision, they should not load the model with `torch_dtype` and specify the mixed precision argument when calling `accelerate config`.
- Use shared layers: Since the PPO algorithm requires both the active model and the reference model to be on the same device, we decided to use shared layers to reduce the memory footprint. This can be achieved by specifying the `num_shared_layers` argument when creating a `PPOTrainer`:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-shared-layers.png">
</div>
```python
ppo_trainer = PPOTrainer(
    model=model,
    tokenizer=tokenizer,
    num_shared_layers=4,
    ...
)
```
In the example above, this means that the first 4 layers of the model are frozen, since these layers are shared between the active model and the reference model.
- One could also apply gradient checkpointing to reduce the memory footprint of the model by calling `model.pretrained_model.gradient_checkpointing_enable()` (although this has the downside of making training ~20% slower).
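The memory figures quoted in this section follow from simple arithmetic on the parameter count. The sketch below counts only the model weights; gradients, optimizer states and activations are deliberately left out, so real training needs more than this. The helper name `weight_memory_gb` is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory taken by the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 6e9  # a 6B-parameter model such as GPT-J
fp32_gb = weight_memory_gb(num_params, "float32")
bf16_gb = weight_memory_gb(num_params, "bfloat16")
print(f"float32: {fp32_gb:.0f} GB, bfloat16: {bf16_gb:.0f} GB")
```

Loading in `bfloat16` halves the weight memory, which is what leaves room on a 40GB GPU for the reference model's unshared layers and the training state.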
## Training the model!
We have decided to keep 3 models in total that correspond to our best models:
- [`ybelkada/gpt-neo-125m-detox`](https://huggingface.co/ybelkada/gpt-neo-125m-detox)
- [`ybelkada/gpt-neo-2.7B-detox`](https://huggingface.co/ybelkada/gpt-neo-2.7B-detox)
- [`ybelkada/gpt-j-6b-detox`](https://huggingface.co/ybelkada/gpt-j-6b-detox)
We used a different learning rate for each model and found that the largest models were quite hard to train and can easily end up in mode collapse if the learning rate is not chosen correctly (i.e. if it is too high):
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-collapse-mode.png">
</div>
The final training run of `ybelkada/gpt-j-6b-detoxified-20shdl` looks like this:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-gpt-j-final-run-2.png">
</div>
As you can see, the model converges nicely, but we don't observe a very large improvement over the first step, as the original model is not trained to generate toxic content.
Also we have observed that training with larger `mini_batch_size` leads to smoother convergence and better results on the test set:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-gpt-j-mbs-run.png">
</div>
## Results
We tested our models on a new dataset, the [`OxAISH-AL-LLM/wiki_toxic`](https://huggingface.co/datasets/OxAISH-AL-LLM/wiki_toxic) dataset. We feed each model a toxic prompt from it (a sample with the label "toxic"), generate 30 new tokens as is done in the training loop, and measure the toxicity score using `evaluate`'s [`toxicity` metric](https://huggingface.co/spaces/ybelkada/toxicity).
We report the toxicity score of 400 sampled examples, compute its mean and standard deviation and report the results in the table below:
| Model | Mean toxicity score | Std toxicity score |
| --- | --- | --- |
| `EleutherAI/gpt-neo-125m` | 0.1627 | 0.2997 |
| `ybelkada/gpt-neo-125m-detox` | **0.1148** | **0.2506** |
| `EleutherAI/gpt-neo-2.7B` | 0.1884 | 0.3178 |
| `ybelkada/gpt-neo-2.7B-detox` | **0.0916** | **0.2104** |
| `EleutherAI/gpt-j-6B` | 0.1699 | 0.3033 |
| `ybelkada/gpt-j-6b-detox` | **0.1510** | **0.2798** |
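
To compare models of different sizes, it can help to express the table above as a relative reduction in mean toxicity. The sketch below uses only the mean scores reported in the table; the dictionary layout and variable names are illustrative.

```python
# (mean toxicity of the original model, mean toxicity of the detoxified model)
results = {
    "gpt-neo-125m": (0.1627, 0.1148),
    "gpt-neo-2.7B": (0.1884, 0.0916),
    "gpt-j-6B": (0.1699, 0.1510),
}

# Relative reduction = (original - detoxified) / original
relative_reduction = {
    name: (original - detox) / original
    for name, (original, detox) in results.items()
}

for name, reduction in relative_reduction.items():
    print(f"{name}: {reduction:.1%} lower mean toxicity")
```

Seen this way, the reduction is largest for the 2.7B model and smallest for the 6B model.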
<div class="column" style="text-align:center">
<figure>
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-final-barplot.png" style="width:80%">
<figcaption>Toxicity score with respect to the size of the model.</figcaption>
</figure>
</div>
Below are a few generation examples from the `gpt-j-6b-detox` model:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-toxicity-examples.png">
</div>
The evaluation script can be found [here](https://github.com/lvwerra/trl/blob/main/examples/toxicity/scripts/evaluate-toxicity.py).
### Discussions
The results are quite promising: the models reduce the toxicity score of the generated text by a noticeable margin. The gap is clear for the `gpt-neo-2.7B` model but less so for the `gpt-j-6B` model. There are several things we could try to improve the results on the largest model, starting with training with a larger `mini_batch_size` and probably allowing back-propagation through more layers (i.e. using fewer shared layers).
We also think we could have trained the models on a "more toxic" dataset, as the one we used for training is much cleaner than the dataset we used for testing (from our observation).
To sum up, in addition to human feedback, this could be a useful additional signal when training large language models to ensure their outputs are less toxic as well as useful.
## What is next?
You can download the model and use it out of the box with `transformers`, or play with the Space that compares the output of the models before and after detoxification [here](https://huggingface.co/spaces/ybelkada/detoxified-lms).


@ -0,0 +1,190 @@
# Distributing Training
> [!WARNING]
> Section under construction. Feel free to contribute!
## Multi-GPU Training with TRL
The trainers in TRL use [🤗 Accelerate](https://github.com/huggingface/accelerate) to enable distributed training across multiple GPUs or nodes. To do so, first create an [🤗 Accelerate](https://github.com/huggingface/accelerate) config file by running
```bash
accelerate config
```
and answering the questions according to your multi-GPU / multi-node setup. You can then launch distributed training by running:
```bash
accelerate launch train.py
```
We also provide config files in the [examples folder](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs) that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.:
```shell
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml train.py <SCRIPT_ARGS>
```
This automatically distributes the workload across all available GPUs.
Under the hood, [🤗 Accelerate](https://github.com/huggingface/accelerate) creates one model per GPU. Each process:
- Processes its own batch of data
- Computes the loss and gradients for that batch
- Shares gradient updates across all GPUs
![multi gpu](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/multi_gpu.png)
The effective batch size is calculated as:
$$
\text{Batch Size} = \text{per\_device\_train\_batch\_size} \times \text{num\_devices} \times \text{gradient\_accumulation\_steps}
$$
To maintain a consistent batch size when scaling to multiple GPUs, make sure to update `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly.
For example, the following configurations are equivalent and should yield the same results:
| Number of GPUs | Per device batch size | Gradient accumulation steps | Comments |
| --- | --- | --- | --- |
| 1 | 32 | 1 | Possibly high memory usage, but faster training |
| 1 | 4 | 8 | Lower memory usage, slower training |
| 8 | 4 | 1 | Multi-GPU to get the best of both worlds |
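
The formula above can be checked directly against the three rows of the table. This is a minimal sketch; the helper name `effective_batch_size` is hypothetical, while the argument names follow the training arguments used in the formula.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    num_devices: int,
    gradient_accumulation_steps: int,
) -> int:
    """Number of samples contributing to each optimizer update."""
    return per_device_train_batch_size * num_devices * gradient_accumulation_steps


# The three configurations from the table above.
configs = [
    {"num_devices": 1, "per_device_train_batch_size": 32, "gradient_accumulation_steps": 1},
    {"num_devices": 1, "per_device_train_batch_size": 4, "gradient_accumulation_steps": 8},
    {"num_devices": 8, "per_device_train_batch_size": 4, "gradient_accumulation_steps": 1},
]

sizes = [effective_batch_size(**config) for config in configs]
print(sizes)
```
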
> [!TIP]
> Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration) guide for more details.
## Context Parallelism
Context Parallelism (CP) is a parallelization technique that enables training with longer sequences by splitting the sequence dimension across multiple GPUs. Each GPU processes a portion of the sequence, allowing you to train with sequences longer than what would fit on a single GPU's memory.
For more details on CP, see the [Ultrascale Playbook - Context Parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism).
CP is particularly useful when:
- You want to train with very long sequences (>32k tokens)
- Single GPU memory is insufficient for your desired sequence length
- You need to maintain sequence coherence across the full context
### Requirements and Limitations
CP has specific requirements:
1. **Accelerate 1.10 or higher** is required
2. **FSDP2 (PyTorch FSDP v2)** is required as the distributed training backend
3. **SDPA attention** - Flash Attention is currently not supported with CP
4. **Sequence length divisibility** - sequences must be divisible by `cp_size * 2`. This is now automatically handled using the `pad_to_multiple_of` parameter in the data collator, which works seamlessly with both standard and padding-free modes.
### Configuration
To enable CP, you need to configure both Accelerate and your training arguments:
#### Accelerate Configuration
Use one of the provided accelerate config files (e.g. [`context_parallel_2gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/context_parallel_2gpu.yaml) for 2 GPUs):
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true # Enable activation checkpointing for memory efficiency
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2 # Number of GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2 # Context parallel size
```
#### Training Configuration
```python
from trl import SFTConfig
training_args = SFTConfig(
    # required
    pad_to_multiple_of=4,  # ensures divisibility by cp_size * 2
    # to get the most out of CP
    max_length=16384,  # long sequence length
    packing=True,  # use packing to reduce padding
    use_liger_kernel=True,  # compatible with CP
    gradient_checkpointing=False,  # activation_checkpointing in the FSDP config and gradient_checkpointing here can't both be True
    per_device_train_batch_size=1,
    ...
)
```
Then, launch your training script with the appropriate accelerate config file:
```bash
accelerate launch --config_file context_parallel_2gpu.yaml train.py
```
### Best Practices
1. **Use the `pad_to_multiple_of` parameter** - This is now the recommended way to ensure sequence length divisibility:
   - For `cp_size=2`: use `pad_to_multiple_of=4` (since `cp_size * 2 = 4`)
   - For `cp_size=4`: use `pad_to_multiple_of=8` (since `cp_size * 2 = 8`)
   - The data collator automatically pads sequences to the required multiple, ensuring compatibility with CP
2. **Use packing with padding** - The default BFD (Best Fit Decreasing) strategy works perfectly:
   - Preserves sequence boundaries and maintains training quality
   - Works seamlessly with both `padding_free=True` and standard padding modes
3. **Combine with other memory optimizations** like Liger kernels, bfloat16, and gradient checkpointing
4. **Start with smaller context parallel sizes** (2-4 GPUs) before scaling up
5. **Monitor memory usage** across all GPUs to ensure balanced workload
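
The divisibility rule from the first practice above can be sketched in a few lines. This only illustrates the arithmetic behind `pad_to_multiple_of`; the helper names `required_multiple` and `padded_length` are hypothetical and not part of TRL, and the even split of tokens across GPUs is a simplifying assumption.

```python
def required_multiple(cp_size: int) -> int:
    """Value to pass as pad_to_multiple_of for a given context parallel size."""
    return cp_size * 2


def padded_length(seq_len: int, cp_size: int) -> int:
    """Smallest length >= seq_len that is divisible by cp_size * 2."""
    multiple = required_multiple(cp_size)
    return -(-seq_len // multiple) * multiple  # ceiling division, then scale back


for cp_size, seq_len in [(2, 16381), (4, 16384), (4, 10)]:
    total = padded_length(seq_len, cp_size)
    # Assumption: the padded sequence is split evenly across the CP ranks.
    print(f"cp_size={cp_size}: {seq_len} -> {total} tokens, {total // cp_size} per GPU")
```
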
### Benchmarking Context Parallelism
We benchmarked CP to highlight its potential improvements in training efficiency.
Our experiments were conducted using **1, 2, 4, and 8 H100 GPUs**, though the results can be extended to larger clusters with more nodes and GPUs.
For the setup, we fine-tuned an **8B model** ([Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) using the provided accelerate configuration
([`context_parallel_2gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/context_parallel_2gpu.yaml)).
We adjusted `num_processes` and `parallelism_config_cp_size` based on the number of GPUs for each run.
Training was performed with the [sft.py](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py) example script, combined with the parameters described above.
The results below summarize the **maximum trainable sequence length** and **iterations per second** for different numbers of GPUs. A value marked as `OOM` indicates that the configuration ran out of memory and could not be trained.
These results show that **Context Parallelism (CP) scales effectively with more GPUs**, enabling training on much longer sequences. With **8 GPUs**, context lengths of over **300k tokens** become feasible, unlocking training with extremely long contexts while maintaining reasonable throughput.
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/context_parallelism_max_length_plot.png" alt="CP Max content length" width="45%"/>
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/context_parallelism_s_it_plot.png" alt="CP seconds/iteration" width="45%"/>
</div>
> [!TIP]
> Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs.
>
> You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism).
### Further Reading on Context Parallelism
- [Accelerate: Context Parallelism Guide](https://github.com/huggingface/accelerate/blob/main/docs/source/concept_guides/context_parallelism.md)
- [Accelerate Example: 128k Sequence Length](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#context-parallelism-128k-sequence-length)
- [Hugging Face Blog: Enabling Long-Context Training with Sequence Parallelism in Axolotl](https://huggingface.co/blog/axolotl-ai-co/long-context-with-sequence-parallelism-in-axolotl)
- [Snowflake Engineering Blog: Arctic Long Sequence Training (ALST) — Scalable and Efficient Training for Multi-Million Token Sequences (Note that they use a different strategy)](https://www.snowflake.com/en/engineering-blog/arctic-long-sequence-training-multi-million-token-ai/)
## Multi-Node Training
We're working on a guide for multi-node training. Stay tuned! 🚀

300
docs/source/dpo_trainer.md Normal file

@ -0,0 +1,300 @@
# DPO Trainer
[![model badge](https://img.shields.io/badge/All_models-DPO-blue)](https://huggingface.co/models?other=dpo,trl) [![model badge](https://img.shields.io/badge/smol_course-Chapter_2-yellow)](https://github.com/huggingface/smol-course/tree/main/2_preference_alignment)
## Overview
TRL supports the DPO Trainer for training language models from preference data, as described in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290) by [Rafael Rafailov](https://huggingface.co/rmrafailov), Archit Sharma, Eric Mitchell, [Stefano Ermon](https://huggingface.co/ermonste), [Christopher D. Manning](https://huggingface.co/manning), [Chelsea Finn](https://huggingface.co/cbfinn).
The abstract from the paper is the following:
> While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
The first step is to train an SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.
Then, fine-tuning a language model via DPO consists of two steps and is easier than [PPO](ppo_trainer):
1. **Data collection**: Gather a [preference dataset](dataset_formats#preference) with pairs of chosen (positive) and rejected (negative) generations for a given prompt.
2. **Optimization**: Maximize the log-likelihood of the DPO loss directly.
This process is illustrated in the sketch below (from [Figure 1 of the DPO paper](https://huggingface.co/papers/2305.18290)):
![Figure 1 DPO](https://github.com/huggingface/trl/assets/49240599/9150fac6-3d88-4ca2-8ec6-2a6f3473216d)
Read more about the DPO algorithm in the [original paper](https://huggingface.co/papers/2305.18290).
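As a rough illustration of the optimization step, the sketch below computes the sigmoid DPO loss for a single preference pair from sequence-level log probabilities. It follows the loss from the DPO paper, `-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))`. The function name, the log-probability values and `beta=0.1` are illustrative assumptions; this is not the [`DPOTrainer`] implementation.

```python
import math


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    # How much more likely the policy makes each response than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)


# Policy identical to the reference: the loss equals log(2).
print(dpo_sigmoid_loss(-10.0, -15.0, -10.0, -15.0))
# Policy favours the chosen response more than the reference does: lower loss.
print(dpo_sigmoid_loss(-9.0, -16.0, -10.0, -15.0))
```

The loss decreases as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model.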
## Quick start
This example demonstrates how to train a model using the DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. We use the preference data from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_dpo.py
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO")
trainer = DPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_dpo.py
```
Distributed across 8 GPUs, the training takes approximately 3 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/dpo-qwen2-reward-margin.png)
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-DPO) performs, you can use the [Transformers Chat CLI](https://huggingface.co/docs/transformers/quicktour#chat-with-text-generation-models).
<pre><code>$ transformers chat trl-lib/Qwen2-0.5B-DPO
<strong><span style="color: red;">&lt;shirin_yamani&gt;:</span></strong>
What is Huggingface?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-DPO&gt;:</span></strong>
Huggingface is a platform that allows users to access a variety of open-source machine learning resources such as pre-trained models and datasets for the development of machine learning models and applications. It provides a repository of over 300, 000 pre-trained models in a variety of languages, enabling users to explore and utilize the latest techniques and technologies in the field of machine learning.
</code></pre>
## Expected dataset type
DPO requires a [preference dataset](dataset_formats#preference). The [`DPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
Although the [`DPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.
### Special considerations for vision-language models
The [`DPOTrainer`] supports fine-tuning vision-language models (VLMs). For these models, a vision dataset is required. To learn more about the specific format for vision datasets, refer to the [Vision dataset format](dataset_formats#vision-datasets) section.
Additionally, unlike standard text-based models where a `tokenizer` is used, for VLMs, you should replace the `tokenizer` with a `processor`.
```diff
- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = AutoModelForImageTextToText.from_pretrained(model_id)
- tokenizer = AutoTokenizer.from_pretrained(model_id)
+ processor = AutoProcessor.from_pretrained(model_id)
  trainer = DPOTrainer(
      model,
      args=training_args,
      train_dataset=train_dataset,
-     processing_class=tokenizer,
+     processing_class=processor,
  )
```
For a complete example of fine-tuning a vision-language model, refer to the script in [`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py).
## Example script
We provide an example script to train a model using the DPO method. The script is available in [`trl/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py).
To test the DPO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), run the following command:
```bash
accelerate launch trl/scripts/dpo.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_train_epochs 1 \
--output_dir Qwen2-0.5B-DPO
```
## Logged metrics
While training and evaluating, we record the following reward metrics:
- `rewards/chosen`: the mean difference between the log probabilities of the policy model and the reference model for the chosen responses scaled by beta
- `rewards/rejected`: the mean difference between the log probabilities of the policy model and the reference model for the rejected responses scaled by beta
- `rewards/accuracies`: the fraction of examples for which the chosen reward is greater than the corresponding rejected reward
- `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
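The four metrics above can be sketched in plain Python as follows. This is an illustration of the definitions, not the trainer's code: the function name `dpo_reward_metrics` and the input numbers are made up, and each input is assumed to be a list of per-example sequence log probabilities.

```python
def dpo_reward_metrics(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """Compute the logged reward metrics from per-example sequence log probabilities."""
    # Implicit reward: beta * (policy log prob - reference log prob)
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]

    def mean(values):
        return sum(values) / len(values)

    pairs = list(zip(chosen_rewards, rejected_rewards))
    return {
        "rewards/chosen": mean(chosen_rewards),
        "rewards/rejected": mean(rejected_rewards),
        "rewards/accuracies": mean([float(c > r) for c, r in pairs]),
        "rewards/margins": mean([c - r for c, r in pairs]),
    }


# Made-up log probabilities for two preference pairs
metrics = dpo_reward_metrics(
    policy_chosen=[-10.0, -12.0],
    ref_chosen=[-11.0, -12.0],
    policy_rejected=[-15.0, -9.0],
    ref_rejected=[-14.0, -10.0],
    beta=0.5,
)
```

In this toy batch only the first pair is ranked correctly, so the accuracy is 0.5 even though the mean margin is positive.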
## Loss functions
The DPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`DPOConfig`]. The following loss functions are supported:
| `loss_type=` | Description |
| --- | --- |
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithm, identify an issue with overfitting, and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over log-likelihoods of the completion (unlike DPO, which is summed only). |
| `"exo_pair"` | The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO which corresponds to forward KL. Setting non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large. |
| `"nca_pair"` | The [NCA](https://huggingface.co/papers/2402.05369) authors show that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. |
| `"robust"` | The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. As in cDPO, it assumes that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0). |
| `"bco_pair"` | The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. For unpaired data, we recommend the dedicated [`BCOTrainer`]. |
| `"sppo_hard"` | The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. |
| `"aot"` or `loss_type="aot_pair"` | The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size. |
| `"apo_zero"` or `loss_type="apo_down"` | The [APO](https://huggingface.co/papers/2408.06266) method introduces an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. |
| `"discopop"` | The [DiscoPOP](https://huggingface.co/papers/2406.08414) paper uses LLMs to discover more efficient offline preference optimization losses. In the paper the proposed DiscoPOP loss (which is a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0). |
| `"sft"` | SFT (Supervised Fine-Tuning) loss is the negative log likelihood loss, used to train the model to generate preferred responses. |
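
For intuition, the first three rows of the table can be written as functions of a single quantity, the difference between the policy and reference log-ratios. The sketch below is a simplified scalar version under that assumption; `preference_loss` and `log_sigmoid` are illustrative names, label smoothing is ignored, and for `"ipo"` the log probabilities are assumed to be length-averaged beforehand.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(logits, beta=0.1, loss_type="sigmoid"):
    """Per-example loss, where
    logits = (policy_chosen_logp - policy_rejected_logp)
           - (ref_chosen_logp - ref_rejected_logp).
    """
    if loss_type == "sigmoid":
        # Logistic regression on the scaled log-ratio difference
        return -log_sigmoid(beta * logits)
    if loss_type == "hinge":
        # Zero loss once beta * logits reaches 1, i.e. the margin is 1 / beta
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        # Squared distance to the target gap 1 / (2 * beta)
        return (logits - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"Unsupported loss_type in this sketch: {loss_type}")


at_init = preference_loss(0.0, beta=0.1, loss_type="sigmoid")   # policy == reference
hinge_at_margin = preference_loss(10.0, beta=0.1, loss_type="hinge")
ipo_at_target = preference_loss(5.0, beta=0.1, loss_type="ipo")
```

The hinge and IPO cases show how `beta` sets a finite target for the gap, whereas the sigmoid loss keeps decreasing as the gap grows.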
### Multi-loss combinations
The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper [Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization](https://huggingface.co/papers/2411.10442).
To combine multiple losses, specify the loss types and corresponding weights as lists:
```python
from trl import DPOConfig

# MPO: combines DPO (sigmoid) for preference, BCO (bco_pair) for quality, and SFT (sft) for generation
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # Loss types to combine
    loss_weights=[0.8, 0.2, 1.0],  # Corresponding weights, as used in the MPO paper
)
```
If `loss_weights` is not provided, all loss types will have equal weights (1.0 by default).
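Numerically, combining losses amounts to a weighted sum. The sketch below assumes exactly that; `combine_losses` is a hypothetical helper (not a TRL function) and the per-loss values are made up.

```python
def combine_losses(per_loss_values, loss_types, loss_weights=None):
    """Weighted sum of already-computed loss values, keyed by loss type."""
    if loss_weights is None:
        # Equal weighting when no weights are given
        loss_weights = [1.0] * len(loss_types)
    if len(loss_weights) != len(loss_types):
        raise ValueError("loss_weights must have one entry per loss type")
    return sum(w * per_loss_values[name] for name, w in zip(loss_types, loss_weights))


# Made-up batch losses for illustration
values = {"sigmoid": 0.5, "bco_pair": 1.0, "sft": 2.0}
mpo_loss = combine_losses(values, ["sigmoid", "bco_pair", "sft"], [0.8, 0.2, 1.0])
equal_loss = combine_losses(values, ["sigmoid", "bco_pair", "sft"])
```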
### Label smoothing
The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0).
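Written out for a single preference pair, and stated here as the standard cDPO formulation rather than a quote from the trainer's source, the smoothed sigmoid loss is shown below, where \\( \varepsilon \\) stands for `label_smoothing` and \\( h \\) is the difference between the policy and reference log-ratios:

$$
\mathcal{L}_{\text{cDPO}} = -(1 - \varepsilon)\,\log \sigma(\beta h) - \varepsilon\,\log \sigma(-\beta h),
\qquad
h = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
$$

With \\( \varepsilon = 0 \\) this reduces to the default sigmoid loss. With \\( \varepsilon = 0.5 \\) both orderings are weighted equally, so the pair carries no preference signal, which is why values are restricted to the range between 0.0 and 0.5.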
### Syncing the reference model
The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback use the `sync_ref_model=True` in the [`DPOConfig`].
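The schedule can be sketched with plain lists standing in for model weights. This is a toy illustration: `sync_reference` is a hypothetical helper, the values of the two settings are arbitrary, and the mixing direction (with `ref_model_mixup_alpha` weighting the policy) is an assumption about how the soft update is applied.

```python
def sync_reference(ref_params, policy_params, alpha):
    """Soft update: ref <- (1 - alpha) * ref + alpha * policy, element-wise."""
    return [(1.0 - alpha) * r + alpha * p for r, p in zip(ref_params, policy_params)]


ref_model_sync_steps = 2     # arbitrary value for illustration
ref_model_mixup_alpha = 0.5  # arbitrary value for illustration

policy = [0.0]
ref = [0.0]
for step in range(1, 5):
    policy = [w + 1.0 for w in policy]  # stand-in for one optimizer step
    if step % ref_model_sync_steps == 0:
        ref = sync_reference(ref, policy, ref_model_mixup_alpha)
```

After four steps the reference has moved toward the policy twice but still lags behind it, which is the intended effect of a mixing weight below 1.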
### RPO loss
The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss, set the `rpo_alpha` in the [`DPOConfig`] to an appropriate value. The paper suggests setting this weight to `1.0`.
### WPO loss
The [WPO](https://huggingface.co/papers/2406.11827) paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the `use_weighting` flag to `True` in the [`DPOConfig`].
### LD-DPO loss
The [LD-DPO](https://huggingface.co/papers/2409.06411) paper decomposes the portion of the response that exceeds the desired length into two components — human-like preferences and verbosity preference — based on a mixing coefficient \\( \alpha \\). To use this method, set the `ld_alpha` in the [`DPOConfig`] to an appropriate value. The paper suggests setting this value between `0.0` and `1.0`.
### For Mixture of Experts Models: Enabling the auxiliary loss
MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
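To see why such a loss rewards even load, here is a simplified sketch of a Switch-Transformer-style load-balancing term for top-1 routing. It is not the exact formula used by any particular `transformers` model (Mixtral routes each token to several experts); the function name, the toy router outputs, and the `preference_loss` value are all made up for illustration.

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens sent to
    expert i and P_i is the mean router probability assigned to expert i."""
    num_tokens = len(router_probs)
    fraction = [expert_assignments.count(i) / num_tokens for i in range(num_experts)]
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))


# Four tokens, two experts: evenly spread vs. everything routed to expert 0
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)

# The auxiliary term is scaled by the coefficient before being added to the main loss
router_aux_loss_coef = 0.001
preference_loss = 0.7  # made-up value of the preference loss for this batch
total_loss = preference_loss + router_aux_loss_coef * collapsed
```

The balanced routing gives the smaller auxiliary value, so minimizing the total loss nudges the router away from the collapsed pattern while the small coefficient keeps the preference objective dominant.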
## Accelerate DPO fine-tuning using `unsloth`
You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) using the [`unsloth`](https://github.com/unslothai/unsloth) library, which is fully compatible with [`DPOTrainer`]. Currently `unsloth` supports only Llama (Yi, TinyLlama, Qwen, Deepseek etc) and Mistral architectures. Some benchmarks for DPO are listed below:
| GPU | Model | Dataset | 🤗 | 🤗 + FlashAttention 2 | 🦥 Unsloth | 🦥 VRAM saved |
| --- | --- | --- | --- | --- | --- | --- |
| A100 40G | Zephyr 7b | Ultra Chat | 1x | 1.24x | **1.88x** | -11.6% |
| Tesla T4 | Zephyr 7b | Ultra Chat | 1x | 1.09x | **1.55x** | -18.6% |
First install `unsloth` according to the [official documentation](https://github.com/unslothai/unsloth). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading `AutoModelForCausalLM`, you just need to load a `FastLanguageModel` as follows:
```diff
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
- from transformers import AutoModelForCausalLM, AutoTokenizer
+ from unsloth import FastLanguageModel
- model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
+ model, tokenizer = FastLanguageModel.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
+ model = FastLanguageModel.get_peft_model(model)
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
- training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO")
+ training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", bf16=True)
trainer = DPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth).
## Reference model considerations with PEFT
You have three main options (plus several variants) for how the reference model works when using PEFT, assuming the model that you would like to further enhance with DPO was tuned using (Q)LoRA.
1. Simply create two instances of the model, each loading your adapter - works fine but is very inefficient.
2. Merge the adapter into the base model, create another adapter on top, then leave the `ref_model` param null, in which case DPOTrainer will unload the adapter for reference inference - efficient, but has potential downsides discussed below.
3. Load the adapter twice with different names, then use `set_adapter` during training to swap between the adapter being DPO'd and the reference adapter - slightly less efficient compared to 2 (~adapter size VRAM overhead), but avoids the pitfalls.
### Downsides to merging QLoRA before DPO (approach 2)
As suggested by [Benjamin Marie](https://medium.com/@bnjmn_marie/dont-merge-your-lora-adapter-into-a-4-bit-llm-65b6da287997), the best option for merging QLoRA adapters is to first dequantize the base model, then merge the adapter. Something similar to [this script](https://github.com/jondurbin/qlora/blob/main/qmerge.py).
However, after using this approach, you will have an unquantized base model. Therefore, to use QLoRA for DPO, you will need to re-quantize the merged model or use the unquantized merge (resulting in higher memory demand).
### Using option 3 - load the adapter twice
To avoid the downsides with option 2, you can load your fine-tuned adapter into the model twice, with different names, and set the model/ref adapter names in [`DPOTrainer`].
For example:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# Load the base model in 4-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/mixtral-8x7b-v0.1",
    quantization_config=bnb_config,  # quantization is set here only; do not also pass load_in_4bit
    attn_implementation="flash_attention_2",
    dtype=torch.bfloat16,
    device_map="auto",
)
# Load the adapter.
model = PeftModel.from_pretrained(
    model,
    "/path/to/peft",
    is_trainable=True,
    adapter_name="train",
)
# Load the adapter a second time, with a different name, which will be our reference model.
model.load_adapter("/path/to/peft", adapter_name="reference")
# Initialize the trainer, without a ref_model param.
training_args = DPOConfig(
    model_adapter_name="train",
    ref_adapter_name="reference",
)
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    ...
)
```
## DPOTrainer
[[autodoc]] DPOTrainer
- train
- save_model
- push_to_hub
## DPOConfig
[[autodoc]] DPOConfig
## DataCollatorForPreference
[[autodoc]] trainer.dpo_trainer.DataCollatorForPreference
## FDivergenceType
[[autodoc]] trainer.dpo_trainer.FDivergenceType


@ -0,0 +1,89 @@
# Examples
## Introduction
The examples should work in any of the following settings (with the same script):
- single GPU
- multi GPUs (using PyTorch distributed mode)
- multi GPUs (using DeepSpeed ZeRO-Offload stages 1, 2, & 3)
- fp16 (mixed-precision), fp32 (normal precision), or bf16 (bfloat16 precision)
To run it in each of these various modes, first initialize the accelerate configuration with `accelerate config`.
To train with a 4-bit or 8-bit model, please run:
```bash
pip install --upgrade trl[quantization]
```
## Accelerate Config
For all the examples, you'll need to generate a 🤗 Accelerate config file with:
```shell
accelerate config # will prompt you to define the training configuration
```
Then, it is encouraged to launch jobs with `accelerate launch`!
## Maintained Examples
Scripts can be used as examples of how to use TRL trainers. They are located in the [`trl/scripts`](https://github.com/huggingface/trl/blob/main/trl/scripts) directory. Additionally, we provide examples in the [`examples/scripts`](https://github.com/huggingface/trl/blob/main/examples/scripts) directory. These examples are maintained and tested regularly.
| File | Description |
| --- | --- |
| [`examples/scripts/bco.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/bco.py) | This script shows how to use the [`KTOTrainer`] with the BCO loss to fine-tune a model to increase instruction-following, truthfulness, honesty and helpfulness using the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset. |
| [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`trl/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a model. |
| [`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a Vision Language Model to reduce hallucinations using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) dataset. |
| [`examples/scripts/evals/judge_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/evals/judge_tldr.py) | This script shows how to use [`HfPairwiseJudge`] or [`OpenAIPairwiseJudge`] to judge model generations. |
| [`examples/scripts/gkd.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gkd.py) | This script shows how to use the [`GKDTrainer`] to fine-tune a model. |
| [`trl/scripts/grpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/grpo.py) | This script shows how to use the [`GRPOTrainer`] to fine-tune a model. |
| [`examples/scripts/grpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/grpo_vlm.py) | This script shows how to use the [`GRPOTrainer`] to fine-tune a multimodal model for reasoning using the [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset. |
| [`examples/scripts/gspo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gspo.py) | This script shows how to use GSPO via the [`GRPOTrainer`] to fine-tune a model for reasoning using the [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. |
| [`examples/scripts/gspo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gspo_vlm.py) | This script shows how to use GSPO via the [`GRPOTrainer`] to fine-tune a multimodal model for reasoning using the [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset. |
| [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py) | This script shows how to use the [`KTOTrainer`] to fine-tune a model. |
| [`examples/scripts/mpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/mpo_vlm.py) | This script shows how to use MPO via the [`DPOTrainer`] to align a model based on preferences using the [HuggingFaceH4/rlaif-v_formatted](https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted) dataset and a weighted set of loss types. |
| [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py) | This script shows how to use the [`NashMDTrainer`] to fine-tune a model. |
| [`examples/scripts/online_dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo.py) | This script shows how to use the [`OnlineDPOTrainer`] to fine-tune a model. |
| [`examples/scripts/online_dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo_vlm.py) | This script shows how to use the [`OnlineDPOTrainer`] to fine-tune a Vision Language Model. |
| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/ppo/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language. |
| [`examples/scripts/ppo/ppo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
| [`examples/scripts/prm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/prm.py) | This script shows how to use the [`PRMTrainer`] to fine-tune a Process-supervised Reward Model (PRM). |
| [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py) | This script shows how to use the [`RewardTrainer`] to train an Outcome Reward Model (ORM) on your own dataset. |
| [`examples/scripts/rloo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/rloo.py) | This script shows how to use the [`RLOOTrainer`] to fine-tune a model to improve its ability to solve math questions. |
| [`trl/scripts/sft.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a model. |
| [`examples/scripts/sft_gemma3.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_gemma3.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Gemma 3 model. |
| [`examples/scripts/sft_video_llm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_video_llm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Video Language Model. |
| [`examples/scripts/sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Vision Language Model in a chat setting. The script has only been tested with [LLaVA 1.5](https://huggingface.co/llava-hf/llava-1.5-7b-hf), [LLaVA 1.6](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf), and [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) models so users may see unexpected behaviour in other model architectures. |
| [`examples/scripts/sft_vlm_gemma3.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm_gemma3.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Gemma 3 model on vision to text tasks. |
| [`examples/scripts/sft_vlm_smol_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm_smol_vlm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a SmolVLM model. |
| [`examples/scripts/xpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/xpo.py) | This script shows how to use the [`XPOTrainer`] to fine-tune a model. |
There are also some easier-to-run Colab notebooks that you can use to get started with TRL:
| File | Description |
| --- | --- |
| [`examples/notebooks/best_of_n.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/best_of_n.ipynb) | This notebook demonstrates how to use the "Best of N" sampling strategy using TRL when fine-tuning your model with PPO. |
| [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb) | This notebook demonstrates how to reproduce the GPT2 imdb sentiment tuning example on a jupyter notebook. |
| [`examples/notebooks/gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-control.ipynb) | This notebook demonstrates how to reproduce the GPT2 sentiment control example on a jupyter notebook. |
## Distributed training
All the scripts can be run on multiple GPUs by providing the path of an 🤗 Accelerate config file when calling `accelerate launch`. To launch one of them on one or multiple GPUs, run the following command (swapping `{NUM_GPUS}` with the number of GPUs in your machine and `--all_arguments_of_the_script` with your arguments).
```shell
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```
You can also adjust the parameters of the 🤗 Accelerate config file to suit your needs (e.g. training in mixed precision).
### Distributed training with DeepSpeed
Most of the scripts can be run on multiple GPUs together with DeepSpeed ZeRO-{1,2,3} for efficient sharding of the optimizer states, gradients, and model weights. To do so, run the following command (swapping `{NUM_GPUS}` with the number of GPUs in your machine, `--all_arguments_of_the_script` with your arguments, and the `--config_file` value with the path to the DeepSpeed config file you want to use, such as `examples/accelerate_configs/deepspeed_zero1.yaml`):
```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{1,2,3}.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```

163
docs/source/experimental.md Normal file

@ -0,0 +1,163 @@
# Experimental Features
The `trl.experimental` namespace provides a minimal, clearly separated space for fast iteration on new ideas.
> [!WARNING]
> **Stability contract:** Anything under `trl.experimental` may change or be removed in *any* release (including patch versions) without prior deprecation. Do not rely on these APIs for production workloads.
## Current Experimental Features
The following modules are currently available under [`trl.experimental`](https://github.com/huggingface/trl/tree/main/trl/experimental).
This list is not exhaustive and may change at any time.
### BEMA for Reference Model
This feature implements the BEMA algorithm to update the reference model during DPO training.
```python
from trl.experimental.bema_for_ref_model import BEMACallback, DPOTrainer
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
pref_dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")
ref_model = AutoModelForCausalLM.from_pretrained("trl-internal-testing/tiny-Qwen2ForCausalLM-2.5")
bema_callback = BEMACallback(update_ref_model=True)
model = AutoModelForCausalLM.from_pretrained("trl-internal-testing/tiny-Qwen2ForCausalLM-2.5")
tokenizer = AutoTokenizer.from_pretrained("trl-internal-testing/tiny-Qwen2ForCausalLM-2.5")
tokenizer.pad_token = tokenizer.eos_token
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
train_dataset=pref_dataset,
processing_class=tokenizer,
callbacks=[bema_callback],
)
trainer.train()
```
### GFPO
This feature implements the GFPO algorithm to enforce concise reasoning in the model's output generation, as proposed in the paper [Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning](https://huggingface.co/papers/2508.09726).
To activate GFPO in [`GFPOTrainer`]:
- set `num_remains_in_group` in [`GFPOConfig`]
- define a group filter function and pass it as `group_filter_func` to [`GFPOTrainer`]. `group_filter_func` scores the `num_generations` completions in each group, and the [`GFPOTrainer`] keeps the top `num_remains_in_group` completions by score as the new group. The model is then trained on the filtered group.
```python
# train_gfpo.py
from trl.experimental.gfpo import GFPOConfig, GFPOTrainer

# Dummy group filter that scores each completion by its index in the group
class GroupFilter:
    def __call__(self, group_completions, group_rewards, **kwargs):
        group_scores = []
        for completions, rewards in zip(group_completions, group_rewards):
            scores = [float(i) for i in range(len(completions))]
            group_scores.append(scores)
        return group_scores

training_args = GFPOConfig(
    output_dir="Qwen3-0.6B-GFPO",
    per_device_train_batch_size=4,
    num_remains_in_group=2,
    bf16=True,
)
trainer = GFPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=...,
    train_dataset=...,
    args=training_args,
    group_filter_func=GroupFilter(),
)
trainer.train()
```
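The filtering step itself can be pictured as a top-k selection within each group. The sketch below is only an illustration of that idea: `keep_top_scored` is a hypothetical helper, and how the trainer actually applies the filter internally is not shown here.

```python
def keep_top_scored(completions, scores, num_remains_in_group):
    """Keep the highest-scoring completions, preserving their original order."""
    ranked = sorted(range(len(completions)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_remains_in_group])
    return [completions[i] for i in kept]


group = ["a", "bb", "ccc", "dddd"]
scores = [0.0, 1.0, 2.0, 3.0]  # an index-based dummy filter would assign these scores
filtered = keep_top_scored(group, scores, num_remains_in_group=2)
```

A filter aimed at concise reasoning would instead give higher scores to shorter completions, so that the shortest ones survive.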
### GSPO-token
In the paper [Group Sequence Policy Optimization](https://huggingface.co/papers/2507.18071), the authors propose a token-level variant of the GSPO objective, called GSPO-token. To use GSPO-token, you can use the `GRPOTrainer` class in `trl.experimental.gspo_token`.
```python
from trl.experimental.gspo_token import GRPOTrainer
from trl import GRPOConfig
training_args = GRPOConfig(
importance_sampling_level="sequence_token",
...
)
```
> [!WARNING]
> To leverage GSPO-token, the user needs to provide a per-token advantage \\( \hat{A_{i,t}} \\) for each token \\( t \\) in the sequence \\( i \\) (i.e., make \\( \hat{A_{i,t}} \\) vary with \\( t \\), which isn't the case here, since \\( \hat{A_{i,t}}=\hat{A_{i}} \\)). Otherwise, the GSPO-token gradient is equivalent to that of the original GSPO implementation.
### GRPO With Replay Buffer
This experimental trainer trains a model with GRPO but replaces groups (and their corresponding completions) whose rewards have zero standard deviation with groups with high rewards and standard deviation that were used for training in prior batches.
#### Usage
```python
import torch
from datasets import load_dataset

# The config class is assumed to be exported from the same module as the trainer
from trl.experimental.grpo_with_replay_buffer import (
    GRPOWithReplayBufferConfig,
    GRPOWithReplayBufferTrainer,
)

dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")

# Guarantee that some groups have rewards with 0 std
def custom_reward_func(completions, **kwargs):
    if torch.rand(1).item() < 0.25:
        return [0] * len(completions)  # identical rewards, so the group has 0 std
    else:
        return torch.rand(len(completions)).tolist()

training_args = GRPOWithReplayBufferConfig(
    output_dir="tiny-Qwen2-GRPO-replay-buffer",  # placeholder output directory
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    num_generations=4,
    max_completion_length=8,
    replay_buffer_size=8,
    report_to="none",
)
trainer = GRPOWithReplayBufferTrainer(
    model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
    reward_funcs=[custom_reward_func],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
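The replacement idea can be sketched on reward lists alone. This is a toy illustration, not the trainer's logic: `replace_flat_groups` is a hypothetical helper, and ranking buffered groups by standard deviation is an assumed priority rule chosen only to keep the example short.

```python
import statistics


def replace_flat_groups(group_rewards, replay_buffer):
    """Swap zero-std reward groups for buffered groups, highest priority first."""
    # Assumed priority: larger standard deviation first
    candidates = sorted(replay_buffer, key=statistics.pstdev, reverse=True)
    patched = []
    for rewards in group_rewards:
        if statistics.pstdev(rewards) == 0 and candidates:
            patched.append(candidates.pop(0))  # no learning signal, use a buffered group
        else:
            patched.append(rewards)
    return patched


batch = [[0, 0, 0, 0], [0.1, 0.9, 0.4, 0.6]]
buffer = [[0.2, 0.3, 0.2, 0.3], [0.0, 1.0, 0.0, 1.0]]
patched = replace_flat_groups(batch, buffer)
```

A group with zero standard deviation yields zero advantages for every completion, so substituting a buffered group keeps that slot of the batch useful.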
To silence the runtime notice:
```bash
export TRL_EXPERIMENTAL_SILENCE=1
```
## Promotion Path (Simple)
1. **Prototype outside the main repo:** Start development in your own fork or a separate repository to iterate quickly.
2. **Experimental inclusion:** Once it's ready for early users, move the idea into `trl.experimental.<feature>`.
3. **Improve:** Add tests, a short doc/example, and demonstrate the usage.
4. **Promote:** Once the API proves stable and there is clear interest or adoption from the community, move it into `trl.<feature>` (stable module).
## FAQ
**Why not just use branches?**
Because branches are not shipped to users; experimental code inside the package lets early adopters try things and give feedback.
**Can these APIs change or vanish without warning?**
Yes. Anything inside `trl.experimental` can change or disappear in *any* release.
**Should I use this in production?**
Only if you are fine with updating your code quickly when things change.
**Will maintainers promptly fix issues in `trl.experimental`?**
Not necessarily. The experimental module is a playground for new ideas, and maintainers may not prioritize bug fixes or feature requests there. Issues may remain unresolved until (or unless) the feature graduates to the stable API.

102
docs/source/gkd_trainer.md Normal file

@ -0,0 +1,102 @@
# Generalized Knowledge Distillation Trainer
[![model badge](https://img.shields.io/badge/All_models-GKD-blue)](https://huggingface.co/models?other=gkd,trl)
## Overview
Generalized Knowledge Distillation (GKD) was proposed in [On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes](https://huggingface.co/papers/2306.13649) by Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem.
The abstract from the paper is the following:
> Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and task-agnostic distillation for instruction-tuning.
The key aspects of GKD are:
1. It addresses the train-inference distribution mismatch in auto-regressive sequence models by training the student model on its self-generated output sequences.
2. GKD allows flexibility in choosing different divergence measures between student and teacher models via the generalized Jensen-Shannon Divergence (JSD), which can be useful when the student lacks the capacity to fully mimic the teacher.
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif) and [Lewis Tunstall](https://huggingface.co/lewtun).
## Usage tips
The [`GKDTrainer`] is a wrapper around the [`SFTTrainer`] class that additionally takes a teacher model argument. Three parameters, set via [`GKDConfig`], control its behavior:
* `lmbda`: controls the student data fraction, i.e., the proportion of on-policy student-generated outputs. When `lmbda=0.0`, the loss reduces to supervised JSD, where the student is trained on the token-level probabilities of the teacher. When `lmbda=1.0`, the loss reduces to on-policy JSD, where the student generates output sequences and receives token-specific feedback on these sequences from the teacher. For values strictly between 0 and 1, each batch is randomly assigned to one of the two modes, with on-policy generation chosen with probability `lmbda`.
* `seq_kd`: controls whether to perform sequence-level KD (which can be viewed as supervised fine-tuning on teacher-generated outputs). When `seq_kd=True` and `lmbda=0.0`, the loss reduces to supervised JSD, where the teacher generates output sequences and the student receives token-specific feedback on these sequences from the teacher.
* `beta`: controls the interpolation in the generalized Jensen-Shannon Divergence. When `beta=0.0` the loss approximates forward KL divergence, while for `beta=1.0` the loss approximates reverse KL divergence. For values strictly between 0 and 1, it interpolates between the two.
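As a rough illustration of how `lmbda` and `seq_kd` interact, the sketch below picks the source of the output sequences for one batch. The function name `choose_completion_source` and the string labels are hypothetical, and the trainer's actual control flow may differ.

```python
import random


def choose_completion_source(lmbda, seq_kd, rng=random):
    """Return which source provides the output sequences for one batch.

    Hypothetical helper: with probability `lmbda` the student generates the
    sequences (on-policy); otherwise the teacher does if `seq_kd` is set,
    and the sequences stored in the dataset are used as-is if not.
    """
    if rng.random() < lmbda:
        return "student"
    return "teacher" if seq_kd else "dataset"


rng = random.Random(0)
sources = [choose_completion_source(0.5, seq_kd=False, rng=rng) for _ in range(8)]
print(sources)  # a mix of "student" and "dataset"
```

With `lmbda=0.0` the student never generates, and with `lmbda=1.0` it always does, which matches the two limiting cases described above.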
The authors find that on-policy data (high `lmbda`) performs better and that the optimal `beta` varies depending on the task and evaluation method.
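To make the role of `beta` concrete, here is a minimal pure-Python sketch of a generalized JSD between two discrete distributions. It assumes the mixture form \\( \beta\, \mathrm{KL}(P \| M) + (1 - \beta)\, \mathrm{KL}(Q \| M) \\) with \\( M = \beta P + (1 - \beta) Q \\), where \\( P \\) is the teacher and \\( Q \\) the student. The endpoints are special-cased to the plain forward and reverse KL terms, following the description above. The function names are illustrative, not the trainer's API.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(teacher, student, beta):
    """Sketch of the beta-interpolated divergence (assumed mixture form)."""
    if beta == 0.0:
        return kl(teacher, student)  # forward KL
    if beta == 1.0:
        return kl(student, teacher)  # reverse KL
    mixture = [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, mixture) + (1.0 - beta) * kl(student, mixture)


teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
for beta in (0.0, 0.5, 1.0):
    print(beta, generalized_jsd(teacher, student, beta))
```

At `beta=0.5` this reduces to the usual symmetric Jensen-Shannon divergence, and it is zero whenever the two distributions coincide.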
> [!WARNING]
> Make sure to set `attn_implementation="flash_attention_2"` when training [Gemma models](https://huggingface.co/models?other=gemma2). Otherwise you will encounter NaNs in the logits due to the [soft capping technique](https://huggingface.co/blog/gemma2#soft-capping-and-attention-implementations) adopted by this architecture.
The basic API is as follows:
```python
from datasets import Dataset
from trl import GKDConfig, GKDTrainer
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)

NUM_DUMMY_SAMPLES = 100

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# The model to optimise
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# The teacher model to calculate the KL divergence against
teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")

train_dataset = Dataset.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "Hi, how are you?"},
                {"role": "assistant", "content": "I'm great thanks"},
            ]
        ]
        * NUM_DUMMY_SAMPLES
    }
)
eval_dataset = Dataset.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "What colour is the sky?"},
                {"role": "assistant", "content": "The sky is blue"},
            ]
        ]
        * NUM_DUMMY_SAMPLES
    }
)

training_args = GKDConfig(output_dir="gkd-model", per_device_train_batch_size=1)
trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```
### Expected dataset type
Each example in the dataset should contain a `messages` field holding a conversation, i.e. a list of dictionaries with the following keys:
* `role`: either `system`, `assistant` or `user`
* `content`: the message content
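A quick way to sanity-check examples before training is a small validator such as the one below. It is only a sketch of the format described above; `is_valid_example` is a hypothetical helper and is not part of TRL.

```python
ALLOWED_ROLES = {"system", "assistant", "user"}


def is_valid_example(example):
    """Check that `example["messages"]` is a non-empty list of role/content dicts."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(message, dict)
        and message.get("role") in ALLOWED_ROLES
        and isinstance(message.get("content"), str)
        for message in messages
    )


example = {
    "messages": [
        {"role": "user", "content": "Hi, how are you?"},
        {"role": "assistant", "content": "I'm great thanks"},
    ]
}
print(is_valid_example(example))
```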
## GKDTrainer
[[autodoc]] GKDTrainer
    - train
    - save_model
    - push_to_hub
## GKDConfig
[[autodoc]] GKDConfig

592
docs/source/grpo_trainer.md Normal file

@ -0,0 +1,592 @@
# GRPO Trainer
[![model badge](https://img.shields.io/badge/All_models-GRPO-blue)](https://huggingface.co/models?other=grpo,trl)
## Overview
TRL supports the GRPO Trainer for training language models, as described in the paper [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300) by [Zhihong Shao](https://huggingface.co/syhia), [Peiyi Wang](https://huggingface.co/peiyiwang89), [Qihao Zhu](https://huggingface.co/zqh11), Runxin Xu, [Junxiao Song](https://huggingface.co/haha-point), Mingchuan Zhang, Y. K. Li, Y. Wu, [Daya Guo](https://huggingface.co/guoday).
The abstract from the paper is the following:
> Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
This post-training method was contributed by [Quentin Gallouédec](https://huggingface.co/qgallouedec).
## Quick start
This example demonstrates how to train a model using the GRPO method. We train a [Qwen 0.5B Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with the prompts from the [UltraFeedback prompts dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model.
```python
# train_grpo.py
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")


# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]


training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_grpo.py
```
Distributed across 8 GPUs, the training takes approximately 1 day.
![GRPO curves](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo_curves.png)
## Looking deeper into the GRPO method
GRPO is an online learning algorithm, meaning it improves iteratively by using the data generated by the trained model itself during training. The intuition behind the GRPO objective is to maximize the advantage of the generated completions, while ensuring that the model remains close to the reference policy. To understand how GRPO works, it can be broken down into four main steps: **Generating completions**, **computing the advantage**, **estimating the KL divergence**, and **computing the loss**.
![GRPO visual](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo_visual.png)
### Generating completions
At each training step, we sample a batch of prompts and generate a set of \\( G \\) completions for each prompt (denoted as \\( o_i \\)).
### Computing the advantage
For each of the \\( G \\) sequences, we compute the reward using a reward model or reward function. To align with the comparative nature of reward models—typically trained on datasets of comparisons between outputs for the same question—the advantage is calculated to reflect these relative comparisons. It is normalized as follows:
$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$
This approach gives the method its name: **Group Relative Policy Optimization (GRPO)**.
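For a single group of rewards, the normalization can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the trainer's code: the population standard deviation, the small `eps` added to the denominator to avoid division by zero, and the `scale` flag are choices made for this example.

```python
import statistics


def group_relative_advantages(rewards, scale=True, eps=1e-4):
    """Center the rewards of one group and optionally divide by their std."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not scale:
        return centered
    std = statistics.pstdev(rewards)  # assumption: population std
    return [c / (std + eps) for c in centered]


# One prompt, G = 4 completions: two rewarded, two not
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal.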
> [!TIP]
> It was shown in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that scaling by \\( \text{std}(\mathbf{r}) \\) may cause a question-level difficulty bias. You can disable this scaling by setting `scale_rewards=False` in [`GRPOConfig`].
> [!TIP]
> [Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)](https://huggingface.co/papers/2508.08221) showed that calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping. You can use this scaling strategy by setting `scale_rewards="batch"` in [`GRPOConfig`].
### Estimating the KL divergence
KL divergence is estimated using the approximator introduced by [Schulman et al. (2020)](http://joschu.net/blog/kl-approx.html). The approximator is defined as follows:
$$\mathbb{D}_{\text{KL}}\left[\pi_\theta \|\pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1,
$$
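Written in terms of log-probabilities of the sampled token, the estimator above is a one-liner. The sketch below assumes `logp` and `ref_logp` are the log-probabilities of the same token under \\( \pi_\theta \\) and \\( \pi_{\text{ref}} \\). The function name is illustrative.

```python
import math


def kl_estimate(logp, ref_logp):
    """Per-token KL estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta."""
    delta = ref_logp - logp  # log of the ratio pi_ref / pi_theta
    return math.exp(delta) - delta - 1.0


print(kl_estimate(-1.0, -1.0))  # identical policies give 0.0
print(kl_estimate(-1.0, -2.0))  # differing policies give a positive value
```

The estimate is never negative, since \\( e^x - x - 1 \geq 0 \\) for every real \\( x \\).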
### Computing the loss
The objective is to maximize the advantage while ensuring that the model remains close to the reference policy. Consequently, the loss is defined as follows:
$$
\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
$$
where the first term represents the scaled advantage and the second term penalizes deviations from the reference policy through KL divergence.
> [!TIP]
> Note that compared to the original formulation in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300), we don't scale by \\( \frac{1}{|o_i|} \\) because it was shown in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that this introduces a response-level length bias. More details in [loss types](#loss-types).
> [!TIP]
> Note that compared to the original formulation in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300), we use \\( \beta = 0.0 \\) by default, meaning that the KL divergence term is not used. This choice is motivated by several recent studies (e.g., [Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model](https://huggingface.co/papers/2503.24290)) which have shown that the KL divergence term is not essential for training with GRPO. As a result, it has become common practice to exclude it (e.g. [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783), [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://huggingface.co/papers/2503.14476)). If you wish to include the KL divergence term, you can set `beta` in [`GRPOConfig`] to a non-zero value.
In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted \\( \mu \\), can be set with `num_iterations` in [`GRPOConfig`]) by leveraging the **clipped surrogate objective**:
$$
\mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
$$
where \\(\text{clip}(\cdot, 1 - \epsilon, 1 + \epsilon) \\) ensures that updates do not deviate excessively from the old policy \\( \pi_{\theta_{\text{old}}} \\) (the policy that generated the completions) by bounding the policy ratio between \\( 1 - \epsilon \\) and \\( 1 + \epsilon \\).
When \\( \mu = 1 \\) (default in TRL), the clipped surrogate objective simplifies to the original objective.
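The clipped term for a single token can be sketched as follows. The KL penalty is left out (corresponding to \\( \beta = 0 \\)), and the default `epsilon=0.2` is an assumption for this example rather than a value taken from the text above.

```python
import math


def clipped_token_objective(logp, old_logp, advantage, epsilon=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one token."""
    ratio = math.exp(logp - old_logp)
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + epsilon) * A
print(clipped_token_objective(math.log(1.5), 0.0, advantage=1.0))
# Ratio 0.5 with a negative advantage: the penalty is held at (1 - epsilon) * A
print(clipped_token_objective(math.log(0.5), 0.0, advantage=-1.0))
```

When `logp == old_logp` the ratio is exactly 1 and the function returns the advantage itself. Its gradient then matches that of the unclipped objective, which is the sense in which the two objectives coincide for \\( \mu = 1 \\).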
#### Loss Types
Several formulations of the objective have been proposed in the literature. Initially, the objective of GRPO was defined as follows:
$$
\mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} l_{i,t},
$$
where
$$
l_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right].
$$
The [DAPO paper](https://huggingface.co/papers/2503.14476) highlights the limitations of the GRPO algorithm's sample-level loss in long-CoT scenarios, where longer responses are under-penalized, leading to poorer-quality outputs. The proposed solution is a token-level normalization, which better handles longer sequences by assigning more balanced rewards to individual tokens, regardless of response length:
$$
\mathcal{L}_{\text{DAPO}}(\theta) = - \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} l_{i,t},
$$
To use this formulation, set `loss_type="dapo"` in [`GRPOConfig`].
Furthermore, it was demonstrated in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that the initial GRPO formulation introduces a response length bias. They show that while the DAPO formulation reduces this bias, it does not eliminate it completely. To fully remove this bias, they propose dividing by a constant instead of the sequence length, resulting in the following formulation:
$$
\mathcal{L}_{\text{Dr. GRPO}}(\theta) = - \frac{1}{LG} \sum_{i=1}^G \sum_{t=1}^{|o_i|} l_{i,t},
$$
Here \\( L \\) is a constant, recommended to be the maximum completion length. To use this formulation, set `loss_type="dr_grpo"` in [`GRPOConfig`].
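The three normalizations differ only in how the per-token terms \\( l_{i,t} \\) are averaged, which the following sketch makes explicit. `aggregate_loss` is a hypothetical helper, and the label `"grpo"` for the initial formulation is an assumption of this example.

```python
def aggregate_loss(token_terms, loss_type, max_completion_length=None):
    """Aggregate per-token terms l_{i,t}, given as one list per completion."""
    num_completions = len(token_terms)
    if loss_type == "grpo":
        # Average within each completion first, then across completions
        return -sum(sum(seq) / len(seq) for seq in token_terms) / num_completions
    if loss_type == "dapo":
        # Average over all tokens in the group
        total_tokens = sum(len(seq) for seq in token_terms)
        return -sum(sum(seq) for seq in token_terms) / total_tokens
    if loss_type == "dr_grpo":
        # Divide by a constant L (here the maximum completion length) times G
        return -sum(sum(seq) for seq in token_terms) / (max_completion_length * num_completions)
    raise ValueError(f"Unknown loss_type: {loss_type}")


# A short completion with useful signal and a long one without
terms = [[1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
for name in ("grpo", "dapo", "dr_grpo"):
    print(name, aggregate_loss(terms, name, max_completion_length=4))
```

In this toy group the same per-token terms give three different loss values, because each formulation weights short and long completions differently.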
## Logged metrics
While training and evaluating, we record the following metrics:
- `num_tokens`: The total number of tokens processed so far, including both prompts and completions.
- `completions/mean_length`: The average length of generated completions.
- `completions/min_length`: The minimum length of generated completions.
- `completions/max_length`: The maximum length of generated completions.
- `completions/mean_terminated_length`: The average length of generated completions that terminate with EOS.
- `completions/min_terminated_length`: The minimum length of generated completions that terminate with EOS.
- `completions/max_terminated_length`: The maximum length of generated completions that terminate with EOS.
- `completions/clipped_ratio`: The ratio of truncated (clipped) completions.
- `reward/{reward_func_name}/mean`: The average reward from a specific reward function.
- `reward/{reward_func_name}/std`: The standard deviation of the reward from a specific reward function.
- `reward`: The overall average reward after applying reward weights.
- `reward_std`: The standard deviation of rewards after applying reward weights.
  - If `scale_rewards` is `"group"` or `"none"`, this is the average of the per-group standard deviations.
  - If `scale_rewards` is `"batch"`, this is the standard deviation computed over all rewards in the batch (ignoring groups).
- `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
- `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens of masked sequences are excluded.)
- `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
- `clip_ratio/region_mean`: The ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities where the GRPO objective is clipped to stay within the trust region:
$$
\text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \qquad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}\,.
$$
A higher value means more tokens are clipped, which constrains how much the policy \\( \pi_\theta \\) can change.
- `clip_ratio/low_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
- `clip_ratio/low_min`: The minimum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the lower bound of the trust region: \\(r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}\\)
- `clip_ratio/high_mean`: The average ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\)
- `clip_ratio/high_max`: The maximum ratio of token (or sequence, if `importance_sampling_level="sequence"`) probabilities that were clipped on the upper bound of the trust region: \\(r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}\\).
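To make two of the reward-diversity metrics above concrete, the sketch below computes a per-group `reward_std` average and `frac_reward_zero_std` from rewards that are already grouped by prompt. It assumes equally sized groups and the population standard deviation. The helper name is hypothetical and the trainer's exact computation may differ.

```python
import statistics


def reward_diversity_metrics(grouped_rewards):
    """Compute group-level reward statistics from one list of rewards per prompt."""
    stds = [statistics.pstdev(group) for group in grouped_rewards]
    return {
        "reward_std": statistics.fmean(stds),  # average of the per-group stds
        "frac_reward_zero_std": sum(std == 0.0 for std in stds) / len(stds),
    }


# The first prompt has no reward diversity, the second one does
print(reward_diversity_metrics([[1.0, 1.0], [0.0, 1.0]]))
```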
## Customization
### Speed up training with vLLM-powered generation
Generation is often the main bottleneck when training with online methods. To accelerate generation, you can use [vLLM](https://github.com/vllm-project/vllm), a high-throughput, low-latency inference engine for LLMs. To enable it, first install the package with
```shell
pip install trl[vllm]
```
We support two ways of using vLLM during training: **server mode** and **colocate mode**.
> [!TIP]
> By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling).
#### 🔌 Option 1: Server mode
In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
1. **Start the vLLM server**:
```bash
trl vllm-serve --model <model_name>
```
2. **Enable server mode in your training script**:
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_mode="server",  # default value, can be omitted
)
```
> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
#### 🧩 Option 2: Colocate mode
In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_mode="colocate",
)
```
> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
>
> We provide a [HF Space](https://huggingface.co/spaces/trl-lib/recommend-vllm-memory) to help estimate the recommended GPU memory utilization based on your model configuration and experiment settings. Simply use it as follows to get `vllm_gpu_memory_utilization` recommendation:
>
> <iframe src="https://trl-lib-recommend-vllm-memory.hf.space" frameborder="0" width="850" height="450"></iframe>
>
> If the recommended value does not work in your environment, we suggest adding a small buffer (e.g., +0.05 or +0.1) to the recommended value to ensure stability.
>
> If you still get out-of-memory errors, set `vllm_enable_sleep_mode=True` so that the vLLM parameters and cache are offloaded during the optimization step. For more information, see [Reducing Memory Usage with vLLM Sleep Mode](reducing_memory_usage#vllm-sleep-mode).
> [!TIP]
> By default, GRPO uses `MASTER_ADDR=localhost` and `MASTER_PORT=12345` for vLLM, but you can override these values by setting the environment variables accordingly.
For more information, see [Speeding up training with vLLM](speeding_up_training#vllm-for-fast-generation-in-online-methods).
### GRPO at scale: train a 70B+ Model on multiple nodes
When training large models like **Qwen2.5-72B**, you need several key optimizations to make the training efficient and scalable across multiple GPUs and nodes. These include:
- **DeepSpeed ZeRO Stage 3**: ZeRO leverages data parallelism to distribute model states (weights, gradients, optimizer states) across multiple GPUs and CPUs, reducing memory and compute requirements on each device. Since large models cannot fit on a single GPU, using ZeRO Stage 3 is required for training such models. For more details, see [DeepSpeed Integration](deepspeed_integration).
- **Accelerate**: Accelerate is a library that simplifies distributed training across multiple GPUs and nodes. It provides a simple API to launch distributed training and handles the complexities of distributed training, such as data parallelism, gradient accumulation, and distributed data loading. For more details, see [Distributing Training](distributing_training).
- **vLLM**: See the previous section on how to use vLLM to speed up generation.
Below is an example SLURM script to train a 70B model with GRPO on multiple nodes. This script trains a model on 4 nodes and uses the 5th node for vLLM-powered generation.
```sh
#!/bin/bash
#SBATCH --nodes=5
#SBATCH --gres=gpu:8

# Get the list of allocated nodes
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))

# Assign the first 4 nodes for training and the 5th node for vLLM
TRAIN_NODES=$(IFS=,; echo "${NODELIST[*]:0:4}")  # Nodes 0, 1, 2, 3 for training (comma-separated for --nodelist)
VLLM_NODE="${NODELIST[4]}"  # Node 4 for vLLM

# Run training on the first 4 nodes (Group 1)
srun --nodes=4 --ntasks=4 --nodelist="$TRAIN_NODES" accelerate launch \
     --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
     --num_processes 32 \
     --num_machines 4 \
     --main_process_ip ${NODELIST[0]} \
     --machine_rank $SLURM_PROCID \
     --rdzv_backend c10d \
     train_grpo.py \
     --vllm_server_host $VLLM_NODE &  # must match the argument name parsed by train_grpo.py

# Run vLLM server on the 5th node (Group 2)
srun --nodes=1 --ntasks=1 --nodelist="$VLLM_NODE" trl vllm-serve --model Qwen/Qwen2.5-72B --tensor_parallel_size 8 &

wait
```
```python
import argparse

from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--vllm_server_host", type=str, default="", help="The server IP")
    args = parser.parse_args()

    # Example dataset from TLDR
    dataset = load_dataset("trl-lib/tldr", split="train")

    # Dummy reward function: count the number of unique characters in the completions
    def reward_num_unique_chars(completions, **kwargs):
        return [float(len(set(c))) for c in completions]

    training_args = GRPOConfig(
        output_dir="Qwen2.5-72B-GRPO",
        per_device_train_batch_size=4,
        bf16=True,
        gradient_checkpointing=True,
        use_vllm=True,
        vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
    )

    trainer = GRPOTrainer(model="Qwen/Qwen2.5-72B", args=training_args, reward_funcs=reward_num_unique_chars, train_dataset=dataset)
    trainer.train()


if __name__ == "__main__":
    main()
```
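The hostname rewriting in the script above relies on node names of the form `ip-X-X-X-X`. Pulled out into a standalone function (the name `hostname_to_ip` is hypothetical), the conversion looks like this:

```python
def hostname_to_ip(hostname):
    """Convert a node name such as 'ip-10-0-1-23' to the dotted address '10.0.1.23'."""
    return hostname.replace("ip-", "").replace("-", ".")


print(hostname_to_ip("ip-10-0-1-23"))
```

This only makes sense on clusters that follow that naming scheme. On clusters with other hostname conventions, the replacement would corrupt any name that contains a hyphen, so passing the host through unchanged is likely the safer choice there.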
### Using a custom reward function
The [`GRPOTrainer`] supports using custom reward functions instead of dense reward models. To ensure compatibility, your reward function must satisfy the following requirements:
1. **Input arguments**:
   - The function must accept the following as keyword arguments:
     - `prompts` (contains the prompts),
     - `completions` (contains the generated completions),
     - `completions_ids` (contains the tokenized completions),
     - `trainer_state` ([`~transformers.TrainerState`]): The current state of the trainer. This can be used to implement dynamic reward functions, such as curriculum learning, where the reward is adjusted based on the training progress.
     - All column names (except `prompt`) that the dataset may have. For example, if the dataset contains a column named `ground_truth`, the function will be called with `ground_truth` as a keyword argument.

     The easiest way to comply with this requirement is to use `**kwargs` in the function signature.
   - Depending on the dataset format, the input will vary:
     - For [standard format](dataset_formats#standard), `prompts` and `completions` will be lists of strings.
     - For [conversational format](dataset_formats#conversational), `prompts` and `completions` will be lists of conversations, each of which is a list of message dictionaries.
2. **Return value**: The function must return a list of floats. Each float represents the reward corresponding to a single completion.
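As an illustration of the `trainer_state` argument, the sketch below scales a length-based reward by the fraction of training completed. It assumes the standard format (completions are strings) and relies only on the `global_step` and `max_steps` attributes of `TrainerState`. The function name and the linear schedule are assumptions made for this example, and a `SimpleNamespace` stands in for the real state object.

```python
from types import SimpleNamespace


def length_reward_with_warmup(completions, trainer_state=None, **kwargs):
    """Reward longer completions, weighted by training progress in [0, 1]."""
    if trainer_state is None or not trainer_state.max_steps:
        progress = 1.0  # no schedule information available: use the full reward
    else:
        progress = min(1.0, trainer_state.global_step / trainer_state.max_steps)
    return [progress * float(len(completion)) for completion in completions]


# Stand-in for the state the trainer would pass, halfway through training
state = SimpleNamespace(global_step=50, max_steps=100)
print(length_reward_with_warmup(completions=[" blue.", " in the sky."], trainer_state=state))
```

Accepting `**kwargs` keeps the function compatible with the other keyword arguments the trainer passes, such as `prompts` and `completions_ids`.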
#### Example 1: Reward longer completions
Below is an example of a reward function for a standard format that rewards longer completions:
```python
def reward_func(completions_ids, **kwargs):
    """Reward function that assigns higher scores to longer completions (in terms of token count)."""
    return [float(len(ids)) for ids in completions_ids]
```
You can test it as follows:
```python
>>> prompts = ["The sky is", "The sun is"] # not used in the reward function, but the trainer will pass it
>>> completions = [" blue.", " in the sky."] # not used in the reward function, but the trainer will pass it
>>> completions_ids = [[6303, 13], [304, 279, 12884, 13]]
>>> reward_func(prompts=prompts, completions=completions, completions_ids=completions_ids)
[2.0, 4.0]
```
#### Example 1.1: Reward longer completions (based on the number of characters)
Same as the previous example, but this time the reward function is based on the number of characters instead of tokens.
```python
def reward_func(completions, **kwargs):
    """Reward function that assigns higher scores to longer completions (in terms of character count)."""
    return [float(len(completion)) for completion in completions]
```
You can test it as follows:
```python
>>> prompts = ["The sky is", "The sun is"]
>>> completions = [" blue.", " in the sky."]
>>> completions_ids = [[6303, 13], [304, 279, 12884, 13]] # not used in the reward function, but the trainer will pass it
>>> reward_func(prompts=prompts, completions=completions, completions_ids=completions_ids)
[6.0, 12.0]
```
#### Example 2: Reward completions with a specific format
Below is an example of a reward function that checks if the completion has a specific format. This example is inspired by the _format reward_ function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
It is designed for a conversational format, where prompts and completions consist of structured messages.
```python
import re


def format_reward_func(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
```
You can test this function as follows:
```python
>>> prompts = [
...     [{"role": "user", "content": "What is the result of (1 + 2) * 4?"}],
...     [{"role": "user", "content": "What is the result of (3 + 1) * 2?"}],
... ]
>>> completions = [
...     [{"role": "assistant", "content": "<think>The sum of 1 and 2 is 3, which we multiply by 4 to get 12.</think><answer>(1 + 2) * 4 = 12</answer>"}],
...     [{"role": "assistant", "content": "The sum of 3 and 1 is 4, which we multiply by 2 to get 8. So (3 + 1) * 2 = 8."}],
... ]
>>> format_reward_func(prompts=prompts, completions=completions)
[1.0, 0.0]
```
#### Example 3: Reward completions based on a reference
Below is an example of a reward function that checks if the completion is correct. This example is inspired by the _accuracy reward_ function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
This example is designed for [standard format](dataset_formats#standard), where the dataset contains a column named `ground_truth`.
```python
import re


def reward_func(completions, ground_truth, **kwargs):
    # Regular expression to capture content inside \boxed{}
    matches = [re.search(r"\\boxed\{(.*?)\}", completion) for completion in completions]
    contents = [match.group(1) if match else "" for match in matches]
    # Reward 1 if the content is the same as the ground truth, 0 otherwise
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
```
You can test this function as follows:
```python
>>> prompts = ["Problem: Solve the equation $2x + 3 = 7$. Solution:", "Problem: Solve the equation $3x - 5 = 10$."]
>>> completions = [r" The solution is \boxed{2}.", r" The solution is \boxed{6}."]
>>> ground_truth = ["2", "5"]
>>> reward_func(prompts=prompts, completions=completions, ground_truth=ground_truth)
[1.0, 0.0]
```
#### Example 4: Multi-task reward functions
Below is an example of using multiple reward functions in the [`GRPOTrainer`]. In this example, we define two task-specific reward functions: `math_reward_func` and `coding_reward_func`. The `math_reward_func` rewards math problems based on their correctness, while the `coding_reward_func` rewards coding problems based on whether the solution works.
```python
from datasets import Dataset
from trl import GRPOTrainer

# Define a dataset that contains both math and coding problems
dataset = Dataset.from_list(
    [
        {"prompt": "What is 2+2?", "task": "math"},
        {"prompt": "Write a function that returns the sum of two numbers.", "task": "code"},
        {"prompt": "What is 3*4?", "task": "math"},
        {"prompt": "Write a function that returns the product of two numbers.", "task": "code"},
    ]
)

# Math-specific reward function
def math_reward_func(prompts, completions, task, **kwargs):
    rewards = []
    for prompt, completion, t in zip(prompts, completions, task):
        if t == "math":
            # Calculate math-specific reward
            # (check_math_solution is a placeholder for your own checker)
            correct = check_math_solution(prompt, completion)
            reward = 1.0 if correct else -1.0
            rewards.append(reward)
        else:
            # Return None for non-math tasks
            rewards.append(None)
    return rewards

# Coding-specific reward function
def coding_reward_func(prompts, completions, task, **kwargs):
    rewards = []
    for prompt, completion, t in zip(prompts, completions, task):
        if t == "code":  # must match the "task" value used in the dataset
            # Calculate coding-specific reward
            # (test_code_solution is a placeholder for your own checker)
            works = test_code_solution(prompt, completion)
            reward = 1.0 if works else -1.0
            rewards.append(reward)
        else:
            # Return None for non-coding tasks
            rewards.append(None)
    return rewards

# Use both task-specific reward functions
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[math_reward_func, coding_reward_func],
    train_dataset=dataset,
)
trainer.train()
```
In this example, the `math_reward_func` and `coding_reward_func` are designed to work with a mixed dataset that contains both math and coding problems. The `task` column in the dataset is used to determine which reward function to apply to each problem. If there is no relevant reward function for a sample in the dataset, the reward function will return `None`, and the [`GRPOTrainer`] will continue with the valid functions and tasks. This allows the [`GRPOTrainer`] to handle multiple reward functions with different applicability.
Note that the [`GRPOTrainer`] will ignore the `None` rewards returned by the reward functions and only consider the rewards returned by the relevant functions. This ensures that the model is trained on the relevant tasks and ignores the tasks for which there is no relevant reward function.
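To make the effect of `None` rewards concrete, here is a minimal pure-Python sketch of how per-function rewards could be combined per sample while skipping `None` entries. It is an illustration under stated assumptions, not the trainer's actual implementation; `combine_rewards` and its `weights` argument are hypothetical names.

```python
def combine_rewards(rewards_per_func, weights=None):
    # rewards_per_func: one list of rewards per reward function, all of the same length
    num_funcs = len(rewards_per_func)
    weights = weights if weights is not None else [1.0] * num_funcs
    combined = []
    for sample_rewards in zip(*rewards_per_func):
        # Skip functions that returned None for this sample
        total = sum(w * r for w, r in zip(weights, sample_rewards) if r is not None)
        combined.append(total)
    return combined


# Rewards for a mixed batch: math, code, math, code
math_rewards = [1.0, None, -1.0, None]
code_rewards = [None, 1.0, None, -1.0]
print(combine_rewards([math_rewards, code_rewards]))  # [1.0, 1.0, -1.0, -1.0]
```

Each sample ends up scored only by the function that applies to its task, which is the behavior the paragraphs above describe.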
#### Passing the reward function to the trainer
To use your custom reward function, pass it to the [`GRPOTrainer`] as follows:
```python
from trl import GRPOTrainer
trainer = GRPOTrainer(
reward_funcs=reward_func,
...,
)
```
If you have multiple reward functions, you can pass them as a list:
```python
from trl import GRPOTrainer
trainer = GRPOTrainer(
reward_funcs=[reward_func1, reward_func2],
...,
)
```
The total reward is computed as the sum of the rewards from each function, or as the weighted sum if `reward_weights` is provided in the config.
Note that [`GRPOTrainer`] supports multiple reward functions of different types. See the parameters documentation for more details.
## Vision-Language Model (VLM) Training
GRPO supports training Vision-Language Models (VLMs) on multimodal datasets containing both text and images.
### Supported Models
Tested with:
- **Gemma3** — e.g., `google/gemma-3-4b-it`
- **LLaVA-NeXT** — e.g., `llava-hf/llava-v1.6-mistral-7b-hf`
- **Qwen2-VL** — e.g., `Qwen/Qwen2-VL-2B-Instruct`
- **Qwen2.5-VL** — e.g., `Qwen/Qwen2.5-VL-3B-Instruct`
- **SmolVLM2** — e.g., `HuggingFaceTB/SmolVLM2-2.2B-Instruct`
> [!TIP]
> Compatibility with all VLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes.
### Quick Start
Use [grpo\_vlm.py](https://github.com/huggingface/trl/blob/main/examples/scripts/grpo_vlm.py) to fine-tune a VLM. Example command for training on [`lmms-lab/multimodal-open-r1-8k-verified`](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified):
```bash
accelerate launch \
--config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/grpo_vlm.py \
--model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir grpo-Qwen2.5-VL-3B-Instruct \
--learning_rate 1e-5 \
--gradient_checkpointing \
--dtype bfloat16 \
--max_prompt_length 2048 \
--max_completion_length 1024 \
--use_vllm \
--vllm_mode colocate \
--use_peft \
--lora_target_modules q_proj v_proj \
--log_completions
```
### Configuration Tips
> [!WARNING]
> VLM training may fail if image tokens are truncated. We highly recommend disabling truncation by setting `max_prompt_length` to `None`.
- Use LoRA on vision-language projection layers
- Enable 4-bit quantization to reduce memory usage
- VLMs are memory-intensive — start with smaller batch sizes
- Most models are compatible with vLLM (`server` and `colocate` modes)
### Dataset Format
Each training sample should include:
- `prompt`: Text formatted via the processor's chat template
- `image`/`images`: PIL Image or list of PIL Images
The trainer automatically handles image-to-tensor conversion via the model's image processor.
## GRPOTrainer
[[autodoc]] GRPOTrainer
- train
- save_model
- push_to_hub
## GRPOConfig
[[autodoc]] GRPOConfig

135
docs/source/index.md Normal file

@ -0,0 +1,135 @@
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png">
</div>
# TRL - Transformer Reinforcement Learning
TRL is a full stack library where we provide a set of tools to train transformer language models with methods like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), Reward Modeling, and more.
The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).
Below is the current list of TRL trainers, organized by method type (⚡️ = vLLM support).
## Taxonomy
<div style="display: flex; justify-content: space-between; width: 100%; gap: 2rem;">
<div style="flex: 1; min-width: 0;">
### Online methods
- [`GRPOTrainer`] ⚡️
- [`RLOOTrainer`] ⚡️
- [`OnlineDPOTrainer`] ⚡️
- [`NashMDTrainer`] ⚡️
- [`XPOTrainer`] ⚡️
- [`PPOTrainer`]
### Reward modeling
- [`PRMTrainer`]
- [`RewardTrainer`]
</div>
<div style="flex: 1; min-width: 0;">
### Offline methods
- [`SFTTrainer`]
- [`DPOTrainer`]
- [`ORPOTrainer`]
- [`BCOTrainer`]
- [`CPOTrainer`]
- [`KTOTrainer`]
### Knowledge distillation
- [`GKDTrainer`]
</div>
</div>
## 🎉 What's New
**✨ OpenAI GPT OSS Support**: TRL now fully supports fine-tuning the latest [OpenAI GPT OSS models](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4)! Check out the:
- [OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers)
- [GPT OSS recipes](https://github.com/huggingface/gpt-oss-recipes)
- [Our example script](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_gpt_oss.py)
You can also explore TRL-related models, datasets, and demos in the [TRL Hugging Face organization](https://huggingface.co/trl-lib).
## Learn
Learn post-training with TRL and other libraries in 🤗 [smol course](https://github.com/huggingface/smol-course).
## Contents
The documentation is organized into the following sections:
- **Getting Started**: installation and quickstart guide.
- **Conceptual Guides**: dataset formats, training FAQ, and understanding logs.
- **How-to Guides**: reducing memory usage, speeding up training, distributing training, etc.
- **Integrations**: DeepSpeed, Liger Kernel, PEFT, etc.
- **Examples**: example overview, community tutorials, etc.
- **API**: trainers, utils, etc.
## Blog posts
<div class="mt-10">
<div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/trl-vlm-alignment">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/trl_vlm/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on August 7, 2025</p>
<p class="text-gray-700">Vision Language Model Alignment in TRL ⚡️</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/vllm-colocate">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/vllm-colocate/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on June 3, 2025</p>
<p class="text-gray-700">NO GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/liger-grpo">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/liger-grpo/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on May 25, 2025</p>
<p class="text-gray-700">🐯 Liger GRPO meets TRL</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/open-r1">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/open-r1/thumbnails.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on January 28, 2025</p>
<p class="text-gray-700">Open-R1: a fully open reproduction of DeepSeek-R1</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/dpo_vlm">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/dpo_vlm/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on July 10, 2024</p>
<p class="text-gray-700">Preference Optimization for Vision Language Models with TRL</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/putting_rl_back_in_rlhf_with_rloo/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on June 12, 2024</p>
<p class="text-gray-700">Putting RL back in RLHF</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/trl-ddpo">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/166_trl_ddpo/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on September 29, 2023</p>
<p class="text-gray-700">Finetune Stable Diffusion Models with DDPO via TRL</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/dpo-trl">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/157_dpo_trl/dpo_thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on August 8, 2023</p>
<p class="text-gray-700">Fine-tune Llama 2 with DPO</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/stackllama">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/138_stackllama/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on April 5, 2023</p>
<p class="text-gray-700">StackLLaMA: A hands-on guide to train LLaMA with RLHF</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/trl-peft">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/133_trl_peft/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on March 9, 2023</p>
<p class="text-gray-700">Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/rlhf">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/120_rlhf/thumbnail.png" alt="thumbnail" class="mt-0">
<p class="text-gray-500 text-sm">Published on December 9, 2022</p>
<p class="text-gray-700">Illustrating Reinforcement Learning from Human Feedback</p>
</a>
</div>
</div>


@ -1,9 +0,0 @@
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_banner_dark.png">
</div>
# TRL - Transformer Reinforcement Learning
With the TRL (Transformer Reinforcement Learning) library you can train transformer language models with reinforcement learning. The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).
TRL supports decoder models such as GPT-2, BLOOM, GPT-Neo which can all be optimized using Proximal Policy Optimization (PPO). You can find installation instructions in the [installation guide](installation) and an introduction to the library in the [Quickstart section](quickstart). There is also a more [in-depth example](sentiment_tuning) to tune GPT-2 to produce positive movie reviews.


@ -0,0 +1,42 @@
# Installation
You can install TRL either from PyPI or from source:
## PyPI
Install the library with pip or [uv](https://docs.astral.sh/uv/):
<hfoptions id="install">
<hfoption id="uv">
uv is a fast Rust-based Python package and project manager. Refer to [Installation](https://docs.astral.sh/uv/getting-started/installation/) for installation instructions.
```bash
uv pip install trl
```
</hfoption>
<hfoption id="pip">
```bash
pip install trl
```
</hfoption>
</hfoptions>
## Source
You can also install the latest version from source. First clone the repo and then run the installation with `pip`:
```bash
git clone https://github.com/huggingface/trl.git
cd trl/
pip install -e .
```
If you want the development install you can replace the pip install with the following:
```bash
pip install -e ".[dev]"
```


@ -1,24 +0,0 @@
# Installation
You can install TRL either from pypi or from source:
## pypi
Install the library with pip:
```bash
pip install trl
```
### Source
You can also install the latest version from source. First clone the repo and then run the installation with `pip`:
```bash
git clone https://github.com/lvwerra/trl.git
cd trl/
pip install -e .
```
If you want the development install you can replace the pip install with the following:
```bash
pip install -e ".[dev]"
```


@ -0,0 +1,274 @@
# Training with Jobs
[![model badge](https://img.shields.io/badge/All_models-HF_Jobs-blue)](https://huggingface.co/models?other=hf_jobs,trl)
[Hugging Face Jobs](https://huggingface.co/docs/huggingface_hub/guides/jobs) lets you run training scripts on fully managed infrastructure—no need to manage GPUs or local environment setup.
In this guide, you'll learn how to:
* Use [TRL Jobs](https://github.com/huggingface/trl-jobs) to easily run pre-optimized TRL training
* Run any TRL training script with uv scripts
For general details about Hugging Face Jobs (hardware selection, job monitoring, etc.), see the [Jobs documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs).
## Requirements
* A [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan
* Logged in to the Hugging Face Hub (`hf auth login`)
## Using TRL Jobs
[TRL Jobs](https://github.com/huggingface/trl-jobs) is a high-level wrapper around Hugging Face Jobs and TRL that streamlines training. It provides optimized default configurations so you can start quickly without manually tuning parameters.
Example:
```bash
pip install trl-jobs
trl-jobs sft --model_name Qwen/Qwen3-0.6B --dataset_name trl-lib/Capybara
```
TRL Jobs supports everything covered in this guide, with additional optimizations to simplify workflows.
## Using uv Scripts
For more control, you can run Hugging Face Jobs directly with your own scripts, using [uv scripts](https://docs.astral.sh/uv/guides/scripts/).
Create a Python script (e.g., `train.py`) containing your training code:
```python
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
train_dataset=dataset,
)
trainer.train()
trainer.push_to_hub("Qwen2.5-0.5B-SFT")
```
Launch the job using either the [`hf jobs` CLI](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) or the Python API:
<hfoptions id="script_type">
<hfoption id="bash">
```bash
hf jobs uv run \
--flavor a100-large \
--with trl \
--secrets HF_TOKEN \
train.py
```
</hfoption>
<hfoption id="python">
```python
from huggingface_hub import run_uv_job
run_uv_job(
"train.py",
dependencies=["trl"],
flavor="a100-large",
secrets={"HF_TOKEN": "hf_..."},
)
```
</hfoption>
</hfoptions>
To run successfully, the script needs:
* **TRL installed**: Use the `--with trl` flag or the `dependencies` argument. uv installs these dependencies automatically before running the script.
* **An authentication token**: Required to push the trained model (or perform other authenticated operations). Provide it with the `--secrets HF_TOKEN` flag or the `secrets` argument.
> [!WARNING]
> When training with Jobs, be sure to:
>
> * **Set a sufficient timeout**. Jobs time out after 30 minutes by default. If your job exceeds the timeout, it will fail and all progress will be lost. See [Setting a custom timeout](https://huggingface.co/docs/huggingface_hub/guides/jobs#setting-a-custom-timeout).
> * **Push the model to the Hub**. The Jobs environment is ephemeral: files are deleted when the job ends. If you don't push the model, it will be lost.
You can also run a script directly from a URL:
<hfoptions id="script_type">
<hfoption id="bash">
```bash
hf jobs uv run \
--flavor a100-large \
--with trl \
--secrets HF_TOKEN \
"https://gist.githubusercontent.com/qgallouedec/eb6a7d20bd7d56f9c440c3c8c56d2307/raw/69fd78a179e19af115e4a54a1cdedd2a6c237f2f/train.py"
```
</hfoption>
<hfoption id="python">
```python
from huggingface_hub import run_uv_job
run_uv_job(
"https://gist.githubusercontent.com/qgallouedec/eb6a7d20bd7d56f9c440c3c8c56d2307/raw/69fd78a179e19af115e4a54a1cdedd2a6c237f2f/train.py",
flavor="a100-large",
dependencies=["trl"],
secrets={"HF_TOKEN": "hf_..."},
)
```
</hfoption>
</hfoptions>
To make a script self-contained, declare dependencies at the top:
```python
# /// script
# dependencies = [
# "trl",
# "peft",
# ]
# ///
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
train_dataset=dataset,
peft_config=LoraConfig(),
)
trainer.train()
trainer.push_to_hub("Qwen2.5-0.5B-SFT")
```
You can then run the script without specifying dependencies:
<hfoptions id="script_type">
<hfoption id="bash">
```bash
hf jobs uv run \
--flavor a100-large \
--secrets HF_TOKEN \
train.py
```
</hfoption>
<hfoption id="python">
```python
from huggingface_hub import run_uv_job
run_uv_job(
"train.py",
flavor="a100-large",
secrets={"HF_TOKEN": "hf_..."},
)
```
</hfoption>
</hfoptions>
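The `# /// script` header used in the self-contained script follows the inline script metadata format (PEP 723): a comment block whose content is TOML. As a rough illustration of what a runner has to do before installing dependencies, the sketch below extracts that block using the reference regular expression from the PEP. The helper name `extract_script_metadata` is hypothetical, and this is not how uv itself is implemented.

```python
import re
from typing import Optional

# Regular expression for inline script metadata blocks, as given in PEP 723
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def extract_script_metadata(source: str) -> Optional[str]:
    # Return the TOML text of the `script` block, or None if the script has none
    for match in BLOCK_RE.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Drop the leading "# " (or a bare "#") from each comment line
            return "".join(line[2:] if line.startswith("# ") else line[1:] for line in lines)
    return None


example = '''# /// script
# dependencies = [
#     "trl",
#     "peft",
# ]
# ///
from trl import SFTTrainer
'''
toml_text = extract_script_metadata(example)
print(toml_text)
```

The returned text can then be handed to a TOML parser (for instance `tomllib` on Python 3.11 or later) to obtain the `dependencies` list.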
TRL example scripts are fully uv-compatible, so you can run a complete training workflow directly on Jobs. You can customize training with standard script arguments plus hardware and secrets:
<hfoptions id="script_type">
<hfoption id="bash">
```bash
hf jobs uv run \
--flavor a100-large \
--secrets HF_TOKEN \
https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/examples/scripts/prm.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/prm800k \
--output_dir Qwen2-0.5B-Reward \
--push_to_hub
```
</hfoption>
<hfoption id="python">
```python
from huggingface_hub import run_uv_job
run_uv_job(
"https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/examples/scripts/prm.py",
flavor="a100-large",
secrets={"HF_TOKEN": "hf_..."},
script_args=[
"--model_name_or_path", "Qwen/Qwen2-0.5B-Instruct",
"--dataset_name", "trl-lib/prm800k",
"--output_dir", "Qwen2-0.5B-Reward",
"--push_to_hub"
]
)
```
</hfoption>
</hfoptions>
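When script arguments are kept in a dictionary (for example, loaded from a config file), they need to be flattened into the list form that `script_args` expects. The following is a small sketch of such a conversion; `to_script_args` is a hypothetical helper, not part of `huggingface_hub` or TRL, and it assumes that boolean `True` values map to bare flags.

```python
def to_script_args(options):
    # Convert {"name": value} into ["--name", "value", ...]; True becomes a bare flag
    args = []
    for name, value in options.items():
        if value is True:
            args.append(f"--{name}")
        elif value is False or value is None:
            continue  # omit disabled flags and unset options
        else:
            args.extend([f"--{name}", str(value)])
    return args


script_args = to_script_args(
    {
        "model_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
        "dataset_name": "trl-lib/prm800k",
        "output_dir": "Qwen2-0.5B-Reward",
        "push_to_hub": True,
    }
)
print(script_args)
```

The resulting list matches the `script_args` value used in the Python example above and can be passed to `run_uv_job` in the same way.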
See the full list of examples in [Maintained examples](example_overview#maintained-examples).
### Docker Images
An up-to-date Docker image with all TRL dependencies is available at [huggingface/trl](https://hub.docker.com/r/huggingface/trl) and can be used directly with Hugging Face Jobs:
<hfoptions id="script_type">
<hfoption id="bash">
```bash
hf jobs uv run \
--flavor a100-large \
--secrets HF_TOKEN \
--image huggingface/trl \
train.py
```
</hfoption>
<hfoption id="python">
```python
from huggingface_hub import run_uv_job
run_uv_job(
"train.py",
flavor="a100-large",
secrets={"HF_TOKEN": "hf_..."},
image="huggingface/trl",
)
```
</hfoption>
</hfoptions>
Jobs runs on a Docker image from Hugging Face Spaces or Docker Hub, so you can also specify any custom image:
<hfoptions id="script_type">
<hfoption id="bash">
```bash
hf jobs uv run \
--flavor a100-large \
--secrets HF_TOKEN \
--image <docker-image> \
train.py
```
</hfoption>
<hfoption id="python">
```python
from huggingface_hub import run_uv_job
run_uv_job(
"train.py",
flavor="a100-large",
secrets={"HF_TOKEN": "hf_..."},
image="<docker-image>",
)
```
</hfoption>
</hfoptions>

86
docs/source/judges.md Normal file

@ -0,0 +1,86 @@
# Judges
> [!WARNING]
> TRL Judges is an experimental API which is subject to change at any time.
TRL provides judges to easily compare two completions.
Make sure to have installed the required dependencies by running:
```bash
pip install "trl[judges]"
```
## Using the provided judges
TRL provides several judges out of the box. For example, you can use the [`HfPairwiseJudge`] to compare two completions using a pre-trained model from the Hugging Face model hub:
```python
from trl import HfPairwiseJudge
judge = HfPairwiseJudge()
judge.judge(
prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
completions=[["Paris", "Lyon"], ["Saturn", "Jupiter"]],
) # Outputs: [0, 1]
```
## Define your own judge
To define your own judge, we provide several base classes that you can subclass. For rank-based judges, you need to subclass [`BaseRankJudge`] and implement the [`BaseRankJudge.judge`] method. For pairwise judges, you need to subclass [`BasePairwiseJudge`] and implement the [`BasePairwiseJudge.judge`] method. If you want to define a judge that doesn't fit into these categories, you need to subclass [`BaseJudge`] and implement the [`BaseJudge.judge`] method.
As an example, let's define a pairwise judge that prefers shorter completions:
```python
from trl import BasePairwiseJudge

class PrefersShorterJudge(BasePairwiseJudge):
    def judge(self, prompts, completions, shuffle_order=False):
        # Return the index (0 or 1) of the shorter completion in each pair
        return [0 if len(completion[0]) < len(completion[1]) else 1 for completion in completions]
```
You can then use this judge as follows:
```python
judge = PrefersShorterJudge()
judge.judge(
prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
completions=[["Paris", "The capital of France is Paris."], ["Jupiter is the biggest planet in the solar system.", "Jupiter"]],
) # Outputs: [0, 1]
```
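A common next step is to turn judge outputs into preference pairs. The sketch below assumes that each output is the index (0 or 1) of the preferred completion, as the outputs in the examples suggest. `to_preference_pairs` is a hypothetical helper written for illustration, not a TRL function.

```python
def to_preference_pairs(prompts, completions, ranks):
    # ranks[i] is the index (0 or 1) of the preferred completion for prompts[i]
    pairs = []
    for prompt, pair, winner in zip(prompts, completions, ranks):
        pairs.append({"prompt": prompt, "chosen": pair[winner], "rejected": pair[1 - winner]})
    return pairs


pairs = to_preference_pairs(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "Lyon"], ["Saturn", "Jupiter"]],
    ranks=[0, 1],
)
print(pairs[1])  # {'prompt': 'What is the biggest planet in the solar system?', 'chosen': 'Jupiter', 'rejected': 'Saturn'}
```

The `prompt`/`chosen`/`rejected` keys are chosen here to resemble a preference dataset; adapt them to whatever format your downstream trainer expects.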
## Provided judges
### PairRMJudge
[[autodoc]] PairRMJudge
### HfPairwiseJudge
[[autodoc]] HfPairwiseJudge
### OpenAIPairwiseJudge
[[autodoc]] OpenAIPairwiseJudge
### AllTrueJudge
[[autodoc]] AllTrueJudge
## Base classes
### BaseJudge
[[autodoc]] BaseJudge
### BaseBinaryJudge
[[autodoc]] BaseBinaryJudge
### BaseRankJudge
[[autodoc]] BaseRankJudge
### BasePairwiseJudge
[[autodoc]] BasePairwiseJudge


@ -0,0 +1,96 @@
# Kernels Hub Integration and Usage
<img src="https://github.com/user-attachments/assets/4b5175f3-1d60-455b-8664-43b2495ee1c3" width="450" height="450" alt="kernel-builder logo">
The [`kernels`](https://huggingface.co/blog/hello-hf-kernels#get-started-and-next-steps) library allows optimized compute kernels to be loaded directly from the Hub.
You can find `kernels` in [dedicated orgs](https://huggingface.co/kernels-community) or by searching for the [`kernel` tag](https://huggingface.co/models?other=kernel) within the Hub.
Kernels are **optimized code pieces** that help in model development, training, and inference. Here, we'll focus on their **integration with TRL**, but check out the above resources to learn more about them.
## Installation
To use kernels with TRL, you'd need to install the library in your Python environment:
```bash
pip install kernels
```
## Using Kernels from the Hub in TRL
Kernels can directly replace attention implementations, removing the need to manually compile attention backends like Flash Attention and boosting training speed just by pulling the respective attention kernel from the Hub.
You can specify a kernel when loading a model:
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"your-model-name",
attn_implementation="kernels-community/flash-attn" # other options: kernels-community/vllm-flash-attn3, kernels-community/paged-attention
)
```
Or when running a TRL training script:
```bash
python sft.py ... --attn_implementation kernels-community/flash-attn
```
Or using the TRL CLI:
```bash
trl sft ... --attn_implementation kernels-community/flash-attn
```
> [!TIP]
> Now you can leverage faster attention backends with a pre-optimized kernel for your hardware configuration from the Hub, speeding up both development and training.
## Comparing Attention Implementations
We evaluated various attention implementations available in transformers, along with different kernel backends, using **TRL** and **SFT**.
The experiments were run on a single **H100 GPU** with **CUDA 12.9**, leveraging **Qwen3-8B** with a **batch size of 8**, **gradient accumulation of 1**, and **bfloat16** precision.
Keep in mind that the results shown here are specific to this setup and may vary with different training configurations.
The following figure illustrates both **latency** (time per training step) and **peak allocated memory** for the different attention implementations and kernel backends.
Kernel-based implementations perform on par with custom-installed attention, and increasing the model's `max_length` further enhances performance. Memory consumption is similar across all implementations, showing no significant differences. We get the same performance but with less friction, as described in [the following section](#flash-attention-vs-hub-kernels).
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_latency.png" alt="Latency and Memory Usage" width="45%"/>
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_peak_allocated_memory.png" alt="Latency and Memory Usage" width="45%"/>
</div>
## Flash Attention vs. Hub Kernels
Building Flash Attention from source can be time-consuming, often taking anywhere from several minutes to hours, depending on your hardware, CUDA/PyTorch configuration, and whether precompiled wheels are available.
In contrast, **Hugging Face Kernels** provide a much faster and more reliable workflow. Developers don't need to worry about complex setups, because everything is handled automatically. In our benchmarks, kernels were ready to use in about **2.5 seconds**, with no compilation required. This allows you to start training almost instantly, significantly accelerating development. Simply specify the desired version, and `kernels` takes care of the rest.
## Combining FlashAttention Kernels with Liger Kernels
You can combine **FlashAttention kernels** with **Liger kernels** for additional TRL performance improvements.
First, install the Liger kernel dependency:
```bash
pip install liger-kernel
```
Then, combine both in your code:
```python
from transformers import AutoModelForCausalLM
from trl import SFTConfig
model = AutoModelForCausalLM.from_pretrained(
"your-model-name",
attn_implementation="kernels-community/flash-attn" # choose the desired FlashAttention variant
)
training_args = SFTConfig(
use_liger_kernel=True,
# ... other TRL training args
)
```
Learn more about the [Liger Kernel Integration](./liger_kernel_integration).

139
docs/source/kto_trainer.md Normal file

@ -0,0 +1,139 @@
# KTO Trainer
[![model badge](https://img.shields.io/badge/All_models-KTO-blue)](https://huggingface.co/models?other=kto,trl)
## Overview
Kahneman-Tversky Optimization (KTO) was introduced in [KTO: Model Alignment as Prospect Theoretic Optimization](https://huggingface.co/papers/2402.01306) by [Kawin Ethayarajh](https://huggingface.co/kawine), [Winnie Xu](https://huggingface.co/xwinxu), [Niklas Muennighoff](https://huggingface.co/Muennighoff), Dan Jurafsky, [Douwe Kiela](https://huggingface.co/douwekiela).
The abstract from the paper is the following:
> Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them being human-aware loss functions (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach Kahneman-Tversky Optimization (KTO), and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B. Crucially, KTO does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. This makes it far easier to use in the real world, where preference data is scarce and expensive.
The official code can be found in [ContextualAI/HALOs](https://github.com/ContextualAI/HALOs).
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Younes Belkada](https://huggingface.co/ybelkada), [Lewis Tunstall](https://huggingface.co/lewtun) and Pablo Vicente.
## Quick start
This example demonstrates how to train a model using the KTO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. We use the preference data from the [KTO Mix 14k](https://huggingface.co/datasets/trl-lib/kto-mix-14k). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/kto-mix-14k/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_kto.py
from datasets import load_dataset
from trl import KTOConfig, KTOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")
training_args = KTOConfig(output_dir="Qwen2-0.5B-KTO")
trainer = KTOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_kto.py
```
Distributed across 8 x H100 GPUs, the training takes approximately 30 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
![kto qwen2 reward margin](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kto-qwen2-reward-margin.png)
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-KTO) performs, you can use the [Transformers Chat CLI](https://huggingface.co/docs/transformers/quicktour#chat-with-text-generation-models).
<pre><code>$ transformers chat trl-lib/Qwen2-0.5B-KTO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-KTO&gt;:</span></strong>
The best programming language can vary depending on individual preferences, industry-specific requirements, technical skills, and familiarity with the specific use case or task. Here are some widely-used programming languages that have been noted as popular and widely used:
Here are some other factors to consider when choosing a programming language for a project:
<strong><span style="color: green;">1</span> JavaScript</strong>: JavaScript is at the heart of the web and can be used for building web applications, APIs, and interactive front-end applications like frameworks like React and Angular. It's similar to C, C++, and F# in syntax structure and is accessible and easy to learn, making it a popular choice for beginners and professionals alike.
<strong><span style="color: green;">2</span> Java</strong>: Known for its object-oriented programming (OOP) and support for Java 8 and .NET, Java is used for developing enterprise-level software applications, high-performance games, as well as mobile apps, game development, and desktop applications.
<strong><span style="color: green;">3</span> C++</strong>: Known for its flexibility and scalability, C++ offers comprehensive object-oriented programming and is a popular choice for high-performance computing and other technical fields. It's a powerful platform for building real-world applications and games at scale.
<strong><span style="color: green;">4</span> Python</strong>: Developed by Guido van Rossum in 1991, Python is a high-level, interpreted, and dynamically typed language known for its simplicity, readability, and versatility.
</code></pre>
## Expected dataset format
KTO requires an [unpaired preference dataset](dataset_formats#unpaired-preference). Alternatively, you can provide a *paired* preference dataset (also known simply as a *preference dataset*). In this case, the trainer will automatically convert it to an unpaired format by separating the chosen and rejected responses, assigning `label = True` to the chosen completions and `label = False` to the rejected ones.
The [`KTOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
In theory, the dataset should contain at least one chosen and one rejected completion. However, some users have successfully run KTO using *only* chosen or only rejected data. If using only rejected data, it is advisable to adopt a conservative learning rate.
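The paired-to-unpaired conversion described above can be sketched in plain Python. This only illustrates the idea and is not the trainer's actual implementation; the helper name `unpair_examples` and the toy row are assumptions made for the example.

```python
def unpair_examples(paired_rows):
    """Split each paired preference row into two unpaired rows (hypothetical helper)."""
    unpaired = []
    for row in paired_rows:
        # The chosen completion becomes a desirable example.
        unpaired.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        # The rejected completion becomes an undesirable example.
        unpaired.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return unpaired


# Toy paired row, made up for illustration.
paired = [{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}]
for example in unpair_examples(paired):
    print(example)
```

Each paired row yields two unpaired rows that share the same prompt, so the converted dataset is twice as long as the paired one.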
## Example script
We provide an example script to train a model using the KTO method. The script is available in [`trl/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/kto.py).
To test the KTO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [KTO Mix 14k dataset](https://huggingface.co/datasets/trl-lib/kto-mix-14k), run the following command:
```bash
accelerate launch trl/scripts/kto.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/kto-mix-14k \
--num_train_epochs 1 \
--output_dir Qwen2-0.5B-KTO
```
## Usage tips
### For Mixture of Experts Models: Enabling the auxiliary loss
MoEs are most efficient when the load is distributed roughly equally between experts.
To ensure that MoEs are trained similarly during preference tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
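As a rough illustration of what the coefficient does, the sketch below combines a task loss with a load-balancing loss as a weighted sum. The function name and the numbers are assumptions for the example; the exact combination is handled inside the model implementation.

```python
def combine_losses(task_loss: float, aux_loss: float, router_aux_loss_coef: float = 0.001) -> float:
    """Add the router's load-balancing loss to the task loss, scaled by the coefficient."""
    return task_loss + router_aux_loss_coef * aux_loss


# Made-up loss values: a larger coefficient makes load balancing count for more.
for coef in (0.001, 0.01, 0.1):
    print(coef, combine_losses(task_loss=0.7, aux_loss=2.0, router_aux_loss_coef=coef))
```

With the default coefficient, the auxiliary term only nudges the total loss; raising it trades some task performance for a more even expert load.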
### Batch size recommendations
Use a per-step batch size that is at least 4, and an effective batch size between 16 and 128. Even if your effective batch size is large, if your per-step batch size is small, then the KL estimate in KTO will be poor.
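The relationship between the two batch sizes can be checked with a small helper. This is a sketch that assumes the per-step batch size corresponds to the per-device batch size and that the effective batch size is the product of per-device batch size, gradient accumulation steps, and number of devices; the function names are hypothetical.

```python
def effective_batch_size(per_device_batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def check_kto_batch_sizes(per_device_batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1) -> list[str]:
    """Return warnings for settings outside the ranges recommended above."""
    warnings = []
    if per_device_batch_size < 4:
        warnings.append("per-step batch size is below 4; the KL estimate may be poor")
    total = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    if not 16 <= total <= 128:
        warnings.append(f"effective batch size {total} is outside the 16-128 range")
    return warnings


# 8 devices, 4 examples per device, 2 accumulation steps.
print(effective_batch_size(4, 2, num_devices=8))
# A large effective batch size does not compensate for a per-step batch size of 1.
print(check_kto_batch_sizes(1, 64))
```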
### Learning rate recommendations
Each choice of `beta` has a maximum learning rate it can tolerate before learning performance degrades. For the default setting of `beta = 0.1`, the learning rate should typically not exceed `1e-6` for most models. As `beta` decreases, the learning rate should also be reduced accordingly. In general, we strongly recommend keeping the learning rate between `5e-7` and `5e-6`. Even with small datasets, we advise against using a learning rate outside this range. Instead, opt for more epochs to achieve better results.
### Imbalanced data
The `desirable_weight` and `undesirable_weight` of the [`KTOConfig`] refer to the weights placed on the losses for desirable/positive and undesirable/negative examples.
By default, they are both 1. However, if you have more of one or the other, then you should upweight the less common type such that the ratio of (`desirable_weight` \\(\times\\) number of positives) to (`undesirable_weight` \\(\times\\) number of negatives) is in the range 1:1 to 4:3.
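One way to pick the weights is sketched below. It keeps the weight of the more common class at 1 and upweights the other so that the weighted ratio lands on a chosen target inside the recommended range. The helper name and the default target of 1:1 are assumptions, not part of TRL.

```python
def balance_kto_weights(num_desirable: int, num_undesirable: int, target_ratio: float = 1.0) -> tuple[float, float]:
    """Return (desirable_weight, undesirable_weight) for a target weighted ratio in [1, 4/3]."""
    if not 1.0 <= target_ratio <= 4 / 3:
        raise ValueError("target_ratio must lie between 1 and 4/3")
    current = num_desirable / num_undesirable
    if 1.0 <= current <= 4 / 3:
        # Already inside the recommended range: keep the defaults.
        return 1.0, 1.0
    if current < 1.0:
        # Too few desirable examples: upweight them.
        return target_ratio / current, 1.0
    # Too many desirable examples: upweight the undesirable ones.
    return 1.0, current / target_ratio


desirable_weight, undesirable_weight = balance_kto_weights(1000, 4000)
print(desirable_weight, undesirable_weight)
```

The returned values can then be passed as `desirable_weight` and `undesirable_weight` to [`KTOConfig`].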
## Logged metrics
While training and evaluating, we record the following reward metrics:
- `rewards/chosen_sum`: the sum of log probabilities of the policy model for the chosen responses scaled by beta
- `rewards/rejected_sum`: the sum of log probabilities of the policy model for the rejected responses scaled by beta
- `logps/chosen_sum`: the sum of log probabilities of the chosen completions
- `logps/rejected_sum`: the sum of log probabilities of the rejected completions
- `logits/chosen_sum`: the sum of logits of the chosen completions
- `logits/rejected_sum`: the sum of logits of the rejected completions
- `count/chosen`: the count of chosen samples in a batch
- `count/rejected`: the count of rejected samples in a batch
## KTOTrainer
[[autodoc]] KTOTrainer
- train
- save_model
- push_to_hub
## KTOConfig
[[autodoc]] KTOConfig

View File

@ -0,0 +1,29 @@
# Liger Kernel Integration
> [!WARNING]
> Section under construction. Feel free to contribute!
[Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%. That way, we can **4x** our context length, as described in the benchmark below. They have implemented Hugging Face compatible `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, with more to come. The kernel works out of the box with [FlashAttention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed).
With this memory reduction, you can potentially turn off `cpu_offloading` or gradient checkpointing to further boost the performance.
| Speed Up | Memory Reduction |
| --- | --- |
| ![Speed up](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-tps.png) | ![Memory](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-memory.png) |
1. To use Liger-Kernel in [`SFTTrainer`], first install it:
```bash
pip install liger-kernel
```
2. Once installed, set `use_liger_kernel` in [`SFTConfig`]. No other changes are needed!
```python
from trl import SFTConfig

training_args = SFTConfig(
    use_liger_kernel=True,
    # ... other training arguments
)
```
To learn more about Liger-Kernel, visit their [official repository](https://github.com/linkedin/Liger-Kernel/).

View File

@ -0,0 +1,442 @@
# LoRA Without Regret
Recent research from the team at [Thinking Machines Lab](https://thinkingmachines.ai/blog/lora/) (Schulman et al., 2025) shows that **LoRA can match full fine-tuning performance** when configured correctly, while using only ~67% of the compute. These findings are exciting to TRL users because they're straightforward to implement and can improve model performance on smaller budgets.
This guide provides simple instructions to reproduce the results of the blog post in TRL.
> [!TIP]
> It is recommended to read the blog post before following this guide, or to consult both resources in parallel for best results.
## Benefits of LoRA over full fine-tuning
First of all, let's remind ourselves of the benefits of [LoRA over full fine-tuning](https://huggingface.co/docs/trl/en/peft_integration).
LoRA adds adapter layers on top of the base model, which contain significantly fewer parameters than the base model itself. This design reduces GPU memory requirements and enables more efficient training. As described in the [blog](https://thinkingmachines.ai/blog/lora/), this approach was originally thought to involve a performance trade-off, although careful configuration can overcome this trade-off and match full fine-tuning performance.
## Examples with TRL
Let's implement and train LoRA adapters in TRL scripts based on the core findings of the blog post. Afterwards, we'll revisit each finding in light of the TRL results.
### Supervised Fine-Tuning (SFT)
The blog post performs SFT on a range of models and datasets from the Hub, which we can reproduce in TRL.
| Model | Dataset |
| --- | --- |
| [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B) | [allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) |
| [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B) | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B) | [allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) |
| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B) | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
<hfoptions id="sft">
<hfoption id="python">
We can integrate these findings with the TRL Python API like so:
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("open-thoughts/OpenThoughts-114k", split="train")

peft_config = LoraConfig(r=256, lora_alpha=16, target_modules="all-linear")

training_args = SFTConfig(
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    report_to=["trackio"],
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
)

trainer.train()
```
</hfoption>
<hfoption id="jobs">
```bash
hf jobs uv run \
--flavor a100-large \
--timeout 8h \
--secrets HF_TOKEN \
"https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
--model_name_or_path Qwen/Qwen2.5-3B-Instruct \
--dataset_name open-thoughts/OpenThoughts-114k \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--use_peft \
--lora_r 256 \
--lora_alpha 16 \
--lora_target_modules all-linear \
--output_dir Qwen2.5-3B-OpenThoughts-LoRA \
--report_to trackio \
--push_to_hub
```
To use Hugging Face Jobs, you will need to be logged in to the Hugging Face Hub (`hf auth login`) and have a [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan. Check out the [Jobs documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs) for more details.
</hfoption>
<hfoption id="local">
```bash
uv run "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
--model_name_or_path Qwen/Qwen2.5-3B-Instruct \
--dataset_name open-thoughts/OpenThoughts-114k \
--learning_rate 2.0e-5 \
--num_train_epochs 1 \
--packing \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--gradient_checkpointing \
--eval_strategy no \
--use_peft \
--lora_r 256 \
--lora_alpha 16 \
--lora_target_modules all-linear \
--output_dir Qwen2.5-3B-OpenThoughts-LoRA \
--report_to trackio \
--push_to_hub
```
To run the script locally, you will need to have `uv` installed. Check out the [uv documentation](https://docs.astral.sh/uv/) for more details.
</hfoption>
</hfoptions>
Once training starts, you can monitor the progress in [Trackio](https://huggingface.co/trackio), which will log the URL.
### Reinforcement Learning (GRPO)
The blog post performs GRPO on a range of models and datasets from the Hub, and once again we can reproduce the results in TRL.
| Model | Dataset |
| --- | --- |
| [Llama-3.1-8B-Base](https://huggingface.co/meta-llama/Llama-3.1-8B) | [GSM8k](https://huggingface.co/datasets/openai/gsm8k) |
| [Llama-3.1-8B-Base](https://huggingface.co/meta-llama/Llama-3.1-8B) | [DeepMath-103K](https://huggingface.co/datasets/zwhe99/DeepMath-103K) |
| [Qwen3-8b-base](https://huggingface.co/Qwen/Qwen3-8b-base) | [DeepMath-103K](https://huggingface.co/datasets/zwhe99/DeepMath-103K) |
For reinforcement learning, the blog uses a math reasoning task that we can reproduce as a Python function.
<details>
<summary>Reward function</summary>
```python
from typing import Optional

from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify


def strip_reasoning_accuracy_reward(
    completions: list[list[dict[str, str]]], solution: list[str], **kwargs
) -> list[Optional[float]]:
    """Reward function that strips reasoning tags and checks mathematical accuracy.

    This function:
    1. Extracts the content from completions
    2. Removes <think>...</think> blocks (for reasoning that shouldn't be evaluated)
    3. Parses both the gold solution and the predicted answer
    4. Uses math_verify to check if they are mathematically equivalent

    Args:
        completions: List of model completions, each containing a list of messages
        solution: List of ground truth solutions
        **kwargs: Additional arguments (ignored but required for trainer compatibility)

    Returns:
        List of rewards where:
        - 1.0 if the answer is correct
        - 0.0 if the answer is incorrect
        - None if the solution is not parseable (skips this example)
    """
    contents = [completion[0]["content"] for completion in completions]
    rewards = []

    for content, sol in zip(contents, solution):
        # Strip reasoning tags from completion
        while "<think>" in content and "</think>" in content:
            start = content.find("<think>")
            end = content.find("</think>", start)
            if start != -1 and end != -1:
                content = content[:start] + content[end + len("</think>") :]
            else:
                break

        # Parse gold solution
        gold_parsed = parse(
            f"${sol}$",
            extraction_config=[
                LatexExtractionConfig(
                    boxed_match_priority=0, try_extract_without_anchor=True
                )
            ],
        )

        if len(gold_parsed) != 0:
            # We require the answer to be provided in correct latex (no malformed operators)
            answer_parsed = parse(
                content,
                extraction_config=[
                    LatexExtractionConfig(
                        boxed_match_priority=0,
                        normalization_config=NormalizationConfig(
                            basic_latex=True,
                            units=True,
                            malformed_operators=False,
                            nits=False,
                            boxed=True,
                        ),
                        try_extract_without_anchor=False,
                    )
                ],
                extraction_mode="first_match",
            )

            # Compute binary rewards if verifiable, `None` otherwise to skip this example
            try:
                reward = float(verify(gold_parsed, answer_parsed))
            except Exception as e:
                print(
                    f"verify failed: {e}, answer: {answer_parsed}, gold: {gold_parsed}"
                )
                reward = None
        else:
            # If the gold solution is not parseable, we assign `None` to skip this example
            reward = None

        rewards.append(reward)

    return rewards
```
</details>
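The tag-stripping step of the reward function can be tried on its own without any math-parsing dependencies. The sketch below isolates that loop in a standalone helper; the name `strip_think_blocks` and the sample strings are assumptions for illustration.

```python
def strip_think_blocks(text: str) -> str:
    """Remove every <think>...</think> block, including the reasoning inside it."""
    open_tag, close_tag = "<think>", "</think>"
    while open_tag in text and close_tag in text:
        start = text.find(open_tag)
        end = text.find(close_tag, start)
        if end == -1:
            # The only closing tag sits before the opening tag: nothing left to strip.
            break
        text = text[:start] + text[end + len(close_tag):]
    return text


print(strip_think_blocks("<think>2 + 2 = 4</think>The answer is 4."))
```

Only the text outside the reasoning blocks is kept, so the answer parser never sees intermediate results that appear in the model's reasoning.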
<hfoptions id="grpo">
<hfoption id="python">
We can implement these recommendations with the TRL Python API like so:
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("HuggingFaceH4/OpenR1-Math-220k-default-verified", split="train")


def strip_reasoning_accuracy_reward(completions, **kwargs):
    """Reward function that strips reasoning tags and scores the accuracy of the model outputs."""
    ...


peft_config = LoraConfig(
    r=1,
    lora_alpha=32,
    target_modules="all-linear",
)

training_args = GRPOConfig(
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    num_generations=8,
    generation_batch_size=8,
    report_to=["trackio"],
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=strip_reasoning_accuracy_reward,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)

trainer.train()
```
> [!WARNING]
> This snippet skips the reward function which is defined above to keep the example concise.
</hfoption>
<hfoption id="jobs">
```bash
hf jobs uv run \
--flavor a100-large \
--timeout 4h \
--secrets HF_TOKEN \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
"https://huggingface.co/datasets/burtenshaw/lora-without-regrets/resolve/main/grpo.py" \
--model_name_or_path Qwen/Qwen3-0.6B \
--dataset_name HuggingFaceH4/OpenR1-Math-220k-default-verified \
--output_dir grpo-full-qwen3-0.6b \
--learning_rate 1.0e-6 \
--lr_scheduler_type cosine \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--beta 0.0 \
--max_prompt_length 1024 \
--max_completion_length 4096 \
--num_generations 16 \
--generation_batch_size 16 \
--gradient_accumulation_steps 8 \
--per_device_train_batch_size 1 \
--num_train_epochs 1 \
--lora_r 1 \
--lora_alpha 32 \
--lora_dropout 0.0 \
--lora_target_modules all-linear \
--vllm_mode colocate \
--save_strategy steps \
--save_steps 50 \
--save_total_limit 1 \
--logging_steps 1 \
--max_steps 200 \
--report_to trackio
```
To use Hugging Face Jobs, you will need to be logged in to the Hugging Face Hub (`hf auth login`) and have a [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan. Check out the [Jobs documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs) for more details.
</hfoption>
<hfoption id="local">
```bash
uv run "https://huggingface.co/datasets/burtenshaw/lora-without-regrets/resolve/main/grpo.py" \
--model_name_or_path Qwen/Qwen3-0.6B \
--dataset_name HuggingFaceH4/OpenR1-Math-220k-default-verified \
--output_dir grpo-full-qwen3-0.6b \
--learning_rate 1.0e-6 \
--lr_scheduler_type cosine \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--beta 0.0 \
--max_prompt_length 1024 \
--max_completion_length 4096 \
--num_generations 16 \
--generation_batch_size 16 \
--gradient_accumulation_steps 8 \
--per_device_train_batch_size 1 \
--num_train_epochs 1 \
--lora_r 1 \
--lora_alpha 32 \
--lora_dropout 0.0 \
--lora_target_modules all-linear \
--vllm_mode colocate \
--save_strategy steps \
--save_steps 50 \
--save_total_limit 1 \
--logging_steps 1 \
--max_steps 200 \
--report_to trackio
```
To run the script locally, you will need to have `uv` installed. Check out the [uv documentation](https://docs.astral.sh/uv/) for more details.
</hfoption>
</hfoptions>
GRPO training is implemented as a custom script built on TRL, which uses the reward function shown above and follows the LoRA best practices from the blog post. You can review it at [`grpo.py`](https://huggingface.co/datasets/burtenshaw/lora-without-regrets/blob/main/grpo.py).
## Key findings in optimizing LoRA
The authors recommend applying LoRA to all weight matrices rather than limiting it to attention layers, as increasing the rank does not compensate for this restriction. In TRL, this can be configured using `--lora_target_modules all-linear` to apply LoRA to all weight matrices.
We were able to reproduce the results of the blog post using TRL and the SmolLM3 model. We trained the model for 500 steps on the [Math 220k dataset](https://huggingface.co/datasets/HuggingFaceH4/OpenR1-Math-220k-default-verified) with the reward function and configuration above. As you can see in the figure below, the LoRA model's average train reward curve matches the full fine-tuning curve.
![train reward](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lora_without_regret/5.png)
And most importantly, the LoRA model uses significantly less memory than the full fine-tuning model, as we can see in the figure below.
![memory usage](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lora_without_regret/6.png)
Here are the parameters we used to train the above models:
| Parameter | LoRA | Full FT |
| --- | --- | --- |
| `--model_name_or_path` | HuggingFaceTB/SmolLM3-3B | HuggingFaceTB/SmolLM3-3B |
| `--dataset_name` | HuggingFaceH4/OpenR1-Math-220k-default-verified | HuggingFaceH4/OpenR1-Math-220k-default-verified |
| `--learning_rate` | 1.0e-5 | 1.0e-6 |
| `--max_prompt_length` | 1024 | 1024 |
| `--max_completion_length` | 4096 | 4096 |
| `--lora_r` | 1 | - |
| `--lora_alpha` | 32 | - |
| `--lora_dropout` | 0.0 | - |
| `--lora_target_modules` | all-linear | - |
Let's break down the key findings of the blog post and how we were able to reproduce them.
### 1. *LoRA performs better when applied to all weight matrices*
The authors recommend applying LoRA to all weight matrices rather than limiting it to attention layers, as increasing the rank does not compensate for this restriction.
![all layers](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lora_without_regret/1.png)
Attention-only LoRA underperforms even when using a higher rank to match parameter count. In TRL, this can be configured using `--lora_target_modules all-linear` to apply LoRA to all weight matrices. In Python, we can do this like so:
```python
from peft import LoraConfig
peft_config = LoraConfig(target_modules="all-linear")
```
### 2. *The adapter needs sufficient capacity to learn from the dataset*
The blog post recommends using a LoRA rank large enough to learn from the dataset. The rank determines the number of trainable parameters in the LoRA adapter, and therefore its capacity. As the blog post puts it, "For datasets that exceed LoRA capacity, LoRA underperforms FullFT".
![learning rate](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lora_without_regret/3.png)
In the TRL script, use `--lora_r` to set the rank, and adapt it to the task and dataset you're training on.
Reinforcement learning tasks typically require lower capacity, so smaller LoRA ranks can be used. This is because policy gradient algorithms extract roughly 1 bit of information per episode, demanding minimal parameter capacity.
The blog post refers to the dataset size at which LoRA can match full fine-tuning as "post-training scale", which we can use to determine the recommended rank for SFT and RL LoRAs:
| Task Type | Dataset Size | Recommended Rank |
| --- | --- | --- |
| **SFT** | Post-training scale | 256 |
| **RL** | Any size | 1-32 |
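To get a feel for how the rank translates into capacity, the sketch below counts the trainable parameters a LoRA adapter adds to a single linear layer. It assumes the standard LoRA factorization into two matrices of shapes `rank x d_in` and `d_out x rank`; the layer size of 4096 is an illustrative choice, not a value from the blog post.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of the full weight matrix that the adapter sits on."""
    return d_in * d_out


d_in = d_out = 4096  # illustrative layer size
for rank in (1, 32, 256):
    adapter = lora_param_count(d_in, d_out, rank)
    share = adapter / full_param_count(d_in, d_out)
    print(f"rank={rank}: {adapter} trainable parameters ({share:.2%} of the full matrix)")
```

The adapter size grows linearly with the rank, which is why RL runs at very low ranks stay cheap while SFT at rank 256 trains a noticeably larger share of the layer.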
### 3. *"FullFT and high-rank LoRAs have similar learning curves"*
Counterintuitively, the blog post recommends using a higher learning rate for LoRA than for full fine-tuning, roughly 10x higher. In the table above, we used 1.0e-5 for LoRA and 1.0e-6 for full fine-tuning. In the TRL script, use `--learning_rate` to set the learning rate. The \\( \frac{1}{r} \\) scaling in LoRA makes the optimal learning rate approximately rank-independent.
![learning rate](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lora_without_regret/2.png)
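The scaling mentioned in this section can be made concrete with a few numbers. In the standard LoRA formulation the adapted weight is \\( W' = W + \frac{\alpha}{r} BA \\), so the adapter output is multiplied by `lora_alpha / r`. The sketch below assumes that default formulation (not the rank-stabilized variant), and the helper name is hypothetical.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the adapter output B(Ax) in standard LoRA."""
    return lora_alpha / r


# With a fixed lora_alpha, the factor shrinks as the rank grows.
for r in (1, 16, 256):
    print(f"r={r}: scaling={lora_scaling(32, r)}")
```

Because the factor shrinks as the rank grows, the same learning rate can be reused across ranks instead of being retuned for each one.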
### 4. *"In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning."*
The blog post recommends using an effective batch size < 32 because the authors found LoRA to be less tolerant of large batch sizes. This could not be mitigated by increasing the LoRA rank. In the TRL script, we could use `--per_device_train_batch_size` and `--gradient_accumulation_steps` to set the batch size.
![learning rate](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lora_without_regret/4.png)
## Takeaways
Using TRL, you can efficiently implement LoRA adapters to match full fine-tuning performance, applying the core insights (targeting all weight matrices, choosing the right rank, and managing batch size and learning rate) without the heavy compute cost of FullFT.
## Citation
```bibtex
@article{schulman2025lora,
title = {{LoRA Without Regret}},
author = {John Schulman and Thinking Machines Lab},
year = 2025,
journal = {Thinking Machines Lab: Connectionism},
doi = {10.64434/tml.20250929},
note = {https://thinkingmachines.ai/blog/lora/}
}
```

View File

@ -0,0 +1,9 @@
# Model Utilities
## clone_chat_template
[[autodoc]] clone_chat_template
## get_act_offloading_ctx_manager
[[autodoc]] models.get_act_offloading_ctx_manager

View File

@ -8,7 +8,6 @@ With the `AutoModelForCausalLMWithValueHead` class TRL supports all decoder mode
## AutoModelForCausalLMWithValueHead
[[autodoc]] AutoModelForCausalLMWithValueHead
- __init__
- forward
@ -25,4 +24,4 @@ With the `AutoModelForCausalLMWithValueHead` class TRL supports all decoder mode
## create_reference_model
[[autodoc]] create_reference_model
[[autodoc]] create_reference_model

View File

@ -0,0 +1,102 @@
# Multi Adapter RL (MARL) - a single base model for everything
Here we present an approach that uses a single base model for the entire PPO algorithm, which includes retrieving the reference logits, computing the active logits, and computing the rewards. This feature is experimental, as we did not test the convergence of the approach. We encourage the community to let us know if they run into issues.
## Requirements
You just need to install `peft`, and optionally `bitsandbytes` if you want to use 8-bit base models for more memory-efficient fine-tuning.
## Summary
You need to address this approach in three stages that we summarize as follows:
1- Train a base model on the target domain (e.g. [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb)) - this is the Supervised Fine Tuning stage - it can leverage the `SFTTrainer` from TRL.
2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py).
3- Fine tune new adapters on the base model using PPO and the reward adapter. ("0 abstraction RL")
Make sure to use the same model (i.e. same architecture and same weights) for the stages 2 & 3.
## Quickstart
Let us assume you have trained your reward adapter on `llama-7b` model using `RewardTrainer` and pushed the weights on the hub under `trl-lib/llama-7b-hh-rm-adapter`.
When doing PPO, before passing the model to `PPOTrainer` create your model as follows:
```python
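from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# PPO adapter
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load the base model with the PPO adapter and attach the reward adapter
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
)

...
trainer = PPOTrainer(
    model=model,
    # ... other PPOTrainer arguments
)

...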
```
Then inside your PPO training loop, call the `compute_reward_score` method by accessing the `model` attribute from `PPOTrainer`.
```python
rewards = trainer.model.compute_reward_score(**inputs)
```
## Advanced usage
### Control on the adapter name
If you are familiar with the `peft` library, you know that you can use multiple adapters inside the same model. For example, you can train multiple adapters on the same base model to fine-tune different policies.
In this case, you need to control which adapter is re-activated after the reward is retrieved. To do so, pass the appropriate adapter name to the `ppo_adapter_name` argument when calling `compute_reward_score`.
```python
adapter_name_policy_1 = "policy_1"
rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)
...
```
### Using 4-bit and 8-bit base models
For more memory efficient fine-tuning, you can load your base model in 8-bit or 4-bit while keeping the adapters in the default precision (float32).
Pass a `BitsAndBytesConfig` with `load_in_8bit=True` or `load_in_4bit=True` as the `quantization_config` argument of `AutoModelForCausalLMWithValueHead.from_pretrained`, as follows (assuming you have installed `bitsandbytes`):
```python
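from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# PPO adapter
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load the base model in 8-bit; the adapters stay in the default precision
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

...
trainer = PPOTrainer(
    model=model,
    # ... other PPOTrainer arguments
)

...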
```

View File

@ -0,0 +1,158 @@
# Nash-MD Trainer
[![model badge](https://img.shields.io/badge/All_models-Nash--MD-blue)](https://huggingface.co/models?other=nash-md,trl)
## Overview
Nash-MD was proposed in the paper [Nash Learning from Human Feedback](https://huggingface.co/papers/2312.00886) by Rémi Munos, [Michal Valko](https://huggingface.co/misovalko), Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mésnard, and Andrea Michi.
The abstract from the paper is the following:
> Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Daniil Tiapkin](https://huggingface.co/dtiapkin), [Pierre Ménard](https://huggingface.co/menardprr), Daniele Calandriello, and [Quentin Gallouédec](https://huggingface.co/qgallouedec).
## Quick start
This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_nash_md.py
from datasets import load_dataset
from trl import NashMDConfig, NashMDTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
training_args = NashMDConfig(output_dir="Qwen2-0.5B-NashMD")
trainer = NashMDTrainer(
model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_nash_md.py
```
Distributed across 8 GPUs, the training takes approximately 3 hours.
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-NashMD) performs, you can use the [Transformers Chat CLI](https://huggingface.co/docs/transformers/quicktour#chat-with-text-generation-models).
<pre><code>$ transformers chat trl-lib/Qwen2-0.5B-NashMD
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-NashMD&gt;:</span></strong>
The best programming language depends on personal preference, the complexity of the project, and the specific requirements of the task. Some programming languages that are often recommended include Python, Java, and JavaScript, and there are many other languages to choose from depending on individual needs.
</code></pre>
## Expected dataset type
Nash-MD requires a [prompt-only dataset](dataset_formats#prompt-only). The [`NashMDTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
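As a rough illustration of the two supported formats, a prompt-only example might look like the following. The field layout follows the linked dataset-format pages; the sample texts and the `is_conversational_prompt` helper are made up for illustration and are not part of TRL's API.

```python
# Standard format: the prompt is a plain string.
standard_example = {"prompt": "The sky is"}

# Conversational format: the prompt is a list of chat messages.
conversational_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}]
}


def is_conversational_prompt(example):
    """Hypothetical helper: True if the prompt is a list of role/content messages."""
    prompt = example["prompt"]
    return isinstance(prompt, list) and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in prompt
    )


print(is_conversational_prompt(standard_example))        # False
print(is_conversational_prompt(conversational_example))  # True
```

Only examples of the second kind need a chat template applied before generation.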
## Usage tips
### Use a reward model
Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
```diff
- from trl import PairRMJudge
+ from transformers import AutoModelForSequenceClassification
- judge = PairRMJudge()
+ reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
trainer = NashMDTrainer(
...
- judge=judge,
+ reward_funcs=reward_model,
)
```
> [!WARNING]
> Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training.
### Encourage EOS token generation
We may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`NashMDConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`NashMDConfig`]:
```python
training_args = NashMDConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
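Conceptually, the penalty is subtracted from the score of every completion that does not contain an EOS token. The following is a minimal sketch of that idea in plain Python, not the trainer's actual code; `apply_missing_eos_penalty` and the token IDs are hypothetical.

```python
def apply_missing_eos_penalty(scores, completions, eos_token_id, penalty):
    """Subtract `penalty` from the score of each completion lacking an EOS token."""
    return [
        score if eos_token_id in tokens else score - penalty
        for score, tokens in zip(scores, completions)
    ]


completions = [[5, 8, 2], [5, 8, 9]]  # token ids; 2 plays the role of EOS here
scores = [0.7, 0.9]
penalized = apply_missing_eos_penalty(scores, completions, eos_token_id=2, penalty=1.0)
print(penalized)  # the second completion never emitted EOS, so its score drops
```

A larger penalty pushes the model harder toward finishing within `max_new_tokens`, at the cost of weighting length control more heavily relative to the judge or reward signal.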
### Logging Completions
To better understand your model's behavior during training, you can log sample completions periodically using the [`LogCompletionsCallback`].
```python
from trl import LogCompletionsCallback
trainer = NashMDTrainer(..., eval_dataset=eval_dataset)
completions_callback = LogCompletionsCallback(trainer, num_prompts=8)
trainer.add_callback(completions_callback)
```
This callback logs the model's generated completions directly to Weights & Biases.
![Logged Completions](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/wandb_completions.png)
## Example script
We provide an example script to train a model using the Nash-MD method. The script is available in [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py).
To test the Nash-MD script with the [Qwen2.5 0.5B model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), run the following command:
```bash
python examples/scripts/nash_md.py \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--judge pair_rm \
--dataset_name trl-lib/ultrafeedback-prompt \
--learning_rate 5.0e-7 \
--output_dir Qwen2.5-0.5B-NashMD-PairRM \
--warmup_ratio 0.1 \
--push_to_hub
```
## Logged metrics
While training and evaluating, we record the following reward metrics:
* `loss/kl`: The mean KL divergence between the model and reference data.
* `objective/entropy`: The mean entropy of the model and reference data.
* `loss/score`: The mean reinforce score loss.
* `rewards/chosen`: The mean scores (according to the reward model) of the model completions.
* `rewards/rejected`: The mean scores (according to the reward model) of the mixture completions.
* `rewards/probabilities`: The mean probability (according to the reward model or judge) of the model completions chosen vs the mixture completion.
* `rewards/accuracies`: The accuracies of Nash-MD's implicit reward model.
* `rewards/margins`: The mean reward margin (according to reward model) between the chosen and mixture completions.
* `logps/chosen`: The mean log probabilities of the chosen completions.
* `logps/rejected`: The mean log probabilities of the reference completions.
* `val/model_contain_eos_token`: The number of times the model's output contains the EOS token.
* `val/ref_contain_eos_token`: The number of times the mixture's output contains the EOS token.
* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but can be made dynamic by passing a list to [`NashMDConfig`].
* `mixture_coef`: Logit mixture coefficient for the model and reference model. Typically fixed, but can be made dynamic by passing a list to [`NashMDConfig`].
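To make `mixture_coef` concrete: Nash-MD samples the competing completion from a mixture of the model and the reference model. Below is a minimal sketch, assuming the mixture is formed by linearly interpolating the two models' next-token logits before the softmax (a geometric mixture of the two distributions). The function names and toy logits are illustrative, not TRL's API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_distribution(model_logits, ref_logits, mixture_coef):
    """Interpolate logits, then normalize: mixture_coef=0 -> model, 1 -> reference."""
    mixed = [
        (1.0 - mixture_coef) * m + mixture_coef * r
        for m, r in zip(model_logits, ref_logits)
    ]
    return softmax(mixed)


model_logits = [2.0, 0.5, -1.0]
ref_logits = [0.0, 1.0, 0.0]
for coef in (0.0, 0.5, 1.0):
    probs = mixture_distribution(model_logits, ref_logits, coef)
    print(coef, [round(p, 3) for p in probs])
```

Under this assumption, intermediate values of `mixture_coef` move the sampling distribution smoothly between the current policy and the reference model.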
## NashMDTrainer
[[autodoc]] NashMDTrainer
- train
- save_model
- push_to_hub
## NashMDConfig
[[autodoc]] NashMDConfig


@ -0,0 +1,270 @@
# Online DPO Trainer
[![model badge](https://img.shields.io/badge/All_models-Online_DPO-blue)](https://huggingface.co/models?other=online-dpo,trl)
## Overview
Online DPO was proposed in [Direct Language Model Alignment from Online AI Feedback](https://huggingface.co/papers/2402.04792) by Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, and Mathieu Blondel.
The abstract from the paper is the following:
> Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF), that do not require a separate reward model. However, the preference datasets used in DAP methods are usually collected ahead of training and never updated, thus the feedback is purely offline. Moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as annotator: on each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback. Despite its simplicity, we demonstrate via human evaluation in several tasks that OAIF outperforms both offline DAP and RLHF methods. We further show that the feedback leveraged in OAIF is easily controllable, via instruction prompts to the LLM annotator.
This post-training method was contributed by [Michael Noukhovitch](https://huggingface.co/mnoukhov), [Shengyi Costa Huang](https://huggingface.co/vwxyzjn), [Quentin Gallouédec](https://huggingface.co/qgallouedec), and [Edward Beeching](https://huggingface.co/edbeeching).
## Quick start
This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_online_dpo.py
from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO")
trainer = OnlineDPOTrainer(
model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_online_dpo.py
```
Distributed across 8 GPUs, the training takes approximately 1 hour. You can verify the training progress by checking the reward graph. An increasing trend in the rewards for both the rejected and the chosen completions indicates that the model is improving and generating better responses over time.
![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/online-dpo-qwen2.png)
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-OnlineDPO) performs, you can use the [Transformers Chat CLI](https://huggingface.co/docs/transformers/quicktour#chat-with-text-generation-models).
<pre><code>$ transformers chat trl-lib/Qwen2-0.5B-OnlineDPO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-OnlineDPO&gt;:</span></strong>
The best programming language depends on your specific needs and priorities. Some people prefer imperative programming languages (like Haskell or Lisp), while others prefer functional programming languages (like Scala or Python). It's important to consider your work style, programming environment, and project requirements when choosing a programming language.
</code></pre>
## Expected dataset type
Online DPO only requires a [prompt-only dataset](dataset_formats#prompt-only) (unlike offline DPO, which expects a [preference dataset](dataset_formats#preference)). The [`OnlineDPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
## Usage tips
### Use a reward model
Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
```diff
- from trl import PairRMJudge
+ from transformers import AutoModelForSequenceClassification
- judge = PairRMJudge()
+ reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
+ reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")
trainer = OnlineDPOTrainer(
...
- judge=judge,
+ reward_funcs=reward_model,
+ reward_processing_class=reward_tokenizer,
...
)
```
### Encourage EOS token generation
When using a reward model, we may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`OnlineDPOConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`OnlineDPOConfig`]:
```python
training_args = OnlineDPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
### Logging Completions
To better understand your model's behavior during training, you can log sample completions periodically using the [`LogCompletionsCallback`].
```python
from trl import LogCompletionsCallback
trainer = OnlineDPOTrainer(..., eval_dataset=eval_dataset)
completions_callback = LogCompletionsCallback(trainer, num_prompts=8)
trainer.add_callback(completions_callback)
```
This callback logs the model's generated completions directly to Weights & Biases.
![Logged Completions](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/wandb_completions.png)
## Example script
We provide an example script to train a model using the online DPO method. The script is available in [`examples/scripts/dpo_online.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_online.py).
To test the online DPO script with the [Qwen2.5 0.5B model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), run the following command:
```bash
python examples/scripts/dpo_online.py \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--judge pair_rm \
--dataset_name trl-lib/ultrafeedback-prompt \
--learning_rate 5.0e-7 \
--output_dir Qwen2.5-0.5B-Online-DPO-PairRM \
--warmup_ratio 0.1 \
--push_to_hub
```
## Logged metrics
While training and evaluating, we record the following reward metrics (see this example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/w4apmsi9)):
* `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current model and reference model.
* `objective/entropy`: The mean entropy of the model, indicating the randomness of the actions chosen by the model.
* `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
* `objective/rlhf_reward`: The mean RLHF reward, which is `scores - non_score_reward`. The `rlhf_reward` is the ultimate objective of online DPO training. If training works as intended, this metric should keep going up.
* `objective/scores`: The mean scores returned by the reward model.
* `objective/scores_margin`: The mean score margin (according to the external reward model) between the chosen and rejected completions.
* `rewards/chosen`: The mean reward (according to online DPO's implicit reward model) of the chosen completions.
* `rewards/rejected`: The mean reward (according to online DPO's implicit reward model) of the rejected completions.
* `rewards/accuracies`: The accuracies of online DPO's implicit reward model.
* `rewards/margins`: The mean reward margin (according to online DPO's implicit reward model) between the chosen and rejected completions.
* `logps/chosen`: The mean log probabilities of the chosen completions.
* `logps/rejected`: The mean log probabilities of the rejected completions.
* `val/contain_eos_token`: The fraction of completions which contain an EOS token.
* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but can be made dynamic by passing a list to [`OnlineDPOConfig`].
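The relationship between the `objective/*` metrics above can be sketched as follows. This illustrates the formulas as stated in the list, using made-up numbers and a hypothetical helper name; it is not the trainer's implementation.

```python
def objective_metrics(scores, per_token_kl, beta):
    """Return (non_score_reward, rlhf_reward), one value per completion.

    non_score_reward = beta * (sum of per-token KL)
    rlhf_reward      = score - non_score_reward
    """
    non_score = [beta * sum(kl) for kl in per_token_kl]
    rlhf = [s - n for s, n in zip(scores, non_score)]
    return non_score, rlhf


scores = [1.2, 0.4]                           # reward-model scores, one per completion
per_token_kl = [[0.1, 0.3, 0.2], [0.0, 0.5]]  # per-token KL to the reference model
non_score, rlhf = objective_metrics(scores, per_token_kl, beta=0.1)
print(non_score, rlhf)
```

With these definitions, `rlhf_reward` rises either when the reward model's score increases or when the policy stays closer to the reference model.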
## Benchmark experiments
To validate that the online DPO implementation works, we ran experiments with the Pythia 1B, 2.8B, and 6.9B models on a single node of 8 x H100s. Here are the commands we used to run the experiments. We take the SFT / RM models directly from [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
```shell
# 1B Online DPO experiment
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml \
examples/scripts/dpo_online.py \
--model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft \
--reward_model_path trl-lib/pythia-1b-deduped-tldr-rm \
--dataset_name trl-lib/tldr \
--learning_rate 5.0e-7 \
--output_dir pythia-1b-deduped-tldr-online-dpo \
--beta 0.1 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 2 \
--num_train_epochs 3 \
--max_new_tokens 53 \
--warmup_ratio 0.1 \
--missing_eos_penalty 1.0 \
--save_steps 0.1 \
--push_to_hub
# 2.8B Online DPO experiment
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/dpo_online.py \
--model_name_or_path trl-lib/pythia-2.8b-deduped-tldr-sft \
--reward_model_path trl-lib/pythia-2.8b-deduped-tldr-rm \
--dataset_name trl-lib/tldr \
--learning_rate 5.0e-7 \
--output_dir pythia-2.8b-deduped-tldr-online-dpo \
--beta 0.1 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 2 \
--num_train_epochs 3 \
--max_new_tokens 53 \
--warmup_ratio 0.1 \
--missing_eos_penalty 1.0 \
--save_steps 0.1 \
--push_to_hub
# 6.9B Online DPO experiment
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/dpo_online.py \
--model_name_or_path trl-lib/pythia-6.9b-deduped-tldr-sft \
--reward_model_path trl-lib/pythia-6.9b-deduped-tldr-rm \
--dataset_name trl-lib/tldr \
--learning_rate 5.0e-7 \
--output_dir pythia-6.9b-deduped-tldr-online-dpo \
--beta 0.1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--num_train_epochs 3 \
--max_new_tokens 53 \
--warmup_ratio 0.1 \
--missing_eos_penalty 1.0 \
--gradient_checkpointing \
--save_steps 0.1 \
--push_to_hub
```
Checkpoints and experiment tracking are available at:
* [🤗 Model checkpoints](https://huggingface.co/collections/trl-lib/online-dpo-66acd3fa38a331a9cd457b07)
* [🐝 Tracked experiment](https://wandb.ai/huggingface/trl/reports/Online-DPO-experiments-for-TL-DR-summarisation--Vmlldzo5MTczMDU0)
To evaluate, we use [vLLM](https://github.com/vllm-project/vllm) to load the checkpoints and GPT-4o mini as a judge model to evaluate the generated TL;DR against the reference TL;DR.
For more information on how to use judges, see [Judges](judges).
```bash
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 33.00%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-6.9b-deduped-tldr-sft --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 41.50%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-1b-deduped-tldr-online-dpo --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 62.60%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-6.9b-deduped-tldr-online-dpo --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 74.20%
```
We can then plot the RLHF scaling chart.
```python
import matplotlib.pyplot as plt
results = {
"SFT": {1.0e9: 0.21, 2.8e9: 0.27, 6.9e9: 0.316},
"online-dpo": {1.0e9: 0.542, 2.8e9: 0.746, 6.9e9: 0.796},
"offline-dpo": {1.0e9: 0.422, 2.8e9: 0.517, 6.9e9: 0.701},
}
plt.plot(results["SFT"].keys(), results["SFT"].values(), label="SFT", marker="o")
plt.plot(results["online-dpo"].keys(), results["online-dpo"].values(), label="Online-dpo with RM judge", marker="o")
plt.plot(results["offline-dpo"].keys(), results["offline-dpo"].values(), label="Offline-dpo", marker="o")
plt.axhline(y=0.5, color="black", linestyle="-.", label="Human reference summary")
plt.xscale("log")
plt.xlabel("Model size")
plt.ylabel("Win rate against reference summaries\n(according to GPT-4-0613)")
plt.title("DPO scaling by model size")
plt.legend()
plt.xlim(5e8, 1.2e10)
plt.xticks([1e9, 3e9, 1e10], ["1B", "3B", "10B"])
plt.grid(True, which="both", ls="--", c="0.7")
plt.tight_layout()
plt.show()
```
The online DPO checkpoints achieve a higher win rate as the model size increases. This is a good sign that the online DPO implementation is working as intended.
## OnlineDPOTrainer
[[autodoc]] OnlineDPOTrainer
- train
- save_model
- push_to_hub
## OnlineDPOConfig
[[autodoc]] OnlineDPOConfig

131
docs/source/orpo_trainer.md Normal file

@ -0,0 +1,131 @@
# ORPO Trainer
[![model badge](https://img.shields.io/badge/All_models-ORPO-blue)](https://huggingface.co/models?other=orpo,trl) [![model badge](https://img.shields.io/badge/smol_course-Chapter_2-yellow)](https://github.com/huggingface/smol-course/tree/main/2_preference_alignment)
## Overview
Odds Ratio Preference Optimization (ORPO) was introduced in [ORPO: Monolithic Preference Optimization without Reference Model](https://huggingface.co/papers/2403.07691) by [Jiwoo Hong](https://huggingface.co/JW17), [Noah Lee](https://huggingface.co/nlee-208), and [James Thorne](https://huggingface.co/j6mes).
The abstract from the paper is the following:
> While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval_{2.0} (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-alpha (7B) and Mistral-ORPO-beta (7B).
It studies the crucial role of SFT within the context of preference alignment. Using preference data, the method posits that a minor penalty for the disfavored generation, together with a strong adaptation signal to the chosen response via a simple log odds ratio term appended to the NLL loss, is sufficient for preference-aligned SFT.
ORPO is thus a reference-model-free preference optimization algorithm that eliminates the need for an additional preference alignment phase, saving compute and memory.
The official code can be found in [xfactlab/orpo](https://github.com/xfactlab/orpo).
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Lewis Tunstall](https://huggingface.co/lewtun) and [Alvaro Bartolome](https://huggingface.co/alvarobartt).
## Quick start
This example demonstrates how to train a model using the ORPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. We use the preference data from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_orpo.py
from datasets import load_dataset
from trl import ORPOConfig, ORPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = ORPOConfig(output_dir="Qwen2-0.5B-ORPO")
trainer = ORPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_orpo.py
```
Distributed across 8 GPUs, the training takes approximately 30 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
![orpo qwen2 reward margin](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/orpo-qwen2-reward-margin.png)
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-ORPO) performs, you can use the [Transformers Chat CLI](https://huggingface.co/docs/transformers/quicktour#chat-with-text-generation-models).
<pre><code>$ transformers chat trl-lib/Qwen2-0.5B-ORPO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-ORPO&gt;:</span></strong>
It's challenging to determine the best programming language as no one language is perfect, as the complexity of a task and the type of project are significant factors. Some popular languages include Java, Python, JavaScript, and
C++. If you have specific needs or requirements for a specific project, it's important to choose the language that best suits those needs.
Here are some other factors to consider when choosing a programming language for a project:
<strong><span style="color: green;">• Language proficiency:</span></strong> A good programming language is more likely to be easy to understand and use, and will allow developers to collaborate on projects more efficiently.
<strong><span style="color: green;">• Ease of use:</span></strong> There are tools and libraries available to make programming more accessible, so developers should choose a language that can help them get started easier.
<strong><span style="color: green;">• Code readability:</span></strong> A clear and concise codebase should be easy to read and understand, especially when working with large projects.
<strong><span style="color: green;">• Tool and framework support:</span></strong> There are numerous libraries available for Python, Java, and JavaScript, along with tools like IDEs and static code analysis tools.
<strong><span style="color: green;">• Accessibility:</span></strong> Some languages and tools have features that make them more accessible to developers with disabilities, such as support for screen readers.
<strong><span style="color: green;">• Version control:</span></strong> As your projects grow and complexity increases, version control tools can be beneficial for tracking changes.
</code></pre>
## Expected dataset type
ORPO requires a [preference dataset](dataset_formats#preference). The [`ORPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
Although the [`ORPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.
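To illustrate the difference, here is a minimal sketch with standard-format strings. With an implicit prompt, the shared beginning of `"chosen"` and `"rejected"` acts as the prompt. The helper below uses a simple longest-common-prefix rule as an assumption for illustration; it is not TRL's extraction code, and the example texts are made up.

```python
import os

# Explicit prompt: the prompt has its own column.
explicit = {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}

# Implicit prompt: the prompt is embedded at the start of both completions.
implicit = {"chosen": "The sky is blue.", "rejected": "The sky is green."}


def split_implicit_prompt(example):
    """Hypothetical helper: treat the longest common character prefix as the prompt."""
    prefix = os.path.commonprefix([example["chosen"], example["rejected"]])
    return {
        "prompt": prefix,
        "chosen": example["chosen"][len(prefix):],
        "rejected": example["rejected"][len(prefix):],
    }


recovered = split_implicit_prompt(implicit)
print(recovered)
```

Note that a character-level prefix can cut in the middle of a word when the two completions happen to start alike, which is one way an implicit prompt can be ambiguous where an explicit one is not.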
## Example script
We provide an example script to train a model using the ORPO method. The script is available in [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py).
To test the ORPO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), run the following command:
```bash
accelerate launch examples/scripts/orpo.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_train_epochs 1 \
--output_dir Qwen2-0.5B-ORPO
```
## Usage tips
### For Mixture of Experts Models: Enabling the auxiliary loss
MoEs are most efficient when the load is roughly evenly distributed between experts.
To ensure that we train MoEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
This option is enabled by setting `output_router_logits=True` in the model config (e.g. [`~transformers.MixtralConfig`]).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`) in the model config.
## Logged metrics
While training and evaluating, we record the following reward metrics:
- `rewards/chosen`: the mean log probabilities of the policy model for the chosen responses scaled by beta
- `rewards/rejected`: the mean log probabilities of the policy model for the rejected responses scaled by beta
- `rewards/accuracies`: the mean of how often the chosen rewards are greater than the corresponding rejected rewards
- `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
- `log_odds_chosen`: the mean log odds ratio of the chosen responses over the rejected responses
- `log_odds_ratio`: the mean of the `log(sigmoid(log_odds_chosen))`
- `nll_loss`: the mean negative log likelihood loss from the SFT part of the loss over chosen responses
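The metrics above map onto the ORPO loss roughly as follows. This is a minimal sketch for a single preference pair, assuming `logp_chosen` and `logp_rejected` are the length-averaged log probabilities of the two responses under the policy. All names and numbers are illustrative, not TRL's internals.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def orpo_terms(logp_chosen, logp_rejected, nll_loss, beta=0.1):
    """Compute ORPO quantities for one pair of (average) log probabilities."""
    # log odds(p) = log p - log(1 - p), so the log odds ratio is the difference
    # of the two log-odds terms.
    log_odds_chosen = (logp_chosen - logp_rejected) - (
        math.log1p(-math.exp(logp_chosen)) - math.log1p(-math.exp(logp_rejected))
    )
    log_odds_ratio = log_sigmoid(log_odds_chosen)
    return {
        "rewards/chosen": beta * logp_chosen,
        "rewards/rejected": beta * logp_rejected,
        "rewards/margins": beta * (logp_chosen - logp_rejected),
        "log_odds_chosen": log_odds_chosen,
        "log_odds_ratio": log_odds_ratio,
        # SFT term plus the weighted odds-ratio penalty
        "loss": nll_loss - beta * log_odds_ratio,
    }


terms = orpo_terms(logp_chosen=-0.5, logp_rejected=-1.5, nll_loss=0.5)
for name, value in terms.items():
    print(f"{name}: {value:.4f}")
```

Because `log_odds_ratio` is always negative, the odds-ratio term adds a positive penalty to the NLL loss that shrinks as the chosen response becomes more likely than the rejected one.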
## ORPOTrainer
[[autodoc]] ORPOTrainer
- train
- save_model
- push_to_hub
## ORPOConfig
[[autodoc]] ORPOConfig

9
docs/source/others.md Normal file

@ -0,0 +1,9 @@
# Other
## profiling_decorator
[[autodoc]] extras.profiling.profiling_decorator
## profiling_context
[[autodoc]] extras.profiling.profiling_context

582
docs/source/paper_index.md Normal file

@ -0,0 +1,582 @@
# Paper Index
> [!WARNING]
> Section under construction. Feel free to contribute!
## Group Relative Policy Optimization
Papers relating to the [`GRPOTrainer`]
### Group Sequence Policy Optimization
**📜 Paper**: https://huggingface.co/papers/2507.18071
GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token. To reproduce the paper's setting, use this configuration:
```python
from trl import GRPOConfig
training_args = GRPOConfig(
importance_sampling_level="sequence",
loss_type="grpo",
beta=0.0, # GSPO set KL regularization to zero: https://github.com/volcengine/verl/pull/2775#issuecomment-3131807306
epsilon=3e-4, # GSPO paper (v2), section 5.1
epsilon_high=4e-4, # GSPO paper (v2), section 5.1
gradient_accumulation_steps=1,
steps_per_generation=4, # partition rollout batch into 4 mini-batches. GSPO paper (v2), section 5.1. Must be 4 times gradient_accumulation_steps
)
```
Note that this method only has an effect when training goes slightly off-policy—for example, when `steps_per_generation > gradient_accumulation_steps` or `num_iterations > 1`. Otherwise, it is effectively equivalent to no modification.
TRL also provides an experimental implementation of GSPO-token, see [Experimental - GSPO-Token](experimental#gspo-token).
#### Policy ratio: GRPO vs. GSPO
In GSPO, the policy ratio is defined at the sequence level. In other words, it is the ratio between the probability that the current policy generates a sequence and the probability that the old policy generates that same sequence.
The sequence likelihood is defined as:
$$
\pi_\theta (o_i | q) = \prod_{t=1}^{|o_i|} \pi_\theta (o_{i,t} | q, o_{i, < t} ),
$$
where \\( \pi_\theta \\) is the policy with parameters \\( \theta \\), \\( o_i \\) is the \\( i \\)-th output sequence, \\( o_{i,t} \\) is the \\( t \\)-th token in this sequence, and \\( q \\) is the input query. The sequence likelihood ratio \\( s_i (\theta) \\) is defined as:
$$
s_i (\theta) = \left(\frac{\pi_\theta (o_i | q)}{\pi_{\theta_{\text{old}}} (o_i | q)} \right)^{\frac{1}{|o_i|}}
$$
The exponent \\( \frac{1}{|o_i|} \\) applies a sequence-length normalization, which reduces the influence of sequence length on the ratio. In other words, \\( s_i (\theta) \\) is the geometric mean of the per-token probability ratios, which allows a fair comparison across sequences of varying lengths.
While GSPO defines the policy ratio at the sequence level, GRPO operates at the token level. Specifically, GRPO computes an importance ratio for each token in the sequence:
$$
w_{i,t}(\theta) = \frac{\pi_\theta (o_{i,t} | q, o_{i,< t})}{\pi_{\theta_{\text{old}}} (o_{i,t} | q, o_{i,< t})}
$$
This token-level ratio is then combined with a shared advantage \\( \hat{A}_i \\), and the GRPO objective clips and optimizes each token independently across the sequence.
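To make the difference concrete, here is a minimal sketch that computes both ratios from per-token log-probabilities in plain Python. The function names and the example probabilities are illustrative assumptions and are not part of TRL.

```python
import math


def token_ratios(new_logps, old_logps):
    """GRPO-style importance ratios w_{i,t}, one per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps, old_logps):
    """GSPO-style ratio s_i: length-normalized sequence likelihood ratio."""
    log_ratio_sum = sum(n - o for n, o in zip(new_logps, old_logps))
    # Dividing the summed log-ratio by |o_i| is the 1/|o_i| exponent.
    return math.exp(log_ratio_sum / len(new_logps))


# Hypothetical per-token probabilities under the current and old policies.
new_logps = [math.log(0.5), math.log(0.2), math.log(0.9)]
old_logps = [math.log(0.4), math.log(0.25), math.log(0.9)]

w = token_ratios(new_logps, old_logps)   # individual tokens move up and down
s = sequence_ratio(new_logps, old_logps)  # geometric mean of the token ratios
```

In this example the token-level ratios differ from 1 in opposite directions, while their geometric mean, the sequence-level ratio, stays at 1.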
### DAPO: An Open-Source LLM Reinforcement Learning System at Scale
**📜 Paper**: https://huggingface.co/papers/2503.14476
The DAPO algorithm includes 5 key components:
- Overlong Filtering
- Clip-Higher
- Soft Overlong Punishment
- Token-level Loss
- Dynamic Sampling (⚠ Not supported in TRL)
To reproduce the paper's setting, use this configuration:
```python
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import get_soft_overlong_punishment
training_args = GRPOConfig(
# Overlong Filtering
mask_truncated_completions=True,
# Token-level Loss
loss_type="dapo",
# Clip-Higher
epsilon_high=0.28, # DAPO paper: section 4.1
epsilon=0.2, # DAPO paper: section 4.1
# Other parameters used
per_device_train_batch_size=512, # mini-batch size for training in the paper, DAPO paper: section 4.1
num_generations=16, # number of sample responses in the paper, DAPO paper: section 4.1
max_completion_length=20480, # maximum number of tokens for generation in the paper, DAPO paper: section 4.1
beta=0.0 # section 2.3, DAPO paper
)
# Soft Overlong Punishment
sop_reward = get_soft_overlong_punishment(max_completion_len=20480, soft_punish_cache=4096) # DAPO paper: section 4.1
trainer = GRPOTrainer(
...,
args=training_args,
reward_funcs=[..., sop_reward],
)
```
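For intuition, the Soft Overlong Punishment can be sketched as a piecewise length penalty: zero below a soft threshold, linearly decreasing inside the cache window, and -1 beyond the maximum length. The function below is a standalone illustration of that shape with a hypothetical name; it is not TRL's implementation.

```python
def soft_overlong_penalty(completion_len, max_completion_len, soft_punish_cache):
    """Length penalty in [-1, 0] (illustrative sketch, hypothetical helper)."""
    soft_start = max_completion_len - soft_punish_cache
    if completion_len <= soft_start:
        # Short enough: no penalty.
        return 0.0
    if completion_len <= max_completion_len:
        # Inside the cache window: penalty grows linearly from 0 to -1.
        return (soft_start - completion_len) / soft_punish_cache
    # Longer than the maximum length: full penalty.
    return -1.0


# Same lengths as in the configuration above.
halfway = soft_overlong_penalty(18432, max_completion_len=20480, soft_punish_cache=4096)
```

With a maximum length of 20480 and a cache of 4096, the penalty starts at 16384 tokens, so a completion of 18432 tokens sits halfway through the window.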
### Dr. GRPO: Understanding R1-Zero-Like Training: A Critical Perspective
**📜 Paper**: https://huggingface.co/papers/2503.20783
A study of R1-Zero training identifies pretraining effects on RL performance and proffers Dr. GRPO to enhance token efficiency, achieving superior accuracy on AIME 2024. To reproduce the paper's setting, use this configuration:
```python
from trl import GRPOConfig
training_args = GRPOConfig(
loss_type="dr_grpo",
per_device_train_batch_size=1, # train_batch_size_per_device in the Training section of the repository
num_generations=8, # num_samples in the Training section of the repository
max_prompt_length=1024, # prompt_max_length in the Training section of the repository
max_completion_length=3000, # generate_max_length in the Training section of the repository
beta=0.0, # beta in the Training section of the repository
)
```
### Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)
**📜 Paper**: https://huggingface.co/papers/2508.08221
The authors of this paper find that the combination of:
1. scaling rewards by the standard deviation computed over the entire batch and
2. aggregating loss over the total number of tokens
can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and [DAPO](https://huggingface.co/papers/2503.14476).
TRL supports using these learnings to train a GRPO model by:
```python
from trl import GRPOConfig
training_args = GRPOConfig(
...,
scale_rewards="batch",
loss_type="dapo",
# Other parameters used
beta=0.0, # = init_kl_coef in the paper
top_p=0.99,
top_k=100,
temperature=0.99,
num_generations=8, # = num_return_sequences in the paper
num_iterations=1, # = ppo_epochs in the paper
per_device_train_batch_size=4,
gradient_accumulation_steps=32,
steps_per_generation=8, # (rollout_batch_size*num_return_sequences) / (per_device_train_batch_size*gradient_accumulation_steps)
)
```
Note that when using gradient accumulation, the loss is aggregated over the total number of tokens in the batch, but not over the accumulated batch. For more details, see the [GRPO Trainer - Loss types](grpo_trainer#loss_types).
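The first ingredient, batch-level reward scaling, can be illustrated with a small sketch. It assumes rewards are stored prompt by prompt, centers each reward on its group mean, and divides by either the group or the batch standard deviation. The function name, the population standard deviation, and the `1e-4` stabilizer are assumptions made for illustration and may differ from TRL's internals.

```python
import statistics


def scaled_advantages(rewards, num_generations, scale="group", eps=1e-4):
    """Center on the group mean, then scale by the group or batch std (sketch)."""
    batch_std = statistics.pstdev(rewards)
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = statistics.fmean(group)
        std = batch_std if scale == "batch" else statistics.pstdev(group)
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Two hypothetical prompts with four completions each.
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
adv_group = scaled_advantages(rewards, num_generations=4, scale="group")
adv_batch = scaled_advantages(rewards, num_generations=4, scale="batch")
```

With `scale="batch"`, every completion is divided by the same standard deviation, so groups with little reward variance are not amplified relative to the others.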
### Truncated Importance Sampling
**📰 Blog**: https://fengyao.notion.site/off-policy-rl
Online policy learning methods commonly use an optimized inference framework for rollout generation (e.g., vLLM) that is separate from the training backend. This introduces a rollout-training mismatch, exemplified in the following PPO objective:
$$
\small{
\mathbb{E}_{a\sim\textcolor{red}{\pi_{\text{inference}}}(\theta_{\mathrm{old}})}
\Bigl[
\min\Bigl(
\frac{\textcolor{blue}{\pi_{\text{training}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{training}}}(a, \theta_{\mathrm{old}})}\,\hat A,
\;\mathrm{clip}\bigl(\frac{\textcolor{blue}{\pi_{\text{training}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{training}}}(a, \theta_{\mathrm{old}})},\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A
\Bigr)
\Bigr]
}
$$
Despite \\( \textcolor{red}{\pi_{\text{inference}}} \\) and \\( \textcolor{blue}{\pi_{\text{training}}} \\) sharing the same model parameters \\( \theta \\), they can produce significantly different token probabilities. This unexpected behavior implicitly breaks the on-policy assumption, and silently turns training off-policy.
Truncated Importance Sampling (TIS) addresses this issue by adapting the model update via importance-sampling correction. The gradient computation of the aforementioned PPO objective becomes
$$
\small{
\mathbb{E}_{a\sim\textcolor{red}{\pi_{\text{inference}}}(\theta_{\mathrm{old}})}
\Bigl[
\underbrace{\min(\frac{\textcolor{blue}{\pi_{\text{training}}}(a, \theta_{\mathrm{old}})}{\textcolor{red}{\pi_{\text{inference}}}(a, \theta_{\mathrm{old}})}, C)}_{\text{truncated importance ratio}} \cdot
\nabla_\theta
\min\Bigl(
\frac{\textcolor{blue}{\pi_{\text{training}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{training}}}(a, \theta_{\mathrm{old}})}\,\hat A,
\;\mathrm{clip}\bigl(\frac{\textcolor{blue}{\pi_{\text{training}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{training}}}(a, \theta_{\mathrm{old}})},\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A
\Bigr)
\Bigr]
}
$$
where \\( C \\) is a hyper-parameter. In TRL, TIS is implemented for GRPO, and enabled by default when vLLM is used for generation (`use_vllm=True`):
```python
from trl import GRPOConfig
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_importance_sampling_correction=True, # default True
vllm_importance_sampling_cap=2.0, # hyper-parameter C
)
```
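As a rough illustration of the truncated importance ratio in the equation above, the sketch below computes it for a single sampled token from the log-probabilities assigned by the training and inference backends. The function name and the example probabilities are hypothetical; this is not TRL's code.

```python
import math


def truncated_importance_ratio(training_logp, inference_logp, cap=2.0):
    """min(pi_training / pi_inference, C) for one sampled token (sketch)."""
    ratio = math.exp(training_logp - inference_logp)
    return min(ratio, cap)


# The training backend assigns a much higher probability: the ratio is capped at C.
capped = truncated_importance_ratio(math.log(0.3), math.log(0.1), cap=2.0)

# The training backend assigns a lower probability: the ratio passes through.
uncapped = truncated_importance_ratio(math.log(0.1), math.log(0.2), cap=2.0)
```

The cap only bounds the ratio from above, which limits the variance introduced by tokens that the inference engine sampled with unusually low probability.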
### Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
**📜 Paper**: https://huggingface.co/papers/2508.09726
See [Experimental - GFPO](experimental#gfpo).
## Direct Preference Optimization
Papers relating to the [`DPOTrainer`]
### Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model
**📜 Paper**: https://huggingface.co/papers/2305.18290
Direct Preference Optimization (DPO) fine-tunes language models more efficiently and with better performance compared to reinforcement learning from human feedback (RLHF), by directly optimizing policy training based on human preferences. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="sigmoid", # losses in Appendix B of the paper
per_device_train_batch_size=64, # batch size in Appendix B of the paper
learning_rate=1e-6, # learning rate in Appendix B of the paper
beta=0.1, # beta in Appendix B of the paper
)
```
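For reference, the `sigmoid` loss for a single preference pair can be sketched in a few lines. The snippet assumes sequence-level log-probabilities are already available for the policy and the reference model; the function and argument names are illustrative and not TRL's API.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) (sketch)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = chosen_logratio - rejected_logratio
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * logits))


# Policy identical to the reference: the loss is log(2).
at_init = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen and lowers the rejected completion: the loss decreases.
improved = dpo_sigmoid_loss(-8.0, -14.0, -10.0, -12.0)
```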
### A General Theoretical Paradigm to Understand Learning from Human Preferences
**📜 Paper**: https://huggingface.co/papers/2310.12036
A new general objective, \\( \Psi \\)PO, bypasses both key approximations in reinforcement learning from human preferences, allowing for theoretical analysis and empirical superiority over DPO. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="ipo", # Section 5.1 of the paper
per_device_train_batch_size=90, # mini-batch size in Section C.1 of the paper
learning_rate=1e-2, # learning rate in Section C.1 of the paper
)
```
These parameters only appear in the [published version](https://proceedings.mlr.press/v238/gheshlaghi-azar24a/gheshlaghi-azar24a.pdf).
### SLiC-HF: Sequence Likelihood Calibration with Human Feedback
**📜 Paper**: https://huggingface.co/papers/2305.10425
Sequence Likelihood Calibration (SLiC) is shown to be an effective and simpler alternative to Reinforcement Learning from Human Feedback (RLHF) for learning from human preferences in language models. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="hinge", # Section 2 of the paper
per_device_train_batch_size=512, # batch size in Section 3.2 of the paper
learning_rate=1e-4, # learning rate in Section 3.2 of the paper
)
```
These parameters only appear in the [published version](https://openreview.net/pdf?id=0qSOodKmJaN)
### Towards Efficient and Exact Optimization of Language Model Alignment
**📜 Paper**: https://huggingface.co/papers/2402.00856
The efficient exact optimization (EXO) method is proposed to align language models with human preferences, providing a guaranteed and efficient alternative to reinforcement learning and direct preference optimization. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="exo_pair", # Section 3.2 of the paper
per_device_train_batch_size=64, # batch size in Section B of the paper
learning_rate=1e-6, # learning rate in Section B of the paper
beta=0.1, # $\beta_r$ in Section B of the paper
)
```
### Noise Contrastive Alignment of Language Models with Explicit Rewards
**📜 Paper**: https://huggingface.co/papers/2402.05369
A framework using Noise Contrastive Estimation enhances language model alignment with both scalar rewards and pairwise preferences, demonstrating advantages over Direct Preference Optimization. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="nca_pair", # Section 4.1 of the paper
per_device_train_batch_size=32, # batch size in Section C of the paper
learning_rate=5e-6, # learning rate in Section C of the paper
beta=0.01, # $\alpha$ in Section C of the paper
)
```
### Provably Robust DPO: Aligning Language Models with Noisy Feedback
**📜 Paper**: https://huggingface.co/papers/2403.00409
The paper introduces a robust direct preference optimization (rDPO) framework to address noise in preference-based feedback for language models, proving its sub-optimality gap and demonstrating its effectiveness through experiments. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="robust", # Section 3.1 of the paper
per_device_train_batch_size=16, # batch size in Section B of the paper
learning_rate=1e-3, # learning rate in Section B of the paper
beta=0.01, # $\beta$ in Section B of the paper,
max_prompt_length=128, # max prompt length in Section B of the paper
max_length=512, # max length in Section B of the paper
label_smoothing=0.1 # label smoothing $\epsilon$ in section 6 of the paper
)
```
### Binary Classifier Optimization for Large Language Model Alignment
**📜 Paper**: https://huggingface.co/papers/2404.04656
Theoretical analysis and a new algorithm, Binary Classifier Optimization, explain and enhance the alignment of large language models using binary feedback signals. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="bco_pair", # Section 4 of the paper
per_device_train_batch_size=128, # batch size in Section C of the paper
learning_rate=5e-7, # learning rate in Section C of the paper
beta=0.01, # $\beta$ in Section C of the paper,
max_prompt_length=1536, # max prompt length in Section C of the paper
max_completion_length=512, # max completion length in Section C of the paper
)
```
For the unpaired version, the user should utilize [`BCOConfig`] and [`BCOTrainer`].
### Self-Play Preference Optimization for Language Model Alignment
**📜 Paper**: https://huggingface.co/papers/2405.00675
A self-play method called SPPO for language model alignment achieves state-of-the-art performance by approximating Nash equilibrium policy in a constant-sum game setting, outperforming other approaches with limited data. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="sppo_hard", # Section 3 of the paper
per_device_train_batch_size=64, # batch size in Section C of the paper
learning_rate=5e-7, # learning rate in Section C of the paper
)
```
### Distributional Preference Alignment of LLMs via Optimal Transport
**📜 Paper**: https://huggingface.co/papers/2406.05882
Alignment via Optimal Transport (AOT) aligns large language models distributionally by penalizing violations of stochastic dominance between positive and negative sample distributions, achieving state-of-the-art performance on alignment benchmarks. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="aot", # Section 3 of the paper
)
```
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="aot_pair", # Section 3 of the paper
)
```
There is no additional hyperparameter in the paper.
### Discovering Preference Optimization Algorithms with and for Large Language Models
**📜 Paper**: https://huggingface.co/papers/2406.08414
An LLM-driven method automatically discovers performant preference optimization algorithms, leading to a new algorithm called DiscoPOP that blends logistic and exponential losses. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="discopop", # Section 3 of the paper
per_device_train_batch_size=64, # batch size in Section B.1 of the paper
learning_rate=5e-7, # learning rate in Section B.1 of the paper
beta=0.05, # $\beta$ in Section B.1 of the paper,
discopop_tau=0.05 # $\tau$ in Section E of the paper
)
```
### Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
**📜 Paper**: https://huggingface.co/papers/2408.06266
CLAIR and APO enhance LLM alignment through more contrastive preference pairs and controlled alignment objectives, improving model performance close to GPT4-turbo. To reproduce the paper's setting, use this configuration:
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="apo_zero", # Section 4 of the paper
per_device_train_batch_size=64, # batch size in Section B.1 of the paper
learning_rate=2e-7, # learning rate in Section 5.2 of the paper
beta=0.1, # $\beta$ in Section 5.2 of the paper,
max_prompt_length=512, # prompt length in Section 5.2 of the paper
max_completion_length=512, # completion length in Section 5.2 of the paper
)
```
```python
from trl import DPOConfig
training_args = DPOConfig(
loss_type="apo_down", # Section 4 of the paper
per_device_train_batch_size=64, # batch size in Section B.1 of the paper
learning_rate=2e-7, # learning rate in Section 5.2 of the paper
beta=0.1, # $\beta$ in Section 5.2 of the paper,
max_prompt_length=512, # prompt length in Section 5.2 of the paper
max_completion_length=512, # completion length in Section 5.2 of the paper
)
```
These parameters only appear in the [published version](https://aclanthology.org/2025.tacl-1.22.pdf)
## Supervised Fine-Tuning
Papers relating to the [`SFTTrainer`]
### EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
**📜 Paper**: https://huggingface.co/papers/2508.00180
Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can use the [`BEMACallback`]:
```python
from trl import BEMACallback, SFTTrainer
trainer = SFTTrainer(
...,
callbacks=[BEMACallback()],
)
```
### On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
**📜 Paper**: https://huggingface.co/papers/2508.05629
Dynamic Fine-Tuning (DFT) improves the generalization of Large Language Models (LLMs) by dynamically rescaling gradients, outperforming standard Supervised Fine-Tuning (SFT) and showing competitive results in offline reinforcement learning.
$$
\mathcal{L}_{\text{DFT}}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ - \sum_{t=1}^{|y|} \textcolor{red}{\text{sg}\big(\pi_\theta(y_t \mid y_{<t}, x)\big)} \; \log \pi_\theta(y_t \mid y_{<t}, x) \right]
$$
where \\( \text{sg}(\cdot) \\) is the stop-gradient operator. To use DFT with SFT as described in the paper, you can use the `loss_type="dft"` argument:
```python
from trl import SFTConfig
training_args = SFTConfig(
loss_type="dft",
...
)
```
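To see what the stop-gradient weighting changes, the sketch below compares the per-token SFT and DFT losses and their gradients with respect to the token log-probability. Because \\( \text{sg}(\cdot) \\) treats the weight as a constant, the DFT gradient is the SFT gradient rescaled by the token probability. The function names are illustrative assumptions, not TRL's API.

```python
import math


def sft_sequence_loss(token_probs):
    """Standard SFT loss: sum of -log p_t."""
    return -sum(math.log(p) for p in token_probs)


def dft_sequence_loss(token_probs):
    """DFT loss value: sum of -sg(p_t) * log p_t."""
    return -sum(p * math.log(p) for p in token_probs)


def sft_grad_wrt_logp(p):
    """d(-log p) / d(log p): every token gets the same weight."""
    return -1.0


def dft_grad_wrt_logp(p):
    """d(-sg(p) * log p) / d(log p): sg(p) is a constant, so the weight is p."""
    return -p


# Hypothetical probabilities the model assigns to two target tokens.
probs = [0.5, 0.25]
sft_value = sft_sequence_loss(probs)
dft_value = dft_sequence_loss(probs)
```

Tokens the model already finds unlikely therefore contribute smaller gradients under DFT than under standard SFT.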
To closely match the paper's setup, you can use the following configuration (see Sec. 4.1). The authors also mention that the method is not very sensitive to the hyperparameters (Sec. 4.3):
```python
from trl import SFTConfig
training_args = SFTConfig(
loss_type="dft",
learning_rate=5e-5,
max_length=2048,
# Target batch size 256; achieved via per-device batch 8 * grad accumulation 32
per_device_train_batch_size=8,
gradient_accumulation_steps=32,
)
```
## Reinforce Leave-One-Out
Papers relating to the [`RLOOTrainer`]
### Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
**📜 Paper**: https://huggingface.co/papers/2402.14740
RLOO is a variant of REINFORCE that reduces variance by using leave-one-out baselines. It computes rewards by comparing each sample against the average of all other samples in the batch, providing more stable gradients than standard REINFORCE. To reproduce the paper's setting, use this configuration:
```python
from trl import RLOOConfig
training_args = RLOOConfig(
per_device_train_batch_size=512, # section C Training Detail of the paper
steps_per_generation=2, # section C Training Detail of the paper
beta=0.03, # section C Training Detail of the paper
num_generations=2, # the paper's experiments use num_generations in {2, 4}
learning_rate=1e-6, # section C Training Detail of the paper
)
```
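The leave-one-out baseline itself is simple to sketch. The snippet below assumes a list of scalar rewards for the completions sampled from one prompt and subtracts, from each reward, the mean of the other rewards. The function name is hypothetical and the code is an illustration, not TRL's implementation.

```python
def rloo_advantages(rewards):
    """Reward minus the mean reward of the other samples (sketch)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    # (total - r) / (k - 1) is the average of all samples except the current one.
    return [r - (total - r) / (k - 1) for r in rewards]


# Four hypothetical completions for the same prompt.
advantages = rloo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because each baseline excludes the sample it is applied to, the baseline does not depend on that sample, and the advantages sum to zero across the samples.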
## Contrastive Preference Optimization
Papers relating to the [`CPOTrainer`]
### AlphaPO -- Reward shape matters for LLM alignment
**📜 Paper**: https://huggingface.co/papers/2501.03884
AlphaPO is a new Direct Alignment Algorithm (DAA) that uses an alpha parameter to change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. To reproduce the paper's setting, use this configuration:
```python
from trl import CPOConfig
# Mistral-Instruct from Table 3 of the paper
training_args = CPOConfig(
loss_type="alphapo",
alpha=0.25,
beta=2.5,
simpo_gamma=0.1,
learning_rate=7e-7,
...
)
```
## Reward Modeling
Papers relating to the [`RewardTrainer`]
### Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
**📜 Paper**: https://huggingface.co/papers/2312.09244
This paper proposed an auxiliary loss function designed to directly learn a centered reward model. This auxiliary loss minimizes the squared sum of the rewards, encouraging the model to naturally produce mean-zero outputs and thereby resolving the issue of underdetermination.
$$
\mathcal{L}(\theta) = - \mathbb{E}_{(x,y^+,y^-) \sim \mathcal{D}} \left[ \log \sigma(r_\theta(x, y^+) - r_\theta(x, y^-)) \textcolor{red}{- \eta \cdot (r_\theta(x, y^+) + r_\theta(x, y^-))^2} \right].
$$
To use this auxiliary loss with [`RewardTrainer`], you can use the `center_rewards_coefficient` argument in [`RewardConfig`] as follows:
```python
from trl import RewardConfig
training_args = RewardConfig(
center_rewards_coefficient=0.01, # η in the paper
...
)
```
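As an illustration of the objective above, the following sketch evaluates the loss for a single preference pair given two scalar rewards. The function name is hypothetical and this is not the [`RewardTrainer`] implementation.

```python
import math


def centered_reward_loss(reward_chosen, reward_rejected, eta=0.01):
    """Pairwise loss plus eta * (r+ + r-)^2 centering penalty (sketch)."""
    # -log(sigmoid(x)) == log(1 + exp(-x))
    pairwise = math.log1p(math.exp(-(reward_chosen - reward_rejected)))
    centering = eta * (reward_chosen + reward_rejected) ** 2
    return pairwise + centering


# Both pairs have the same reward gap, but only the first is centered around zero.
centered = centered_reward_loss(1.0, -1.0)
shifted = centered_reward_loss(11.0, 9.0)
```

The pairwise term only depends on the reward difference, so it cannot tell these two cases apart; the auxiliary term is what penalizes the shifted rewards.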
### Llama 2: Open Foundation and Fine-Tuned Chat Models
**📜 Paper**: https://huggingface.co/papers/2307.09288
In this paper, the authors propose to leverage their preference ratings being decomposed as a scale of four points (e.g., _significantly better_) to provide more informative feedback to the reward model. This is done by adding a margin to the loss function, which encourages the reward model to assign larger gaps in scores for pairs with higher preference ratings.
$$
\mathcal{L}(\theta) = - \mathbb{E}_{(x,y^+,y^-,\textcolor{red}{m}) \sim \mathcal{D}} \left[ \log \sigma(r_\theta(x, y^+) - r_\theta(x, y^-) \textcolor{red}{- m}) \right].
$$
You can add a margin to the loss by adding a `margin` column to the dataset. The following example shows how to set up the "Margin Small" setting of the paper.
```python
def add_margin(example):
    preference_to_margin = {
        "significantly better": 1.0,
        "better": 2.0/3.0,
        "slightly better": 1.0/3.0,
        "negligibly better / unsure": 0.0,
    }
    return {"margin": preference_to_margin[example["preference_label"]]}

dataset = dataset.map(add_margin)
```
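The effect of the margin on the loss can be sketched for a single pair as follows. The function name is a hypothetical stand-in used only to illustrate the formula above.

```python
import math


def margin_loss(reward_chosen, reward_rejected, margin=0.0):
    """-log sigmoid(r+ - r- - m) for one preference pair (sketch)."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected - margin)))


# Same reward gap, evaluated without a margin and with the largest margin.
no_margin = margin_loss(2.0, 1.0, margin=0.0)
large_margin = margin_loss(2.0, 1.0, margin=1.0)
```

For the same reward gap, a larger margin yields a larger loss, which pushes the reward model to separate strongly preferred pairs further.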

@ -0,0 +1,144 @@
# Examples of using peft with trl to finetune 8-bit models with Low Rank Adaptation (LoRA)
The notebooks and scripts in these examples show how to use Low Rank Adaptation (LoRA) to fine-tune models in a memory-efficient manner. Most PEFT methods in the peft library are supported, but note that some, such as prompt tuning, are not.
For more information on LoRA, see the [original paper](https://huggingface.co/papers/2106.09685).
## Installation
Note: peft is in active development, so we install directly from their GitHub page.
Peft also relies on the latest version of transformers.
```bash
pip install trl[peft]
pip install bitsandbytes loralib
pip install git+https://github.com/huggingface/transformers.git@main
#optional: wandb
pip install wandb
```
Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
## How to use it?
Simply declare a [`~peft.PeftConfig`] object in your script and pass it through `.from_pretrained` to load the TRL+PEFT model.
```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead
model_id = "edbeeching/gpt-neo-125M-imdb"
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
model_id,
peft_config=lora_config,
)
```
And if you want to load your model in 8bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
load_in_8bit=True,
peft_config=lora_config,
)
```
... or in 4bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
load_in_4bit=True,
)
```
## Launch scripts
The `trl` library is powered by `accelerate`. As such it is best to configure and launch trainings with the following commands:
```bash
accelerate config # will prompt you to define the training configuration
accelerate launch examples/scripts/ppo.py --use_peft # launches training
```
## Using `trl` + `peft` and Data Parallelism
You can scale up to as many GPUs as you want, as long as you are able to fit the training process in a single device. The only tweak you need to apply is to load the model as follows:
```python
from peft import LoraConfig
...
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
)
```
And if you want to load your model in 8bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
load_in_8bit=True,
)
```
... or in 4bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
config.model_name,
peft_config=lora_config,
load_in_4bit=True,
)
```
Finally, make sure that the rewards are computed on the correct device as well; for that, you can use `ppo_trainer.model.current_device`.
## Naive pipeline parallelism (NPP) for large models (>60B models)
The `trl` library also supports naive pipeline parallelism (NPP) for large models (>60B models), a simple way to parallelize a model across multiple GPUs.
The model and the adapters are loaded across multiple GPUs, and the activations and gradients are naively communicated between them. This supports `int8` models as well as models in other `dtype`s.
![NPP](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl-npp.png)
### How to use NPP?
Simply load your model with a custom `device_map` argument on the `from_pretrained` to split your model across multiple devices. Check out this [nice tutorial](https://github.com/huggingface/blog/blob/main/accelerate-large-models.md) on how to properly create a `device_map` for your model.
Also make sure to have the `lm_head` module on the first GPU device, as it may throw an error otherwise. At the time of writing, you need to install the `main` branch of `accelerate`: `pip install git+https://github.com/huggingface/accelerate.git@main` and `peft`: `pip install git+https://github.com/huggingface/peft.git@main`.
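A `device_map` is a dictionary from module names to device indices. The sketch below builds one that spreads transformer blocks evenly over the GPUs while keeping `lm_head` on the first device. The helper and the module names (`transformer.wte`, `transformer.h.<i>`, `transformer.ln_f`) are hypothetical and depend on the model architecture, so inspect your model's module names before reusing it.

```python
def build_device_map(num_layers, num_gpus):
    """Spread hypothetical transformer blocks over GPUs, lm_head on GPU 0."""
    device_map = {
        "transformer.wte": 0,  # input embeddings (assumed module name)
        "lm_head": 0,          # kept on the first device, as recommended above
    }
    layers_per_gpu = -(-num_layers // num_gpus)  # ceiling division
    for layer in range(num_layers):
        device_map[f"transformer.h.{layer}"] = layer // layers_per_gpu
    device_map["transformer.ln_f"] = num_gpus - 1  # final norm (assumed name)
    return device_map


# Hypothetical 4-block model split across 2 GPUs.
device_map = build_device_map(num_layers=4, num_gpus=2)
```

The resulting dictionary would then be passed as the `device_map` argument of `from_pretrained`.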
### Launch scripts
Although the `trl` library is powered by `accelerate`, you should run your training script in a single process. Note that we do not support Data Parallelism together with NPP yet.
```bash
python PATH_TO_SCRIPT
```
## Fine-tuning Llama-2 model
You can easily fine-tune a Llama 2 model using `SFTTrainer` and the official script! For example, to fine-tune llama2-7b on the Guanaco dataset, run (tested on a single NVIDIA T4-16GB):
```bash
python trl/scripts/sft.py --output_dir sft_openassistant-guanaco --model_name meta-llama/Llama-2-7b-hf --dataset_name timdettmers/openassistant-guanaco --load_in_4bit --use_peft --per_device_train_batch_size 4 --gradient_accumulation_steps 2
```

docs/source/ppo_trainer.md Normal file

@ -0,0 +1,238 @@
# PPO Trainer
[![model badge](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo,trl)
TRL supports training LLMs with [Proximal Policy Optimization (PPO)](https://huggingface.co/papers/1707.06347).
References:
- [Fine-Tuning Language Models from Human Preferences](https://github.com/openai/lm-human-preferences)
- [Learning to Summarize from Human Feedback](https://github.com/openai/summarize-from-feedback)
- [The N Implementation Details of RLHF with PPO](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo)
- [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031)
## Get started
To check that the trainer runs, use the following command to train a PPO model with a dummy reward model.
```bash
python examples/scripts/ppo/ppo.py \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--dataset_train_split descriptiveness \
--learning_rate 3e-6 \
--num_ppo_epochs 1 \
--num_mini_batches 1 \
--output_dir models/minimal/ppo \
--per_device_train_batch_size 64 \
--gradient_accumulation_steps 1 \
--total_episodes 10000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path EleutherAI/pythia-1b-deduped \
--reward_model_path EleutherAI/pythia-1b-deduped \
--missing_eos_penalty 1.0
```
## Explanation of the logged metrics
The logged metrics are as follows (see this example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/dd2o3g35)):
- `eps`: Tracks the number of episodes per second.
- `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current policy and reference policy.
- `objective/entropy`: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.
- `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
- `objective/rlhf_reward`: The mean RLHF reward, which is `score - non_score_reward`.
- `objective/scores`: The mean scores returned by the reward model / environment.
- `policy/approxkl_avg`: The average approximate KL divergence between consecutive PPO policies. Note that this is not the same as `objective/kl`.
- `policy/clipfrac_avg`: The average fraction of policy updates that are clipped, indicating how often the policy updates are constrained to prevent large changes.
- `loss/policy_avg`: The average policy loss, indicating how well the policy is performing.
- `loss/value_avg`: The average value loss, indicating the difference between the predicted value and the actual reward.
- `val/clipfrac_avg`: The average fraction of value function updates that are clipped, similar to `policy/clipfrac_avg` but for the value function.
- `policy/entropy_avg`: The average entropy of the policy during training, indicating how diverse the policy's actions are.
- `val/ratio`: The mean ratio of the current policy probability to the old policy probability, providing a measure of how much the policy has changed.
- `val/ratio_var`: The variance of the `val/ratio`, indicating the variability in policy changes.
- `val/num_eos_tokens`: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.
- `lr`: The current learning rate used by the optimizer.
- `episode`: The current episode count in the training process.
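The relationship between `objective/scores`, `objective/non_score_reward`, and `objective/rlhf_reward` described in the list above can be sketched for a single completion as follows. The function name and the example values are hypothetical, and the sign conventions follow the descriptions in this list rather than the trainer's source code.

```python
def rlhf_reward_terms(score, per_token_kl, beta):
    """Return (rlhf_reward, non_score_reward) as described above (sketch)."""
    # KL penalty summed over the completion's tokens, scaled by beta.
    non_score_reward = beta * sum(per_token_kl)
    rlhf_reward = score - non_score_reward
    return rlhf_reward, non_score_reward


# Hypothetical reward-model score and per-token KL estimates.
rlhf_reward, non_score_reward = rlhf_reward_terms(
    score=2.0, per_token_kl=[0.1, 0.2, 0.3], beta=0.5
)
```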
## Cookbook
- Debugging TIP: `objective/rlhf_reward`: this is the ultimate objective of the RLHF training. If training works as intended, this metric should keep going up.
- Debugging TIP: `val/ratio`: this number should float around 1.0, and it gets clipped by `--cliprange 0.2` with PPO's surrogate loss. If this `ratio` is too high (like 2.0 or 1000.0) or too small (like 0.1), the updates between consecutive policies are too drastic. You should try to understand why this is happening and fix it.
- Memory TIP: If you are running out of memory, you can try to reduce the `--per_device_train_batch_size` or increase the `--gradient_accumulation_steps` to reduce the memory footprint.
- Memory TIP: If you have multiple GPUs, you can also run training with DeepSpeed stage 3 to reduce the memory footprint `accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml`.
- Usage TIP: We recommend using the "EOS trick" via `--missing_eos_penalty`, which subtracts a static scalar penalty from the score of completions that do not end with an EOS token. This can help the model learn to generate more coherent completions.
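The "EOS trick" from the last tip can be sketched as below. The snippet assumes unpadded completions given as lists of token IDs; the function name and the token IDs are hypothetical, and this is not the trainer's implementation.

```python
def apply_missing_eos_penalty(scores, completions, eos_token_id, penalty=1.0):
    """Subtract a fixed penalty from completions that do not end with EOS (sketch)."""
    adjusted = []
    for score, tokens in zip(scores, completions):
        ends_with_eos = len(tokens) > 0 and tokens[-1] == eos_token_id
        adjusted.append(score if ends_with_eos else score - penalty)
    return adjusted


# Token ID 0 stands in for EOS; the second completion was cut off before EOS.
adjusted_scores = apply_missing_eos_penalty(
    scores=[3.5, 3.5],
    completions=[[12, 7, 0], [12, 7, 9]],
    eos_token_id=0,
    penalty=1.0,
)
```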
## What is my model doing exactly?
To help you understand what your model is doing, we periodically log sample completions from the model. In this example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/dd2o3g35), the completions look like the following, letting you see the model's responses at different stages of training. By default, we generate `--num_sample_generations 10` during training, but you can customize the number of generations.
![ppov2_completions](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/ppov2_completions.gif)
In the logs, the sampled generations look like:
```txt
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ query ┃ model response ┃ score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ SUBREDDIT: r/AskReddit │ I'm in love with a friend, and │ 3.921875 │
│ │ I don't know how to get rid of │ │
│ TITLE: How do you get someone │ those feelings. I'm │ │
│ out of your head? │ desperate.<|endoftext|>[PAD][P… │ │
│ │ │ │
│ POST: Hi, │ │ │
│ I'm 22, and I have been with my │ │ │
│ girlfriend for 5 years now. We │ │ │
│ recently moved together. We've │ │ │
│ always loved each other │ │ │
│ intensely. │ │ │
│ │ │ │
│ Problem, I recently started to │ │ │
│ have feelings for an other │ │ │
│ person (a friend). This person │ │ │
│ has had a boyfriend for now 3 │ │ │
│ years, and has absolutely no │ │ │
│ ideas. Those feelings were so │ │ │
│ strong, it was hard to hide │ │ │
│ them. After 2 months of me │ │ │
│ being distant and really sad, │ │ │
│ my girlfriend forced me to say │ │ │
│ what was bothering me. I'm not │ │ │
│ a good liar, and now she knows. │ │ │
│ │ │ │
│ We decided to give us a week │ │ │
│ alone, I went to my parents. │ │ │
│ │ │ │
│ Now, I'm completely lost. I │ │ │
│ keep on thinking about this │ │ │
│ person, and I hate that. I │ │ │
│ would like for those feelings │ │ │
│ to go away, to leave me alone. │ │ │
│ But I can't. │ │ │
│ │ │ │
│ What do I do? It's been 3 │ │ │
│ months now, and I'm just │ │ │
│ desperate. │ │ │
│ │ │ │
│ TL;DR: │ │ │
├─────────────────────────────────┼─────────────────────────────────┼──────────┤
│ SUBREDDIT: r/pettyrevenge │ My mom woke me up with a loud │ 6.84375 │
│ │ TV. I blasted Gangnam Style on │ │
│ TITLE: So, my mom woke me up │ repeat, with the bass cranked │ │
│ with a loud TV. │ up as high as it could │ │
│ │ go.<|endoftext|>[PAD][PAD][PAD… │ │
│ POST: She was in her living │ │ │
│ room, watching TV. This was at │ │ │
│ about 8:30 in the morning, and │ │ │
│ she was exercising. She turned │ │ │
│ the TV up extra loud to hear it │ │ │
│ over her excercycle, and woke │ │ │
│ me up. I went in there asking │ │ │
│ for her to turn it down. She │ │ │
│ said she didn't have to; I │ │ │
│ explained that I always used │ │ │
│ headphones so she didn't have │ │ │
│ to deal with my noise and that │ │ │
│ she should give me a little │ │ │
│ more respect, given that I paid │ │ │
│ rent at the time. │ │ │
│ │ │ │
│ She disagreed. I went back to │ │ │
│ my room, rather pissed off at │ │ │
│ the lack of equality. I had no │ │ │
│ lock on my door; but I had a │ │ │
│ dresser right next to it, so I │ │ │
│ pulled one of the drawers out │ │ │
│ enough so that it caused the │ │ │
│ door to not be openable. Then, │ │ │
│ I turned my speakers up really │ │ │
│ loud and blasted Gangnam Style │ │ │
│ on repeat, with the bass │ │ │
│ cranked up as high as it could │ │ │
│ go. │ │ │
│ │ │ │
│ If you hate Gangnam Style for │ │ │
│ being overplayed, you will see │ │ │
│ why I chose that particular │ │ │
│ song. I personally don't mind │ │ │
│ it. But here's the thing about │ │ │
│ my bass; it vibrates the walls, │ │ │
│ making one hell of a lot of │ │ │
│ noise. Needless to say, my mom │ │ │
│ was not pleased and shut off │ │ │
│ the internet. But it was oh so │ │ │
│ worth it. │ │ │
│ │ │ │
│ TL;DR: │ │ │
└─────────────────────────────────┴─────────────────────────────────┴──────────┘
```
## Implementation details
This PPO implementation is based on [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
## Benchmark experiments
To validate that the PPO implementation works, we ran an experiment on the 1B model. Here is the command we used to run the experiment. We take the SFT / RM models directly from [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
```shell
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/ppo/ppo_tldr.py \
--output_dir models/minimal/ppo_tldr \
--learning_rate 3e-6 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--total_episodes 1000000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--local_rollout_forward_batch_size 16 \
--missing_eos_penalty 1.0 \
--stop_token eos
```
Checkpoints and experiment tracking are available at:
- [🤗 Model checkpoint](https://huggingface.co/vwxyzjn/ppo_tldr)
- [🐝 Tracked experiment](https://wandb.ai/huggingface/trl/runs/dd2o3g35)
To evaluate, we use [vLLM](https://github.com/vllm-project/vllm) to load the checkpoints and GPT-4o mini as a judge model to evaluate the generated TL;DR against the reference TL;DR.
For more information on how to use judges, see [Judges](judges).
```bash
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 33.00%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path vwxyzjn/ppo_tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 64.70%
```
The PPO checkpoint gets a 64.7% win rate vs. the 33.0% win rate of the SFT checkpoint. This is a good sign that the PPO training is working as intended.
Metrics:
![PPO v2](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/ppov2.png)
```bash
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
# to use it, change `?we=huggingface&wpn=trl` to your own project and `?tag=pr-1540` to your own tag
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=train/episode&ceik=output_dir&cen=sft_model_path&metrics=train/objective/rlhf_reward&metrics=train/objective/scores&metrics=train/objective/kl&metrics=train/objective/non_score_reward&metrics=train/objective/entropy&metrics=train/policy/approxkl_avg&metrics=train/policy/clipfrac_avg&metrics=train/loss/policy_avg&metrics=train/loss/value_avg&metrics=train/val/clipfrac_avg&metrics=train/policy/entropy_avg&metrics=train/val/ratio&metrics=train/val/ratio_var&metrics=train/val/num_eos_tokens&metrics=train/lr&metrics=train/eps' \
"cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr?tag=pr-1540" \
--env-ids models/minimal/ppo_tldr \
--pc.ncols 4 \
--pc.ncols-legend 1 \
--pc.xlabel "Episode" \
--output-filename benchmark/trl/pr-1540/ppo \
--scan-history
```
## PPOTrainer
[[autodoc]] PPOTrainer
- train
- save_model
- push_to_hub
## PPOConfig
[[autodoc]] PPOConfig

122
docs/source/prm_trainer.md Normal file

@ -0,0 +1,122 @@
# PRM Trainer
[![model badge](https://img.shields.io/badge/All_models-PRM-blue)](https://huggingface.co/models?other=prm,trl)
> [!WARNING]
> PRM Trainer is an experimental API which is subject to change at any time.
## Overview
Process-supervised Reward Models (PRM) were proposed in [Solving math word problems with process- and outcome-based feedback](https://huggingface.co/papers/2211.14275) by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins.
The abstract from the paper is the following:
> Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% → 12.7% final-answer error and 14.0% → 3.4% reasoning error among final-answer-correct solutions.
This post-training method was contributed by [Gaetan Lopez](https://github.com/gaetanlop), [Lewis Tunstall](https://huggingface.co/lewtun), [Quentin Gallouédec](https://huggingface.co/qgallouedec) and [Agustín Piqueres](https://huggingface.co/plaguss).
## Quick start
This example demonstrates how to train a model using the PRM method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B) as the base model. We use the stepwise supervision data from the [Math Shepherd dataset](https://huggingface.co/datasets/trl-lib/math_shepherd). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/math_shepherd/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_prm.py
from datasets import load_dataset
from trl import PRMConfig, PRMTrainer
from transformers import AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
train_dataset = load_dataset("trl-lib/math_shepherd", split="train[:10%]")
training_args = PRMConfig(output_dir="Qwen2-0.5B-Reward-Math-Sheperd")
trainer = PRMTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_prm.py
```
Distributed across 8 GPUs, the training takes approximately 1 hour.
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward-Math-Sheperd) performs, you can use the following script.
```python
from datasets import load_dataset
from transformers import pipeline
pipe = pipeline("token-classification", model="trl-lib/Qwen2-0.5B-Reward-Math-Sheperd")
dataset = load_dataset("trl-lib/math_shepherd")
example = {
    "prompt": "Musa is the class teacher of a class of 45 students. He wants to split them into three groups by age. If a third of the class is under 11 years, and two-fifths are above 11 but under 13, how many students will be in the third group (13 years and above)?",
    "completions": [
        "Step 1: A third of the class is under 11 years because 11 - 1/3 = <<11-1/3=7>>7.",
        "Step 2: Two-fifths of the class are above 11 but under 13 because 2/5 * 11 = <<2/5*11=8>>8.",
        "Step 3: There are 45 students, so the third group will have 45 - 7 - 8 = <<45-7-8=20>>20 students. The answer is: 20",
    ],
    "labels": [True, False, False],
}
separator = "\n" # It's important to use the same separator as the one used during training
for idx in range(1, len(example["completions"]) + 1):
    steps = example["completions"][0:idx]
    text = separator.join((example["prompt"], *steps)) + separator  # Add a separator between the prompt and each step
    pred_entity = pipe(text)[-1]["entity"]
    pred = {"LABEL_0": False, "LABEL_1": True}[pred_entity]
    label = example["labels"][idx - 1]
    print(f"Step {idx}\tPredicted: {pred} \tLabel: {label}")
```
```text
Step 1 Predicted: True Label: True
Step 2 Predicted: False Label: False
Step 3 Predicted: False Label: False
```
It's a win!
## Expected dataset type
PRM requires a [stepwise supervision](dataset_formats#stepwise-supervision) dataset.
The dataset should contain the following columns: `prompt`, `completions` and `labels`, where `completions` contains a list of reasoning steps and `labels` a list of booleans or floats indicating the correctness of each step.
The [`PRMTrainer`] only supports the [standard](dataset_formats#standard) dataset format.
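As an illustration of this format, the sketch below builds one stepwise-supervision record and checks that it has exactly one label per reasoning step. The helper `check_stepwise_example` and the sample record are hypothetical and only show the expected shape; they are not part of TRL.

```python
def check_stepwise_example(example):
    """Validate the shape of a stepwise-supervision record (hypothetical helper)."""
    for column in ("prompt", "completions", "labels"):
        if column not in example:
            raise ValueError(f"missing column: {column}")
    if len(example["completions"]) != len(example["labels"]):
        raise ValueError("each reasoning step needs exactly one label")
    # Labels mark each step as correct or not, as booleans or floats.
    if not all(isinstance(label, (bool, float)) for label in example["labels"]):
        raise ValueError("labels must be booleans or floats")
    return True


example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
        "Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8.",
    ],
    "labels": [True, False],
}

is_valid = check_stepwise_example(example)
print(is_valid)
```

Running a check like this over a few rows before training can catch records where steps and labels have drifted out of alignment.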
## Example script
We provide an example script to train a model using the PRM method. The script is available in [`examples/scripts/prm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/prm.py).
To use the PRM script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B) on the [Math Shepherd dataset](https://huggingface.co/datasets/trl-lib/math_shepherd), run the following command:
```bash
accelerate launch examples/scripts/prm.py \
--model_name_or_path Qwen/Qwen2-0.5B \
--dataset_name trl-lib/math_shepherd \
--num_train_epochs 1 \
--output_dir Qwen2-0.5B-Reward-Math-Sheperd
```
## PRMTrainer
[[autodoc]] PRMTrainer
- train
- save_model
- push_to_hub
## PRMConfig
[[autodoc]] PRMConfig

143
docs/source/quickstart.md Normal file

@ -0,0 +1,143 @@
# Quickstart
TRL is a comprehensive library for post-training foundation models using techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).
## Quick Examples
Get started instantly with TRL's most popular trainers. Each example uses compact models for quick experimentation.
### Supervised Fine-Tuning
```python
from trl import SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```
### Group Relative Policy Optimization
```python
from trl import GRPOTrainer
from datasets import load_dataset
# Define a simple reward function (count unique chars as example)
def reward_function(completions, **kwargs):
    return [len(set(completion.lower())) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Start from SFT model
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
    reward_funcs=reward_function,
)
trainer.train()
```
### Direct Preference Optimization
```python
from trl import DPOTrainer
from datasets import load_dataset
trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use your SFT model
    ref_model="Qwen/Qwen2.5-0.5B-Instruct",  # Original base model
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()
```
### Reward Modeling
```python
from trl import RewardTrainer
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()
```
## Command Line Interface
Skip the code entirely - train directly from your terminal:
```bash
# SFT: Fine-tune on instructions
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/Capybara
# DPO: Align with preferences
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized
# Reward: Train a reward model
trl reward --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized
```
## What's Next?
### 📚 Learn More
- [SFT Trainer](sft_trainer) - Complete SFT guide
- [DPO Trainer](dpo_trainer) - Preference alignment
- [GRPO Trainer](grpo_trainer) - Group relative policy optimization
### 🚀 Scale Up
- [Distributed Training](distributing_training) - Multi-GPU setups
- [Memory Optimization](reducing_memory_usage) - Efficient training
- [PEFT Integration](peft_integration) - LoRA and QLoRA
### 💡 Examples
- [Example Scripts](https://github.com/huggingface/trl/tree/main/examples) - Production-ready code
- [Community Tutorials](community_tutorials) - External guides
## Troubleshooting
### Out of Memory?
Reduce batch size and enable optimizations:
<hfoptions id="batch_size">
<hfoption id="SFT">
```python
training_args = SFTConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
)
```
</hfoption>
<hfoption id="DPO">
```python
training_args = DPOConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
)
```
</hfoption>
</hfoptions>
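The "maintain effective batch size" comment refers to simple arithmetic: the number of samples contributing to each optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of devices. The sketch below spells this out; the helper name and the default of one device are assumptions for illustration.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to each optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Two configurations with the same effective batch size. The second one runs
# a single sample per forward/backward pass, which lowers peak activation memory.
large_batch = effective_batch_size(per_device_train_batch_size=8, gradient_accumulation_steps=1)
small_batch = effective_batch_size(per_device_train_batch_size=1, gradient_accumulation_steps=8)

print(large_batch, small_batch)
```

Trading batch size for accumulation steps keeps the optimization behavior roughly comparable while reducing memory, usually at the cost of some throughput.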
### Loss not decreasing?
Try adjusting the learning rate:
```python
training_args = SFTConfig(learning_rate=2e-5) # Good starting point
```
For more help, open an [issue on GitHub](https://github.com/huggingface/trl/issues).


@ -1,78 +0,0 @@
# Quickstart
## How does it work?
Fine-tuning a language model via PPO consists of roughly three steps:
1. **Rollout**: The language model generates a response or continuation based on a query, which could be the start of a sentence.
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair. The optimization will aim at maximizing this value.
3. **Optimization**: This is the most complex part. In the optimization step, the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model being trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
The full process is illustrated in the following figure:
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_overview.png"/>
## Minimal example
The following code illustrates the steps above.
```python
# 0. imports
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch
# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
# 3. encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
# 4. generate model response
response_tensor = respond_to_batch(model, query_tensor)
response_txt = tokenizer.decode(response_tensor[0,:])
# 5. define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0)]
# 6. train model with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```
In general, you would run steps 3-6 in a for-loop over many diverse queries. You can find more realistic examples in the examples section.
## How to use a trained model
After training an `AutoModelForCausalLMWithValueHead`, you can directly use the model in `transformers`.
```python
# .. Let's assume we have a trained model using `PPOTrainer` and `AutoModelForCausalLMWithValueHead`
# push the model on the Hub
model.push_to_hub("my-fine-tuned-model-ppo")
# or save it locally
model.save_pretrained("my-fine-tuned-model-ppo")
# load the model from the Hub
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("my-fine-tuned-model-ppo")
```
You can also load your model with `AutoModelForCausalLMWithValueHead` if you want to use the value head, for example to continue training.
```python
from trl import AutoModelForCausalLMWithValueHead
model = AutoModelForCausalLMWithValueHead.from_pretrained("my-fine-tuned-model-ppo")
```


@ -0,0 +1,261 @@
# Reducing Memory Usage
> [!WARNING]
> Section under construction. Feel free to contribute!
## Truncation
Sequence lengths in the dataset can vary widely. When data is batched, sequences are padded to match the longest one in the batch, which can cause high memory usage, even if most sequences are relatively short.
![Truncation prompt-completion](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/why_you_should_truncate.png)
To reduce memory usage, it's important to truncate sequences to a reasonable length. While TRL trainers truncate sequences by default, you may want to adjust the default truncation length to better align with your specific use case.
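To see why a single long sequence inflates memory, the sketch below counts real and padding tokens in one batch, with and without truncation. The helper `padding_stats` and the example lengths are made up for illustration.

```python
def padding_stats(lengths, max_length=None):
    """Return (real_tokens, padding_tokens) for a batch padded to its longest sequence."""
    if max_length is not None:
        # Truncate every sequence to at most `max_length` tokens.
        lengths = [min(length, max_length) for length in lengths]
    longest = max(lengths)
    real_tokens = sum(lengths)
    padding_tokens = longest * len(lengths) - real_tokens
    return real_tokens, padding_tokens


batch_lengths = [120, 95, 140, 2048]  # one outlier dominates the batch

untruncated = padding_stats(batch_lengths)
truncated = padding_stats(batch_lengths, max_length=512)

print("without truncation:", untruncated)
print("with max_length=512:", truncated)
```

In this toy batch, most of the tensor is padding until the outlier is truncated, which is the effect the figure above illustrates.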
<hfoptions id="truncation">
<hfoption id="DPO">
DPO truncation is applied first to the prompt and to the completion via the `max_prompt_length` and `max_completion_length` parameters. The `max_length` parameter is then used to truncate the resulting sequence.
![DPO truncation](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/truncation_prompt_completion.png)
To set the truncation parameters, use the following code snippet:
```python
from trl import DPOConfig
training_args = DPOConfig(..., max_prompt_length=..., max_length=...)
```
You can also use the `max_completion_length` parameter to truncate the completion, though this is less common since the goal is typically to preserve the completion's full length whenever possible.
```python
from trl import DPOConfig
training_args = DPOConfig(..., max_completion_length=...)
```
</hfoption>
<hfoption id="SFT">
SFT truncation is applied to the input sequence via the `max_length` parameter.
![Truncation input ids](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/truncation_input_ids.png)
To set the truncation parameter, use the following code snippet:
```python
from trl import SFTConfig
training_args = SFTConfig(..., max_length=...)
```
</hfoption>
</hfoptions>
### How to choose the `max_length` value?
If `max_length` is too small, a significant portion of your tokens will be discarded and won't contribute to training. If it's too large, memory usage can spike, potentially leading to OOM (Out-Of-Memory) errors. Without packing or padding-free, a large `max_length` may also result in inefficient training, as many tokens will be padding.
To help you choose an appropriate value, we provide a utility to visualize the sequence length distribution in your dataset.
<iframe src="https://trl-lib-dataset-length-profiler.hf.space" frameborder="0" width="100%" height="1000"></iframe>
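If you prefer to inspect lengths offline, a rough alternative is to pick `max_length` from a percentile of the tokenized lengths and then check how many tokens that choice discards. The sketch below assumes you already have a list of sequence lengths; the helper names and sample values are hypothetical.

```python
import math


def suggest_max_length(lengths, coverage_percent=95):
    """Smallest length such that at least `coverage_percent` of sequences fit untruncated."""
    ordered = sorted(lengths)
    index = math.ceil(coverage_percent * len(ordered) / 100) - 1
    return ordered[max(index, 0)]


def discarded_fraction(lengths, max_length):
    """Fraction of all tokens that truncation to `max_length` would drop."""
    total = sum(lengths)
    kept = sum(min(length, max_length) for length in lengths)
    return (total - kept) / total


lengths = [64, 80, 96, 128, 150, 200, 256, 300, 512, 4096]

candidate = suggest_max_length(lengths, coverage_percent=90)
dropped = discarded_fraction(lengths, candidate)

print(f"max_length={candidate}, tokens discarded={dropped:.1%}")
```

Note that covering 90% of the sequences is not the same as keeping 90% of the tokens: a few very long sequences can account for a large share of the total, so it is worth checking both numbers.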
## Packing
> [!TIP]
> This technique applies only to SFT.
[Truncation](#truncation) has several drawbacks:
1. **Loss of information**: Key data at the end of a sequence may be discarded.
2. **Choosing truncation length**: Too short loses data; too long undermines efficiency.
Packing, introduced in [Raffel et al., 2020](https://huggingface.co/papers/1910.10683), addresses these issues by grouping sequences instead of truncating. It concatenates and splits dataset sequences into the desired lengths.
![Packing](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/packing_2.png)
Packing reduces padding by merging several sequences in one row when possible. We use an advanced method to be near-optimal in the way we pack the dataset. To enable packing, use `packing=True` in the [`SFTConfig`].
> [!TIP]
> In TRL 0.18 and earlier, packing used a more aggressive method that reduced padding to almost nothing, but had the downside of breaking sequence continuity for a large fraction of the dataset. To revert to this strategy, use `packing_strategy="wrapped"` in [`SFTConfig`].
```python
from trl import SFTConfig
training_args = SFTConfig(..., packing=True, max_length=512)
```
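For intuition on how sequences can be grouped with little padding, the sketch below packs sequence lengths into rows of capacity `max_length` using a best-fit-decreasing heuristic. This is an illustrative bin-packing sketch, not TRL's implementation; `pack_lengths` is a hypothetical helper, and it assumes sequences longer than `max_length` were truncated or split beforehand.

```python
def pack_lengths(lengths, max_length):
    """Group sequence lengths into rows of at most `max_length` tokens (best-fit decreasing)."""
    rows = []  # each row is a list of sequence lengths
    for length in sorted(lengths, reverse=True):
        if length > max_length:
            raise ValueError("truncate or split sequences longer than max_length first")
        # Pick the row with the least remaining space that still fits this sequence.
        best = None
        for row in rows:
            free = max_length - sum(row)
            if length <= free and (best is None or free < max_length - sum(best)):
                best = row
        if best is None:
            rows.append([length])
        else:
            best.append(length)
    return rows


lengths = [300, 200, 500, 100, 400, 50]
packed = pack_lengths(lengths, max_length=512)
padding = sum(512 - sum(row) for row in packed)

print(packed)
print("padding tokens:", padding)
```

Every sequence stays whole in this scheme, at the price of a little leftover padding per row.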
> [!WARNING]
> Packing may cause batch contamination, where adjacent sequences influence one another. This can be problematic for some applications. For more details, see [#1230](https://github.com/huggingface/trl/issues/1230).
## Liger for reducing peak memory usage
> [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduces memory usage by 60%.
For more information, see [Liger Kernel Integration](liger_kernel_integration).
<hfoptions id="liger">
<hfoption id="DPO">
To use Liger for reducing peak memory usage, use the following code snippet:
```python
from trl import DPOConfig
training_args = DPOConfig(..., use_liger_loss=True)
```
</hfoption>
<hfoption id="GRPO">
To use Liger for reducing peak memory usage, use the following code snippet:
```python
from trl import GRPOConfig
training_args = GRPOConfig(..., use_liger_loss=True)
```
</hfoption>
<hfoption id="KTO">
To use Liger for reducing peak memory usage, use the following code snippet:
```python
from trl import KTOConfig
training_args = KTOConfig(..., use_liger_loss=True)
```
</hfoption>
</hfoptions>
## Padding-free
Padding-free batching is an alternative approach for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.
![Padding-free](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/padding-free.png)
> [!WARNING]
> It's highly recommended to use padding-free batching with **FlashAttention 2** or **FlashAttention 3**. Otherwise, you may encounter batch contamination issues.
<hfoptions id="padding-free">
<hfoption id="DPO">
```python
from trl import DPOConfig
training_args = DPOConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})
```
</hfoption>
<hfoption id="SFT">
```python
from trl import SFTConfig
training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})
```
</hfoption>
</hfoptions>
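Conceptually, padding-free batching concatenates the samples of a batch and keeps track of where each one starts. The sketch below illustrates that idea with plain lists; the function `flatten_batch` and the names `position_ids` and `cu_seq_lens` are used here for illustration and should not be read as TRL's exact internals.

```python
def flatten_batch(sequences):
    """Concatenate sequences and record position ids and cumulative boundaries."""
    input_ids, position_ids, cu_seq_lens = [], [], [0]
    for sequence in sequences:
        input_ids.extend(sequence)
        # Positions restart at 0 for every sequence.
        position_ids.extend(range(len(sequence)))
        # Cumulative lengths mark where each sequence ends in the flat tensor.
        cu_seq_lens.append(cu_seq_lens[-1] + len(sequence))
    return input_ids, position_ids, cu_seq_lens


batch = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
input_ids, position_ids, cu_seq_lens = flatten_batch(batch)

print(input_ids)
print(position_ids)
print(cu_seq_lens)
```

An attention kernel that understands these boundaries can keep each sample from attending to its neighbors, which is why a FlashAttention backend is recommended above.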
## Activation offloading
Activation offloading is a memory efficiency technique that reduces GPU VRAM usage by temporarily moving activation tensors to CPU RAM during the forward pass and bringing them back only when needed for the backward pass. This significantly reduces peak memory usage at the cost of slightly increased training time.
To enable activation offloading in your SFT training configuration:
```python
from trl import SFTConfig
training_args = SFTConfig(..., activation_offloading=True)
```
> [!WARNING]
> When using activation offloading with models that use Liger kernels, you must disable Liger cross entropy due to compatibility issues. The issue occurs specifically with `use_liger_kernel=True` because Liger cross entropy performs in-place operations which conflict with activation offloading. The default setting (`use_liger_kernel=False`) works:
>
> ```python
> # When using activation offloading with a model that uses Liger kernels:
> from trl import SFTConfig
>
> training_args = SFTConfig(
>     activation_offloading=True,
>     use_liger_kernel=False,  # Disable Liger cross entropy
>     # Other parameters...
> )
> ```
Under the hood, activation offloading uses PyTorch's [`saved_tensors_hooks`](https://pytorch.org/tutorials/intermediate/autograd_saved_tensors_hooks_tutorial.html#hooks-for-autograd-saved-tensors) to intercept activations during the forward pass. It decides which tensors to offload based on size and context, and it avoids offloading output tensors, which would be inefficient. To improve performance, it can optionally use CUDA streams to overlap computation with CPU-GPU transfers.
## Disabling model gathering for generation in online methods
When using DeepSpeed ZeRO-3, model weights are sharded across multiple GPUs. Online methods involve generating completions from the model as part of the training process. During this step, the model weights are temporarily gathered on a single GPU for generation. For very large models, this gathering can lead to out-of-memory (OOM) errors, as described in this issue: [#2250](https://github.com/huggingface/trl/issues/2250#issue-2598304204).
If you encounter this issue, you can disable the gathering of model weights for generation by setting the following parameter:
<hfoptions id="ds3_gather_for_generation">
<hfoption id="GRPO">
```python
from trl import GRPOConfig
training_args = GRPOConfig(..., ds3_gather_for_generation=False)
```
</hfoption>
<hfoption id="Online DPO">
```python
from trl import OnlineDPOConfig
training_args = OnlineDPOConfig(..., ds3_gather_for_generation=False)
```
</hfoption>
<hfoption id="PPO">
```python
from trl import PPOConfig
training_args = PPOConfig(..., ds3_gather_for_generation=False)
```
</hfoption>
<hfoption id="RLOO">
```python
from trl import RLOOConfig
training_args = RLOOConfig(..., ds3_gather_for_generation=False)
```
</hfoption>
</hfoptions>
This adjustment prevents model weights from being gathered, avoiding OOM errors, but it may result in slower generation speeds.
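A back-of-the-envelope estimate shows why gathering can fail for large models. The sketch below compares the per-GPU memory needed for the weights alone when they are sharded versus fully gathered. The 70B parameter count, 2 bytes per parameter, and 8 GPUs are assumptions for illustration, and the estimate ignores optimizer states, activations, and the generation cache.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter=2, num_gpus=1, gathered=False):
    """Approximate per-GPU memory for the model weights alone, in GiB."""
    total_bytes = num_parameters * bytes_per_parameter
    # Sharded: each GPU holds 1/num_gpus of the weights. Gathered: a full copy.
    per_gpu_bytes = total_bytes if gathered else total_bytes / num_gpus
    return per_gpu_bytes / 2**30


sharded = weight_memory_gib(70e9, num_gpus=8)
gathered = weight_memory_gib(70e9, num_gpus=8, gathered=True)

print(f"sharded: {sharded:.1f} GiB per GPU, gathered: {gathered:.1f} GiB per GPU")
```

Under these assumptions, the gathered weights alone exceed the memory of a typical 80 GB accelerator, while the sharded weights fit comfortably.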
## vLLM sleep mode
When using vLLM as the generation backend, you can enable _sleep mode_ to offload vLLM parameters and cache to CPU RAM during the optimization step and reload them back to GPU VRAM when needed for weight synchronization and generation.
<hfoptions id="vllm_sleep">
<hfoption id="GRPO">
```python
from trl import GRPOConfig
training_args = GRPOConfig(..., vllm_enable_sleep_mode=True)
```
</hfoption>
<hfoption id="RLOO">
```python
from trl import RLOOConfig
training_args = RLOOConfig(..., vllm_enable_sleep_mode=True)
```
</hfoption>
</hfoptions>


@ -0,0 +1,238 @@
# Reward Modeling
[![model badge](https://img.shields.io/badge/All_models-Reward_Trainer-blue)](https://huggingface.co/models?other=reward-trainer,trl)
## Overview
TRL supports the Outcome-supervised Reward Modeling (ORM) Trainer for training reward models.
This post-training method was contributed by [Younes Belkada](https://huggingface.co/ybelkada).
## Quick start
This example demonstrates how to train a reward model using the [`RewardTrainer`] from TRL. We train a [Qwen 3 0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) model on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), a large-scale, fine-grained, diverse preference dataset.
```python
from trl import RewardTrainer
from datasets import load_dataset
trainer = RewardTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()
```
<iframe src="https://trl-lib-trackio.hf.space/?project=trl-documentation&metrics=train*&sidebar=hidden&runs=reward_qwen3-0.6B_ultrafeedback2" style="width: 100%; min-width: 300px; max-width: 800px;" height="830" frameBorder="0"></iframe>
## Expected dataset type and format
[`RewardTrainer`] supports the [preference](dataset_formats#preference) dataset type (both implicit and explicit prompt). The [`RewardTrainer`] is compatible with both [standard](dataset_formats#standard) and [conversational](dataset_formats#conversational) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
```python
# Standard preference (implicit prompt)
{"chosen": "The sky is blue.",
 "rejected": "The sky is green."}

# Conversational preference (implicit prompt)
{"chosen": [{"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is green."}]}

# Standard preference (explicit prompt)
{"prompt": "The sky is",
 "chosen": " blue.",
 "rejected": " green."}

# Conversational preference (explicit prompt)
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "chosen": [{"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "assistant", "content": "It is green."}]}
```
If your dataset is not in one of these formats, you can preprocess it to convert it into the expected format. Here is an example with the [lmarena-ai/arena-human-preference-55k](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k) dataset:
```python
from datasets import load_dataset
import json
dataset = load_dataset("lmarena-ai/arena-human-preference-55k")
# Filter out ties
dataset = dataset.filter(lambda example: example["winner_tie"] == 0)
# Create 'chosen' and 'rejected' fields based on the winner column
def response_a_b_to_chosen_rejected(example):
    if example["winner_model_a"] == 1:
        example["chosen"] = example["response_a"]
        example["rejected"] = example["response_b"]
    else:
        example["chosen"] = example["response_b"]
        example["rejected"] = example["response_a"]
    return example

dataset = dataset.map(response_a_b_to_chosen_rejected)

# Convert to conversational format
def make_conversation(example):
    prompt = json.loads(example["prompt"])[0]  # '["What color is the sky?"]' -> "What color is the sky?"
    chosen = json.loads(example["chosen"])[0]
    rejected = json.loads(example["rejected"])[0]
    return {
        "chosen": [{"role": "user", "content": prompt}, {"role": "assistant", "content": chosen}],
        "rejected": [{"role": "user", "content": prompt}, {"role": "assistant", "content": rejected}],
    }
dataset = dataset.map(make_conversation)
# Keep only necessary columns
dataset = dataset.select_columns(["chosen", "rejected"])
print(next(iter(dataset["train"])))
```
```json
{
"chosen": [
{"role": "user", "content": "Is it morally right to try to have a certain percentage of females on managerial positions?"},
{"role": "assistant", "content": "The question of whether it is morally right to aim for a certain percentage of females..."},
],
"rejected": [
{"role": "user", "content": "Is it morally right to try to have a certain percentage of females on managerial positions?"},
{"role": "assistant", "content": "As an AI, I don't have personal beliefs or opinions. However, ..."},
],
}
```
## Looking deeper into the training method
Reward Models (RMs) are typically trained using supervised learning on datasets containing pairs of preferred and non-preferred responses. The goal is to learn a function that assigns higher scores to preferred responses, enabling the model to rank outputs based on preferences.
This section breaks down how reward modeling works in practice, covering the key steps: **preprocessing** and **loss computation**.
### Preprocessing and tokenization
During training, each example is expected to contain a **chosen** and **rejected** field. For more details on the expected formats, see [Dataset formats - Preference](dataset_formats#preference).
The [`RewardTrainer`] tokenizes each input using the model's tokenizer. If prompts and completions (chosen and rejected) are provided separately (explicit prompt case), they are concatenated before tokenization.
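The concatenation step for the explicit prompt case can be sketched as follows, assuming plain Python dictionaries shaped like the examples in the dataset format section. The helper name `to_implicit_prompt` is hypothetical and not part of TRL; the trainer's actual preprocessing additionally applies the chat template and tokenizes the result.

```python
def to_implicit_prompt(example):
    """Prepend the prompt to both completions (hypothetical helper, not a TRL API)."""
    if "prompt" not in example:
        return example  # already in implicit-prompt form
    # `+` concatenates strings (standard format) and lists of messages (conversational format)
    return {
        "chosen": example["prompt"] + example["chosen"],
        "rejected": example["prompt"] + example["rejected"],
    }


standard = {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}
conversational = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}

print(to_implicit_prompt(standard)["chosen"])
print(len(to_implicit_prompt(conversational)["chosen"]))
```

Because the same `+` operator works for strings and for lists of messages, one helper covers both dataset formats in this sketch.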
### Computing the loss
Let \\( x \\) be the input sequence (prompt) and \\( y^+ \\) and \\( y^- \\) be the chosen and rejected sequences respectively. Under the Bradley-Terry model ([Bradley & Terry, 1952](https://www.jstor.org/stable/2334029)), the probability that \\( y^+ \\) is preferred over \\( y^- \\) given a reward function \\( r \\) is \\( p(y^+ \succ y^- \mid x) = \sigma(r(x, y^+) - r(x, y^-)) \\), where \\( \sigma \\) is the sigmoid function.
The reward model \\( r_\theta(x, y) \\) is trained to assign higher scores to preferred responses \\( y^+ \\) over non-preferred ones \\( y^- \\). The loss is then defined as the negative log-likelihood of the observed preferences:
$$
\mathcal{L}(\theta) = - \mathbb{E}_{(x,y^+,y^-) \sim \mathcal{D}} \left[ \log \sigma(r_\theta(x, y^+) - r_\theta(x, y^-)) \right].
$$
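As a rough illustration of the loss above, here is a minimal pure-Python sketch. It assumes the reward scores have already been computed and are available as plain lists of floats; the function names and the sample values are hypothetical, and the trainer itself operates on batched tensors.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean negative log-likelihood that each chosen response beats its rejected one."""
    losses = [-log_sigmoid(rc - rr) for rc, rr in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)


# Hypothetical reward scores for three preference pairs
chosen = [1.2, 0.3, -0.5]
rejected = [0.4, 0.9, -0.5]
print(bradley_terry_loss(chosen, rejected))
```

When the two rewards of a pair are equal, the pair contributes \\( \log 2 \\) to the loss; the contribution shrinks toward zero as the chosen reward exceeds the rejected one by a growing margin.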
> [!TIP]
> The Bradley-Terry model is underdetermined, meaning that adding a constant to all rewards does not change the preference probabilities. To address this, [Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking](https://huggingface.co/papers/2312.09244) proposes adding an auxiliary loss term that encourages the rewards to be centered around zero. This is controlled by the `center_rewards_coefficient` parameter in the [`RewardConfig`]. The recommended value is `1e-2`.
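
The sketch below illustrates why a centering term helps. It assumes the auxiliary term takes the form of the mean squared sum of the chosen and rejected rewards, scaled by the coefficient; that exact form is an assumption made here for illustration, and `centering_penalty` is a hypothetical name.

```python
def centering_penalty(chosen_rewards, rejected_rewards, coefficient=1e-2):
    """Assumed auxiliary term: coefficient * mean((r_chosen + r_rejected) ** 2)."""
    squared_sums = [(rc + rr) ** 2 for rc, rr in zip(chosen_rewards, rejected_rewards)]
    return coefficient * sum(squared_sums) / len(squared_sums)


# Shifting every reward by a constant leaves the reward differences (and thus
# the preference probabilities) unchanged, but it increases the penalty.
centered = centering_penalty([1.0], [-1.0])
shifted = centering_penalty([11.0], [9.0])
print(centered, shifted)
```

Only the penalty distinguishes the centered rewards from the shifted ones, which is what removes the ambiguity described in the tip.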
## Logged metrics
While training and evaluating we record the following reward metrics:
* `global_step`: The total number of optimizer steps taken so far.
* `epoch`: The current epoch number, based on dataset iteration.
* `num_tokens`: The total number of tokens processed so far.
* `loss`: The average loss over the last logging interval.
* `accuracy`: The proportion of correct predictions (i.e., the model assigned a higher score to the chosen response than to the rejected one) averaged over the last logging interval.
* `min_reward`: The minimum reward score assigned by the model. This value is averaged over the logging interval.
* `mean_reward`: The average reward score assigned by the model over the last logging interval.
* `max_reward`: The maximum reward score assigned by the model. This value is averaged over the logging interval.
* `margin`: The average margin (difference between chosen and rejected rewards) over the last logging interval.
* `learning_rate`: The current learning rate, which may change dynamically if a scheduler is used.
* `grad_norm`: The L2 norm of the gradients, computed before gradient clipping.
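To make the `accuracy` and `margin` definitions concrete, here is a small sketch that computes both from lists of reward scores for one batch. The function name and the sample values are hypothetical, and the averaging over the logging interval is not shown.

```python
def preference_metrics(chosen_rewards, rejected_rewards):
    """Compute accuracy and margin for one batch of preference pairs."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    # A pair counts as correct when the chosen response scores strictly higher
    accuracy = sum(rc > rr for rc, rr in pairs) / len(pairs)
    margin = sum(rc - rr for rc, rr in pairs) / len(pairs)
    return {"accuracy": accuracy, "margin": margin}


print(preference_metrics([2.0, 0.5, 1.0], [1.0, 1.5, 0.0]))
```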
## Customization
### Model initialization
You can directly pass the kwargs of the [`~transformers.AutoModelForSequenceClassification.from_pretrained()`] method to the [`RewardConfig`]. For example, if you want to load a model in a different precision, analogous to
```python
model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.bfloat16)
```
you can do so by passing the `model_init_kwargs={"dtype": torch.bfloat16}` argument to the [`RewardConfig`].
```python
import torch
from trl import RewardConfig

training_args = RewardConfig(
    model_init_kwargs={"dtype": torch.bfloat16},
)
```
Note that all keyword arguments of [`~transformers.AutoModelForSequenceClassification.from_pretrained()`] are supported, except for `num_labels`, which is automatically set to 1.
### Train adapters with PEFT
We support tight integration with the 🤗 PEFT library, allowing any user to conveniently train adapters and share them on the Hub, rather than training the entire model.
```python
from datasets import load_dataset
from trl import RewardTrainer
from peft import LoraConfig
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = RewardTrainer(
"Qwen/Qwen3-4B",
train_dataset=dataset,
peft_config=LoraConfig(modules_to_save=["score"]) # important to include the score head when base model is not a sequence classification model
)
trainer.train()
```
You can also continue training your [`~peft.PeftModel`]. For that, first load a `PeftModel` outside [`RewardTrainer`] and pass it directly to the trainer, omitting the `peft_config` argument.
```python
from datasets import load_dataset
from trl import RewardTrainer
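from peft import AutoPeftModelForSequenceClassification

# A reward model carries a sequence classification head, so load the adapter with the matching auto class
model = AutoPeftModelForSequenceClassification.from_pretrained("trl-lib/Qwen3-4B-Reward-LoRA", is_trainable=True)
# The reward trainer expects a preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")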
trainer = RewardTrainer(
model=model,
train_dataset=dataset,
)
trainer.train()
```
> [!TIP]
> When training adapters, you typically use a higher learning rate (≈1e-3) since only new parameters are being learned.
>
> ```python
> RewardConfig(learning_rate=1e-3, ...)
> ```
## Tool Calling with Reward Modeling
The [`RewardTrainer`] fully supports fine-tuning models with _tool calling_ capabilities. In this case, each dataset example should include:
* The conversation messages, including any tool calls (`tool_calls`) and tool responses (`tool` role messages)
* The list of available tools in the `tools` column, typically provided as JSON schemas
For details on the expected dataset structure, see the [Dataset Format — Tool Calling](dataset_formats#tool-calling) section.
## RewardTrainer
[[autodoc]] RewardTrainer
- train
- save_model
- push_to_hub
## RewardConfig
[[autodoc]] RewardConfig
## DataCollatorForPreference
[[autodoc]] trainer.reward_trainer.DataCollatorForPreference

15
docs/source/rewards.md Normal file

@ -0,0 +1,15 @@
# Reward Functions
This module contains some useful reward functions, primarily intended for use with the [`GRPOTrainer`] and [`RLOOTrainer`].
## accuracy_reward
[[autodoc]] rewards.accuracy_reward
## think_format_reward
[[autodoc]] rewards.think_format_reward
## get_soft_overlong_punishment
[[autodoc]] rewards.get_soft_overlong_punishment

617
docs/source/rloo_trainer.md Normal file

@ -0,0 +1,617 @@
# RLOO Trainer
[![model badge](https://img.shields.io/badge/All_models-RLOO-blue)](https://huggingface.co/models?other=rloo,trl)
## Overview
TRL supports the RLOO Trainer for training language models, as described in the paper [Back to Basics: Revisiting REINFORCE Style
Optimization for Learning from Human Feedback in LLMs](https://huggingface.co/papers/2402.14740) by [Arash Ahmadian](https://huggingface.co/ArashAhmadian), Chris Cremer, [Matthias Gallé](https://huggingface.co/mgalle), [Marzieh Fadaee](https://huggingface.co/MarziehFadaee), [Julia Kreutzer](https://huggingface.co/JuliaKreutzerCohere), [Ahmet Üstün](https://huggingface.co/ahmetu) and [Sara Hooker](https://huggingface.co/sarahooker).
The abstract from the paper is the following:
> AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed “RL-free” methods such as DPO and RAFT. Our work suggests that careful adaptation to LLMs alignment characteristics enables benefiting from online RL optimization at low cost.
This post-training method was contributed by [Costa Huang](https://github.com/vwxyzjn) and later refactored by [Shirin Yamani](https://huggingface.co/ShirinYamani).
## Quick start
This example demonstrates how to train a model using the RLOO method. We train a [Qwen 0.5B Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with the prompts from the [UltraFeedback prompts dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt). You can view the data in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model.
```python
# train_rloo.py
from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

training_args = RLOOConfig(output_dir="Qwen2-0.5B-RLOO")
trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_rloo.py
```
## Looking deeper into the RLOO method
RLOO is an online learning algorithm, meaning it improves iteratively by using the data generated by the trained model itself during training. The intuition behind the RLOO objective is to maximize the advantage of the generated completions, while ensuring that the model remains close to the reference policy. To understand how RLOO works, it can be broken down into four main steps: **generating completions**, **computing the reward** (which includes the KL penalty), **computing the advantage**, and **computing the loss**.
![RLOO](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/rloo.png)
### Generating completions
At each training step, we sample a batch of prompts and generate a set of \\( G \\) completions for each prompt (denoted as \\( o_i \\)).
### Computing the reward
In RLOO, the reward consists of two components: the reward provided by the reward model (or reward function) and a KL penalty that discourages the policy from deviating too far from a fixed reference policy.
1. For each of the \\( G \\) generated sequences \\( o_i = (o_{i,1}, \dots, o_{i,T}) \\) conditioned on a query \\( q \\), we compute a scalar reward using a reward model \\( R(o_i, q) \\).
2. Concurrently, we estimate the KL divergence between the current policy \\( \pi_\theta \\) and the fixed reference policy \\( \pi_{\text{ref}} \\) over the sequence. The KL estimate for sequence \\( o_i \\) is:
$$
\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta\|\pi_{\mathrm{ref}}\right] = \sum_{t=1}^T \log \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}.
$$
The final reward assigned to sequence \\( o_i \\) is then:
$$
r_i = R(o_i, q) - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \|\pi_{\mathrm{ref}}\right],
$$
where \\( \beta > 0 \\) controls the strength of the KL penalty.
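The two formulas above can be combined in a few lines. This is a minimal sketch, assuming the per-token log-probabilities under the policy and the reference policy are already available as plain lists; the function name and the sample values are hypothetical.

```python
def kl_penalized_reward(score, policy_logprobs, ref_logprobs, beta):
    """r_i = R(o_i, q) - beta * sum_t [log pi_theta(o_t) - log pi_ref(o_t)]."""
    # Monte Carlo KL estimate for one sampled sequence: sum of per-token log-ratios
    kl_estimate = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    return score - beta * kl_estimate


# Hypothetical values for a two-token completion
reward = kl_penalized_reward(
    score=1.0,
    policy_logprobs=[-1.0, -2.0],
    ref_logprobs=[-1.5, -2.5],
    beta=0.1,
)
print(reward)
```

In this example the policy assigns the sampled tokens a higher probability than the reference does, so the KL estimate is positive and the final reward is lower than the raw score.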
> [!TIP]
> In a purely online setting (`num_iterations = 1`, default), the data are generated by the current policy. In this case, the KL penalty is computed directly using the current policy.
>
> In the more general setting (e.g., multiple gradient steps per batch), the data are instead generated by an earlier snapshot \\( \pi_{\text{old}} \\). To keep the penalty consistent with the sampling distribution, the KL is defined with respect to this policy:
>
> $$
> \mathbb{D}_{\mathrm{KL}}\!\left[\pi_{\text{old}} \,\|\, \pi_{\text{ref}}\right].
> $$
>
> Equivalently, for a sampled sequence \\( o_i \\), the Monte Carlo estimate is
>
> $$
> \mathbb{D}_{\mathrm{KL}}\!\left[\pi_{\text{old}} \|\pi_{\mathrm{ref}}\right] = \sum_{t=1}^T \log \frac{\pi_{\text{old}}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}.
> $$
### Computing the advantage
Once the rewards for each completion have been computed, we calculate a baseline as the average reward of all other completions generated for the same prompt, excluding the current one. This baseline is used to reduce the variance of the policy gradient estimate. The advantage for each completion is then obtained as the difference between its own reward and this leave-one-out baseline.
Formally, for a group of \\( G \\) completions, the baseline for completion \\( o_i \\) is:
$$
b_i = \frac{1}{G-1} \sum_{j \neq i} r_j
$$
and then the advantage for each completion is computed as the difference between its reward and the baseline:
$$
\hat{A}_i = r_i - b_i
$$
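A minimal sketch of the leave-one-out computation for the completions of a single prompt, using plain Python lists. The function name and the sample rewards are hypothetical.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for the G completions of one prompt."""
    G = len(rewards)
    if G < 2:
        raise ValueError("The leave-one-out baseline needs at least two completions.")
    total = sum(rewards)
    # Baseline for completion i: mean of the other G - 1 rewards
    return [r - (total - r) / (G - 1) for r in rewards]


advantages = rloo_advantages([1.0, 2.0, 3.0, 6.0])
print(advantages)
```

Computing the sum once and subtracting each reward avoids a nested loop over the group. The advantages of a group always sum to zero, so completions are only ever rewarded relative to their siblings.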
### Computing the loss
The REINFORCE loss is simply defined as:
$$
\mathcal{L}_{\text{RLOO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \hat{A}_i \, \log \pi_\theta(o_i \mid q)
$$
In practice, performing multiple gradient steps on the same batch makes the actions effectively off-policy relative to the current parameters. To correct for this, we introduce the importance sampling ratio. To prevent excessively large updates when the policy changes between sampling and gradient steps, we clip this ratio:
$$
\mathcal{L}_{\text{RLOO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \min \left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)} \hat{A}_i, \, \text{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_i \right)
$$
In a fully online, single-step setting (default), \\( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)} = 1 \\) and this reduces to standard REINFORCE.
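The clipped objective can be sketched as follows, assuming sequence-level log-probabilities under the current and old policies are given as plain lists. The function name is hypothetical, and `epsilon=0.2` is an illustrative value rather than a documented default.

```python
import math


def rloo_clipped_loss(logps, old_logps, advantages, epsilon=0.2):
    """Clipped surrogate loss averaged over the G completions of one prompt."""
    per_sample = []
    for logp, old_logp, adv in zip(logps, old_logps, advantages):
        ratio = math.exp(logp - old_logp)  # importance sampling ratio
        clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms
        per_sample.append(min(ratio * adv, clipped * adv))
    return -sum(per_sample) / len(per_sample)


# On-policy case: the ratio is 1 everywhere, so no term is clipped
print(rloo_clipped_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -3.0]))
```

With a ratio of 1 the loss value is simply the negated mean advantage; its gradient with respect to the policy parameters matches that of the REINFORCE loss above.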
## Logged metrics
While training and evaluating, we record the following reward metrics:
- `num_tokens`: The total number of tokens processed so far, including both prompts and completions.
- `completions/mean_length`: The average length of generated completions.
- `completions/min_length`: The minimum length of generated completions.
- `completions/max_length`: The maximum length of generated completions.
- `completions/mean_terminated_length`: The average length of generated completions that terminate with EOS.
- `completions/min_terminated_length`: The minimum length of generated completions that terminate with EOS.
- `completions/max_terminated_length`: The maximum length of generated completions that terminate with EOS.
- `completions/clipped_ratio`: The ratio of truncated (clipped) completions.
- `reward/{reward_func_name}/mean`: The average reward from a specific reward function.
- `reward/{reward_func_name}/std`: The standard deviation of the reward from a specific reward function.
- `reward`: The overall average reward after applying reward weights.
- `reward_std`: The standard deviation of rewards after applying reward weights. This is the average of the per-group standard deviations.
- `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
- `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, masked sequences tokens are excluded.)
- `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
- `clip_ratio/region_mean`: The ratio of sequence probabilities where the RLOO objective is clipped to stay within the trust region:
$$
\text{clip}\left( r_{i}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \qquad r_{i}(\theta) = \frac{\pi_\theta(o_{i} \mid q)}{\pi_{\theta_{\text{old}}}(o_{i} \mid q)}\,.
$$
A higher value means more samples are clipped, which constrains how much the policy \\( \pi_\theta \\) can change.
- `clip_ratio/low_mean`: The average ratio of sequence probabilities that were clipped on the lower bound of the trust region: \\( r_{i}(\theta) < 1 - \epsilon_\mathrm{low} \\).
- `clip_ratio/low_min`: The minimum ratio of sequence probabilities that were clipped on the lower bound of the trust region: \\( r_{i}(\theta) < 1 - \epsilon_\mathrm{low} \\).
- `clip_ratio/high_mean`: The average ratio of sequence probabilities that were clipped on the upper bound of the trust region: \\( r_{i}(\theta) > 1 + \epsilon_\mathrm{high} \\).
- `clip_ratio/high_max`: The maximum ratio of sequence probabilities that were clipped on the upper bound of the trust region: \\( r_{i}(\theta) > 1 + \epsilon_\mathrm{high} \\).
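As a rough illustration of the clipping metrics, the sketch below counts how many sequence-level ratios fall outside the trust region in one batch. It follows only the inequalities stated in the list; the function name and the epsilon values are hypothetical, an actual implementation may apply further conditions, and how values are aggregated into the `_mean`, `_min`, and `_max` variants is not shown.

```python
def clip_fractions(ratios, eps_low=0.2, eps_high=0.2):
    """Fractions of sequence-level ratios below and above the trust region."""
    n = len(ratios)
    low = sum(r < 1 - eps_low for r in ratios) / n
    high = sum(r > 1 + eps_high for r in ratios) / n
    return {"low": low, "high": high, "region": low + high}


print(clip_fractions([0.5, 1.0, 1.1, 1.5]))
```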
## Customization
### Speed up training with vLLM-powered generation
Generation is often the main bottleneck when training with online methods. To accelerate generation, you can use [vLLM](https://github.com/vllm-project/vllm), a high-throughput, low-latency inference engine for LLMs. To enable it, first install the package with
```shell
pip install trl[vllm]
```
We support two ways of using vLLM during training: **server mode** and **colocate mode**.
#### 🔌 Option 1: Server mode
In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
1. **Start the vLLM server**:
```bash
trl vllm-serve --model <model_name>
```
2. **Enable server mode in your training script**:
```python
from trl import RLOOConfig
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
)
```
> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
#### 🧩 Option 2: Colocate mode
In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
```python
from trl import RLOOConfig
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```
> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors.
>
> We provide a [HF Space](https://huggingface.co/spaces/trl-lib/recommend-vllm-memory) to help estimate the recommended GPU memory utilization based on your model configuration and experiment settings. Use it as follows to get a `vllm_gpu_memory_utilization` recommendation:
>
> <iframe src="https://trl-lib-recommend-vllm-memory.hf.space" frameborder="0" width="850" height="450"></iframe>
>
> If the recommended value does not work in your environment, we suggest adding a small buffer (e.g., +0.05 or +0.1) to the recommended value to ensure stability.
>
> If you still get out-of-memory errors, set `vllm_enable_sleep_mode` to `True`; the vLLM parameters and cache will then be offloaded during the optimization step. For more information, see [Reducing Memory Usage with vLLM Sleep Mode](reducing_memory_usage#vllm-sleep-mode).
> [!TIP]
> By default, RLOO uses `MASTER_ADDR=localhost` and `MASTER_PORT=12345` for vLLM, but you can override these values by setting the environment variables accordingly.
For more information, see [Speeding up training with vLLM](speeding_up_training#vllm-for-fast-generation-in-online-methods).
### RLOO at scale: train a 70B+ Model on multiple nodes
When training large models like **Qwen2.5-72B**, you need several key optimizations to make the training efficient and scalable across multiple GPUs and nodes. These include:
- **DeepSpeed ZeRO Stage 3**: ZeRO leverages data parallelism to distribute model states (weights, gradients, optimizer states) across multiple GPUs and CPUs, reducing memory and compute requirements on each device. Since large models cannot fit on a single GPU, using ZeRO Stage 3 is required for training such models. For more details, see [DeepSpeed Integration](deepspeed_integration).
- **Accelerate**: Accelerate is a library that simplifies distributed training across multiple GPUs and nodes. It provides a simple API to launch distributed training and handles the complexities of distributed training, such as data parallelism, gradient accumulation, and distributed data loading. For more details, see [Distributing Training](distributing_training).
- **vLLM**: See the previous section on how to use vLLM to speed up generation.
Below is an example SLURM script to train a 70B model with RLOO on multiple nodes. This script trains a model on 4 nodes and uses the 5th node for vLLM-powered generation.
```sh
#!/bin/bash
#SBATCH --nodes=5
#SBATCH --gres=gpu:8
# Get the list of allocated nodes
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
# Assign the first 4 nodes for training and the 5th node for vLLM
TRAIN_NODES="${NODELIST[@]:0:4}" # Nodes 0, 1, 2, 3 for training
VLLM_NODE="${NODELIST[4]}" # Node 4 for vLLM
# Run training on the first 4 nodes (Group 1)
srun --nodes=4 --ntasks=4 --nodelist="${NODELIST[@]:0:4}" accelerate launch \
     --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
     --num_processes 32 \
     --num_machines 4 \
     --main_process_ip ${NODELIST[0]} \
     --machine_rank $SLURM_PROCID \
     --rdzv_backend c10d \
     train_rloo.py \
     --vllm_server_host $VLLM_NODE &
# Run vLLM server on the 5th node (Group 2)
srun --nodes=1 --ntasks=1 --nodelist="${NODELIST[4]}" trl vllm-serve --model Qwen/Qwen2.5-72B --tensor_parallel_size 8 &
wait
```
```python
import argparse
from datasets import load_dataset
from trl import RLOOTrainer, RLOOConfig
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--vllm_server_host", type=str, default="", help="The server IP")
    args = parser.parse_args()

    # Example dataset from TLDR
    dataset = load_dataset("trl-lib/tldr", split="train")

    # Dummy reward function: count the number of unique characters in the completions
    def reward_num_unique_chars(completions, **kwargs):
        return [float(len(set(c))) for c in completions]

    training_args = RLOOConfig(
        output_dir="Qwen2.5-72B-RLOO",
        per_device_train_batch_size=4,
        bf16=True,
        gradient_checkpointing=True,
        use_vllm=True,
        vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
    )

    trainer = RLOOTrainer(model="Qwen/Qwen2.5-72B", args=training_args, reward_funcs=reward_num_unique_chars, train_dataset=dataset)
    trainer.train()

if __name__ == "__main__":
    main()
```
### Using a custom reward function
The [`RLOOTrainer`] supports using custom reward functions instead of dense reward models. To ensure compatibility, your reward function must satisfy the following requirements:
1. **Input arguments**:
   - The function must accept the following as keyword arguments:
     - `prompts` (contains the prompts),
     - `completions` (contains the generated completions),
     - `completions_ids` (contains the tokenized completions),
     - `trainer_state` ([`~transformers.TrainerState`]): The current state of the trainer. This can be used to implement dynamic reward functions, such as curriculum learning, where the reward is adjusted based on the training progress.
     - All column names (but `prompt`) that the dataset may have. For example, if the dataset contains a column named `ground_truth`, the function will be called with `ground_truth` as a keyword argument.

     The easiest way to comply with this requirement is to use `**kwargs` in the function signature.
   - Depending on the dataset format, the input will vary:
     - For [standard format](dataset_formats#standard), `prompts` and `completions` will be lists of strings.
     - For [conversational format](dataset_formats#conversational), `prompts` and `completions` will be lists of message dictionaries.
2. **Return value**: The function must return a list of floats. Each float represents the reward corresponding to a single completion.
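To illustrate the `trainer_state` argument, here is a sketch of a dynamic reward whose length penalty grows with training progress. The schedule, the function name, and the `0.01` constant are hypothetical; the sketch relies only on the `global_step` and `max_steps` attributes of `TrainerState`, and uses a `SimpleNamespace` as a stand-in so it can run without a trainer.

```python
from types import SimpleNamespace


def length_curriculum_reward(completions_ids, trainer_state=None, **kwargs):
    """Penalize long completions more strongly as training progresses (illustrative schedule)."""
    progress = 0.0
    if trainer_state is not None and trainer_state.max_steps > 0:
        progress = trainer_state.global_step / trainer_state.max_steps
    penalty_per_token = 0.01 * progress  # hypothetical schedule, grows from 0 to 0.01
    return [1.0 - penalty_per_token * len(ids) for ids in completions_ids]


# Stand-in for transformers.TrainerState, exposing only the two attributes used above
state = SimpleNamespace(global_step=50, max_steps=100)
rewards = length_curriculum_reward(
    prompts=["The sky is", "The sun is"],
    completions=[" blue.", " in the sky."],
    completions_ids=[[6303, 13], [304, 279, 12884, 13]],
    trainer_state=state,
)
print(rewards)
```

The unused `prompts` and `completions` arguments are absorbed by `**kwargs`, which is why the signature only needs to name the inputs the function actually reads.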
#### Example 1: Reward longer completions
Below is an example of a reward function for a standard format that rewards longer completions:
```python
def reward_func(completions_ids, **kwargs):
    """Reward function that assigns higher scores to longer completions (in terms of token count)."""
    return [float(len(ids)) for ids in completions_ids]
```
You can test it as follows:
```python
>>> prompts = ["The sky is", "The sun is"] # not used in the reward function, but the trainer will pass it
>>> completions = [" blue.", " in the sky."] # not used in the reward function, but the trainer will pass it
>>> completions_ids = [[6303, 13], [304, 279, 12884, 13]]
>>> reward_func(prompts=prompts, completions=completions, completions_ids=completions_ids)
[2.0, 4.0]
```
#### Example 1.1: Reward longer completions (based on the number of characters)
Same as the previous example, but this time the reward function is based on the number of characters instead of tokens.
```python
def reward_func(completions, **kwargs):
    """Reward function that assigns higher scores to longer completions (in terms of character count)."""
    return [float(len(completion)) for completion in completions]
```
You can test it as follows:
```python
>>> prompts = ["The sky is", "The sun is"]
>>> completions = [" blue.", " in the sky."]
>>> completions_ids = [[6303, 13], [304, 279, 12884, 13]] # not used in the reward function, but the trainer will pass it
>>> reward_func(prompts=prompts, completions=completions, completions_ids=completions_ids)
[6.0, 12.0]
```
#### Example 2: Reward completions with a specific format
Below is an example of a reward function that checks if the completion has a specific format. This example is inspired by the _format reward_ function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
It is designed for a conversational format, where prompts and completions consist of structured messages.
```python
import re
def format_reward_func(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
```
You can test this function as follows:
```python
>>> prompts = [
...     [{"role": "user", "content": "What is the result of (1 + 2) * 4?"}],
...     [{"role": "user", "content": "What is the result of (3 + 1) * 2?"}],
... ]
>>> completions = [
...     [{"role": "assistant", "content": "<think>The sum of 1 and 2 is 3, which we multiply by 4 to get 12.</think><answer>(1 + 2) * 4 = 12</answer>"}],
...     [{"role": "assistant", "content": "The sum of 3 and 1 is 4, which we multiply by 2 to get 8. So (3 + 1) * 2 = 8."}],
... ]
>>> format_reward_func(prompts=prompts, completions=completions)
[1.0, 0.0]
```
#### Example 3: Reward completions based on a reference
Below is an example of a reward function that checks if the completion is correct. This example is inspired by the _accuracy reward_ function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
This example is designed for [standard format](dataset_formats#standard), where the dataset contains a column named `ground_truth`.
```python
import re
def reward_func(completions, ground_truth, **kwargs):
    # Regular expression to capture content inside \boxed{}
    matches = [re.search(r"\\boxed\{(.*?)\}", completion) for completion in completions]
    contents = [match.group(1) if match else "" for match in matches]
    # Reward 1 if the content is the same as the ground truth, 0 otherwise
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
```
You can test this function as follows:
```python
>>> prompts = ["Problem: Solve the equation $2x + 3 = 7$. Solution:", "Problem: Solve the equation $3x - 5 = 10$."]
>>> completions = [r" The solution is \boxed{2}.", r" The solution is \boxed{6}."]
>>> ground_truth = ["2", "5"]
>>> reward_func(prompts=prompts, completions=completions, ground_truth=ground_truth)
[1.0, 0.0]
```
#### Example 4: Multi-task reward functions
Below is an example of using multiple reward functions in the [`RLOOTrainer`]. In this example, we define two task-specific reward functions: `math_reward_func` and `coding_reward_func`. The `math_reward_func` rewards math problems based on their correctness, while the `coding_reward_func` rewards coding problems based on whether the solution works.
```python
from datasets import Dataset
from trl import RLOOTrainer
# Define a dataset that contains both math and coding problems
dataset = Dataset.from_list(
    [
        {"prompt": "What is 2+2?", "task": "math"},
        {"prompt": "Write a function that returns the sum of two numbers.", "task": "code"},
        {"prompt": "What is 3*4?", "task": "math"},
        {"prompt": "Write a function that returns the product of two numbers.", "task": "code"},
    ]
)

# Math-specific reward function
# (`check_math_solution` is a placeholder for your own verifier)
def math_reward_func(prompts, completions, task, **kwargs):
    rewards = []
    for prompt, completion, t in zip(prompts, completions, task):
        if t == "math":
            # Calculate math-specific reward
            correct = check_math_solution(prompt, completion)
            reward = 1.0 if correct else -1.0
            rewards.append(reward)
        else:
            # Return None for non-math tasks
            rewards.append(None)
    return rewards

# Coding-specific reward function
# (`test_code_solution` is a placeholder for your own test harness)
def coding_reward_func(prompts, completions, task, **kwargs):
    rewards = []
    for prompt, completion, t in zip(prompts, completions, task):
        if t == "code":  # must match the value used in the dataset's `task` column
            # Calculate coding-specific reward
            works = test_code_solution(prompt, completion)
            reward = 1.0 if works else -1.0
            rewards.append(reward)
        else:
            # Return None for non-coding tasks
            rewards.append(None)
    return rewards

# Use both task-specific reward functions
trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[math_reward_func, coding_reward_func],
    train_dataset=dataset,
)
trainer.train()
```
In this example, the `math_reward_func` and `coding_reward_func` are designed to work with a mixed dataset that contains both math and coding problems. The `task` column in the dataset is used to determine which reward function to apply to each problem. If there is no relevant reward function for a sample in the dataset, the reward function will return `None`, and the [`RLOOTrainer`] will continue with the valid functions and tasks. This allows the [`RLOOTrainer`] to handle multiple reward functions with different applicability.
Note that the [`RLOOTrainer`] will ignore the `None` rewards returned by the reward functions and only consider the rewards returned by the relevant functions. This ensures that the model is trained on the relevant tasks and ignores the tasks for which there is no relevant reward function.
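The example above calls `check_math_solution` without defining it. As a rough sketch only, a checker for prompts of the form "What is 2+2?" might look like the following. The function body, the regular expressions, and the assumption that prompts and completions are plain strings are all illustrative and not part of TRL.

```python
import operator
import re

# Hypothetical helper: NOT part of TRL. Assumes plain-string prompts and completions.
_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def check_math_solution(prompt: str, completion: str) -> bool:
    """Return True if the last integer in `completion` equals the result of the
    simple `a <op> b` expression found in `prompt`."""
    expr = re.search(r"(\d+)\s*([+\-*])\s*(\d+)", prompt)
    numbers = re.findall(r"-?\d+", completion)
    if expr is None or not numbers:
        return False
    a, op, b = int(expr.group(1)), expr.group(2), int(expr.group(3))
    return int(numbers[-1]) == _OPS[op](a, b)
```

A real checker would likely need more robust answer extraction (for example, handling decimals or a dedicated answer tag); this version only compares the last integer it finds.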
#### Passing the reward function to the trainer
To use your custom reward function, pass it to the [`RLOOTrainer`] as follows:
```python
from trl import RLOOTrainer
trainer = RLOOTrainer(
reward_funcs=reward_func,
...,
)
```
If you have multiple reward functions, you can pass them as a list:
```python
from trl import RLOOTrainer
trainer = RLOOTrainer(
reward_funcs=[reward_func1, reward_func2],
...,
)
```
and the reward will be computed as the sum of the rewards from each function, or the weighted sum if `reward_weights` is provided in the config.
Note that [`RLOOTrainer`] supports multiple reward functions of different types. See the parameters documentation for more details.
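To make the aggregation concrete, here is a minimal pure-Python sketch of a weighted sum that skips `None` entries. It only illustrates the behavior described above and is not TRL's implementation; `combine_rewards` and its arguments are hypothetical names.

```python
def combine_rewards(rewards_per_func, weights=None):
    """Combine per-function rewards for one sample into a single scalar.

    rewards_per_func: one entry per reward function (float or None).
    weights: optional per-function weights; defaults to 1.0 each (plain sum).
    """
    if weights is None:
        weights = [1.0] * len(rewards_per_func)
    # None means "this function does not apply to this sample", so skip it
    return sum(w * r for r, w in zip(rewards_per_func, weights) if r is not None)

print(combine_rewards([1.0, None]))              # 1.0
print(combine_rewards([1.0, -1.0], [1.0, 0.5]))  # 0.5
```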
## Vision-Language Model (VLM) Training
RLOO supports training Vision-Language Models (VLMs) on multimodal datasets containing both text and images.
### Supported Models
Tested with:
- **Gemma3** — e.g., `google/gemma-3-4b-it`
- **LLaVA-NeXT** — e.g., `llava-hf/llava-v1.6-mistral-7b-hf`
- **Qwen2-VL** — e.g., `Qwen/Qwen2-VL-2B-Instruct`
- **Qwen2.5-VL** — e.g., `Qwen/Qwen2.5-VL-3B-Instruct`
- **SmolVLM2** — e.g., `HuggingFaceTB/SmolVLM2-2.2B-Instruct`
> [!TIP]
> Compatibility with all VLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes.
### Quick Start
Use [rloo\_vlm.py](https://github.com/huggingface/trl/blob/main/examples/scripts/rloo_vlm.py) to fine-tune a VLM. Example command for training on [`lmms-lab/multimodal-open-r1-8k-verified`](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified):
```bash
accelerate launch \
--config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/rloo_vlm.py \
--model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir rloo-Qwen2.5-VL-3B-Instruct \
--learning_rate 1e-5 \
--gradient_checkpointing \
--dtype bfloat16 \
--max_prompt_length 2048 \
--max_completion_length 1024 \
--use_vllm \
--vllm_mode colocate \
--use_peft \
--lora_target_modules q_proj v_proj \
--log_completions
```
### Configuration Tips
> [!WARNING]
> VLM training may fail if image tokens are truncated. We highly recommend disabling truncation by setting `max_prompt_length` to `None`.
- Use LoRA on vision-language projection layers
- Enable 4-bit quantization to reduce memory usage
- VLMs are memory-intensive — start with smaller batch sizes
- Most models are compatible with vLLM (`server` and `colocate` modes)
### Dataset Format
Each training sample should include:
- `prompt`: Text formatted via the processor's chat template
- `image`/`images`: PIL Image or list of PIL Images
The trainer automatically handles image-to-tensor conversion via the model's image processor.
## RLOOTrainer
[[autodoc]] RLOOTrainer
- train
- save_model
- push_to_hub
## RLOOConfig
[[autodoc]] RLOOConfig
## References
1. [RLOO Paper](https://openreview.net/pdf?id=r1lgTGL5DE)
2. [Paper - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs](https://huggingface.co/papers/2402.14740)
3. [Paper - REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https://huggingface.co/papers/2501.03262)
4. [Blog Post - Putting RL back in RLHF](https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo)
5. [Blog Post - Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05)
6. [Youtube - RLOO: A Cost-Efficient Optimization for Learning from Human Feedback in LLMs](https://www.youtube.com/watch?v=86asXGPK6RU&ab_channel=BuzzRobot)
## Migration Guide from the old implementation (0.21 and below)
With the release of version 0.22.0, we have revamped the [`RLOOTrainer`] to be more aligned with other online trainers in the library, like [`GRPOTrainer`]. This new implementation introduces several changes to the configuration parameters and overall structure of the trainer.
Below is a summary of the key changes for [`RLOOConfig`]:
| TRL ≤ 0.21.x | TRL ≥ 0.22.0 |
| --- | --- |
| `rloo_k` | renamed to `num_generations` |
| `cliprange` | renamed to `epsilon` |
| `kl_coef` | renamed to `beta` |
| `exp_name` | renamed to `run_name`. Use `run_name = f"{exp_name}__{seed}__{int(time.time())}"` to replicate old behavior |
| `normalize_reward` | renamed to `normalize_advantages`. Note: this always normalized advantages (despite the old name) |
| `num_ppo_epochs` | renamed to `num_iterations` (default: `1`) |
| `token_level_kl` | **removed**: KL is now computed only at the sequence level |
| `dataset_num_proc` | **removed**: it was unused |
| `num_mini_batches` | renamed to `steps_per_generation` |
| `total_episodes` | use `max_steps=total_episodes / gradient_accumulation_steps` instead |
| `local_rollout_forward_batch_size` | **removed**: now automatically set to `per_device_train_batch_size` (or `per_device_eval_batch_size` during evaluation) |
| `num_sample_generations` | **removed**: use `logging_steps` to control generation logging frequency |
| `response_length` | renamed to `max_completion_length` (default: `256`) |
| `stop_token` | **removed** |
| `stop_token_id` | **removed**: use `processing_class.eos_token_id` instead |
| `missing_eos_penalty` | **removed**: replicate with a custom reward function checking if `eos_token_id` is in `completion_ids` |
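
The renames in the table above can be applied mechanically to an existing set of keyword arguments. The sketch below is a hypothetical helper, not something shipped with TRL: `migrate_rloo_kwargs`, `RENAMED`, and `REMOVED` are illustrative names, and `total_episodes` is deliberately left untouched because it has to be converted to `max_steps` by hand.

```python
# Hypothetical helper, not part of TRL: translate old RLOOConfig keyword
# arguments (TRL <= 0.21.x) into their TRL >= 0.22.0 names.
RENAMED = {
    "rloo_k": "num_generations",
    "cliprange": "epsilon",
    "kl_coef": "beta",
    "exp_name": "run_name",
    "normalize_reward": "normalize_advantages",
    "num_ppo_epochs": "num_iterations",
    "num_mini_batches": "steps_per_generation",
    "response_length": "max_completion_length",
}
REMOVED = {
    "token_level_kl", "dataset_num_proc", "local_rollout_forward_batch_size",
    "num_sample_generations", "stop_token", "stop_token_id", "missing_eos_penalty",
}

def migrate_rloo_kwargs(old_kwargs):
    """Return (new_kwargs, dropped), where `dropped` lists options that no longer exist."""
    new_kwargs, dropped = {}, []
    for name, value in old_kwargs.items():
        if name in REMOVED:
            dropped.append(name)
        else:
            # Unknown names (e.g. learning_rate) pass through unchanged
            new_kwargs[RENAMED.get(name, name)] = value
    return new_kwargs, dropped

new, dropped = migrate_rloo_kwargs(
    {"rloo_k": 4, "kl_coef": 0.05, "stop_token": "eos", "learning_rate": 1e-6}
)
print(new)      # {'num_generations': 4, 'beta': 0.05, 'learning_rate': 1e-06}
print(dropped)  # ['stop_token']
```
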
Below is a summary of the key changes for [`RLOOTrainer`]:
| TRL ≤ 0.21.x | TRL ≥ 0.22.0 |
| --- | --- |
| `config` | renamed to `args` |
| `reward_model` | renamed to `reward_funcs`, which now supports both reward models and custom reward functions |
| `policy` | renamed to `model` |
| `ref_policy` | **removed**: the reference model is now created automatically from `model` |
| `data_collator` | **removed** |


@ -0,0 +1,24 @@
# Scripts Utilities
## ScriptArguments
[[autodoc]] ScriptArguments
## TrlParser
[[autodoc]] TrlParser
- parse_args_and_config
- parse_args_into_dataclasses
- set_defaults_with_config
## get_dataset
[[autodoc]] get_dataset
## DatasetConfig
[[autodoc]] scripts.utils.DatasetConfig
## DatasetMixtureConfig
[[autodoc]] DatasetMixtureConfig


@ -0,0 +1,31 @@
# Sentiment Tuning Examples
The notebooks and scripts in these examples show how to fine-tune a model with a sentiment classifier (such as `lvwerra/distilbert-imdb`).
Here's an overview of the notebooks and scripts in the [trl repository](https://github.com/huggingface/trl/tree/main/examples):
| File | Description |
| --- |--- |
| [`examples/scripts/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) | This script shows how to use the `PPOTrainer` to fine-tune a sentiment analysis model using the IMDB dataset. |
| [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb) | This notebook demonstrates how to reproduce the GPT2 IMDB sentiment tuning example in a Jupyter notebook. |
| [`examples/notebooks/gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-control.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) | This notebook demonstrates how to reproduce the GPT2 sentiment control example in a Jupyter notebook. |
## Usage
```bash
# 1. run directly
python examples/scripts/ppo.py
# 2. run via `accelerate` (recommended), enabling more features (e.g., multiple GPUs, deepspeed)
accelerate config # will prompt you to define the training configuration
accelerate launch examples/scripts/ppo.py # launches training
# 3. get help text and documentation
python examples/scripts/ppo.py --help
# 4. configure logging with wandb and, say, mini_batch_size=1 and gradient_accumulation_steps=16
python examples/scripts/ppo.py --log_with wandb --mini_batch_size 1 --gradient_accumulation_steps 16
```
Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
## A few notes on multi-GPU
To run in a multi-GPU setup with DDP (Distributed Data Parallel), change the `device_map` value to `device_map={"": Accelerator().process_index}` and make sure to run your script with `accelerate launch yourscript.py`. If you want to apply naive pipeline parallelism, you can use `device_map="auto"`.


@ -1,35 +0,0 @@
# Sentiment Examples
The notebooks and scripts in these examples show how to fine-tune a model with a sentiment classifier (such as `lvwerra/distilbert-imdb`).
Here's an overview of the notebooks and scripts in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples):
| File | Description | Colab link |
|---|---| --- |
| [`gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) | Fine-tune GPT2 to generate positive movie reviews. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) |
| [`gpt2-sentiment-control.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) | Fine-tune GPT2 to generate movie reviews with controlled sentiment. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) |
| [`gpt2-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py) | Same as the notebook, but easier to use in a multi-GPU setup. | x |
| [`t5-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/t5-sentiment.py) | Same as GPT2 script, but for a Seq2Seq model (T5). | x |
## Installation
```bash
pip install trl
#optional: wandb
pip install wandb
```
Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
## Launch scripts
The `trl` library is powered by `accelerate`. As such it is best to configure and launch trainings with the following commands:
```bash
accelerate config # will prompt you to define the training configuration
accelerate launch scripts/gpt2-sentiment.py # launches training
```

335
docs/source/sft_trainer.md Normal file

@ -0,0 +1,335 @@
# SFT Trainer
[![All_models-SFT-blue](https://img.shields.io/badge/All_models-SFT-blue)](https://huggingface.co/models?other=sft,trl) [![smol_course-Chapter_1-yellow](https://img.shields.io/badge/smol_course-Chapter_1-yellow)](https://github.com/huggingface/smol-course/tree/main/1_instruction_tuning)
## Overview
TRL supports the Supervised Fine-Tuning (SFT) Trainer for training language models.
This post-training method was contributed by [Younes Belkada](https://huggingface.co/ybelkada).
## Quick start
This example demonstrates how to train a language model using the [`SFTTrainer`] from TRL. We train a [Qwen 3 0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) model on the [Capybara dataset](https://huggingface.co/datasets/trl-lib/Capybara), a compact, diverse multi-turn dataset to benchmark reasoning and generalization.
```python
from trl import SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
model="Qwen/Qwen3-0.6B",
train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```
<iframe src="https://trl-lib-trackio.hf.space/?project=trl-documentation&metrics=train*&runs=sft_qwen3-0.6B_capybara" style="width: 100%; min-width: 300px; max-width: 800px;" height="830" frameBorder="0"></iframe>
## Expected dataset type and format
SFT supports both [language modeling](dataset_formats#language-modeling) and [prompt-completion](dataset_formats#prompt-completion) datasets. The [`SFTTrainer`] is compatible with both [standard](dataset_formats#standard) and [conversational](dataset_formats#conversational) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
```python
# Standard language modeling
{"text": "The sky is blue."}
# Conversational language modeling
{"messages": [{"role": "user", "content": "What color is the sky?"},
{"role": "assistant", "content": "It is blue."}]}
# Standard prompt-completion
{"prompt": "The sky is",
"completion": " blue."}
# Conversational prompt-completion
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
"completion": [{"role": "assistant", "content": "It is blue."}]}
```
If your dataset is not in one of these formats, you can preprocess it to convert it into the expected format. Here is an example with the [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT) dataset:
```python
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["Question"]}],
        "completion": [
            {"role": "assistant", "content": f"<think>{example['Complex_CoT']}</think>{example['Response']}"}
        ],
    }
dataset = dataset.map(preprocess_function, remove_columns=["Question", "Response", "Complex_CoT"])
print(next(iter(dataset["train"])))
```
```json
{
"prompt": [
{
"content": "Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?",
"role": "user",
}
],
"completion": [
{
"content": "<think>Okay, let's see what's going on here. We've got sudden weakness [...] clicks into place!</think>The specific cardiac abnormality most likely to be found in [...] the presence of a PFO facilitating a paradoxical embolism.",
"role": "assistant",
}
],
}
```
## Looking deeper into the SFT method
Supervised Fine-Tuning (SFT) is the simplest and most commonly used method to adapt a language model to a target dataset. The model is trained in a fully supervised fashion using pairs of input and output sequences. The goal is to minimize the negative log-likelihood (NLL) of the target sequence, conditioning on the input.
This section breaks down how SFT works in practice, covering the key steps: **preprocessing**, **tokenization** and **loss computation**.
### Preprocessing and tokenization
During training, each example is expected to contain a **text field** or a **(prompt, completion)** pair, depending on the dataset format. For more details on the expected formats, see [Dataset formats](dataset_formats).
The [`SFTTrainer`] tokenizes each input using the model's tokenizer. If both prompt and completion are provided separately, they are concatenated before tokenization.
### Computing the loss
![sft_figure](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/sft_figure.png)
The loss used in SFT is the **token-level cross-entropy loss**, defined as:
$$
\mathcal{L}_{\text{SFT}}(\theta) = - \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}),
$$
where \\( y_t \\) is the target token at timestep \\( t \\), and the model is trained to predict the next token given the previous ones. In practice, padding tokens are masked out during loss computation.
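As a small worked example of the formula above, the following sketch computes the loss from per-token probabilities with padding masked out. The function name `sft_loss` and the toy probabilities are assumptions made for illustration; a real trainer works on logits and tensors.

```python
import math

def sft_loss(target_probs, loss_mask):
    """Negative log-likelihood summed over non-masked target tokens.

    target_probs[t] is p_theta(y_t | y_<t), the probability the model assigns
    to the correct token at step t; loss_mask[t] is 0 for padding, 1 otherwise.
    """
    return -sum(math.log(p) for p, m in zip(target_probs, loss_mask) if m)

# Three real tokens followed by one padding position
loss = sft_loss([0.5, 0.25, 0.125, 0.9], [1, 1, 1, 0])
print(round(loss, 4))  # 4.1589
```

Dividing this sum by the number of non-masked tokens (3 here) gives the per-token average.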
> [!TIP]
> [On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification](https://huggingface.co/papers/2508.05629) proposes an alternative loss function, called **Dynamic Fine-Tuning (DFT)**, which aims to improve generalization by rectifying the reward signal. This method can be enabled by setting `loss_type="dft"` in the [`SFTConfig`]. For more details, see [Paper Index - Dynamic Fine-Tuning](paper_index#on-the-generalization-of-sft-a-reinforcement-learning-perspective-with-reward-rectification).
### Label shifting and masking
During training, the loss is computed using a **one-token shift**: the model is trained to predict each token in the sequence based on all previous tokens. Specifically, the target at position \\( t \\) is the input token at position \\( t+1 \\), so the labels are the input sequence offset by one position.
Padding tokens (if present) are ignored in the loss computation by applying an ignore index (default: `-100`) to the corresponding positions. This ensures that the loss focuses only on meaningful, non-padding tokens.
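The shift and the padding mask can be illustrated with plain lists. This is a simplified sketch: `build_targets` is a hypothetical name, and identifying padding by comparing against `pad_token_id` is an assumption made for brevity (a real collator would normally rely on the attention mask).

```python
IGNORE_INDEX = -100

def build_targets(input_ids, pad_token_id):
    """Return (context_token, target) pairs for next-token prediction."""
    # Labels start as a copy of the inputs, with padding replaced by the ignore index
    labels = [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
    # One-token shift: position t is scored against the label at position t + 1
    return list(zip(input_ids[:-1], labels[1:]))

pairs = build_targets([5, 8, 13, 0, 0], pad_token_id=0)
print(pairs)  # [(5, 8), (8, 13), (13, -100), (0, -100)]
```

Pairs whose target is `-100` contribute nothing to the loss.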
## Logged metrics
While training and evaluating, we record the following metrics:
* `global_step`: The total number of optimizer steps taken so far.
* `epoch`: The current epoch number, based on dataset iteration.
* `num_tokens`: The total number of tokens processed so far.
* `loss`: The average cross-entropy loss computed over non-masked tokens in the current logging interval.
* `entropy`: The average entropy of the model's predicted token distribution over non-masked tokens.
* `mean_token_accuracy`: The proportion of non-masked tokens for which the model's top-1 prediction matches the ground truth token.
* `learning_rate`: The current learning rate, which may change dynamically if a scheduler is used.
* `grad_norm`: The L2 norm of the gradients, computed before gradient clipping.
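
To illustrate how a metric such as `mean_token_accuracy` treats masked positions, here is a minimal sketch. It assumes the predictions are already aligned with the (shifted) labels and uses a hypothetical standalone function rather than the trainer's own code.

```python
def mean_token_accuracy(predicted_ids, labels, ignore_index=-100):
    """Fraction of non-masked positions where the top-1 prediction equals the label."""
    scored = [(p, y) for p, y in zip(predicted_ids, labels) if y != ignore_index]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)

# The last position is masked, so 2 of the 3 scored tokens match
accuracy = mean_token_accuracy([8, 13, 7, 2], [8, 13, 21, -100])
print(accuracy)
```
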
## Customization
### Model initialization
You can directly pass the kwargs of the [`~transformers.AutoModelForCausalLM.from_pretrained()`] method to the [`SFTConfig`]. For example, if you want to load a model in a different precision, analogous to
```python
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.bfloat16)
```
you can do so by passing the `model_init_kwargs={"dtype": torch.bfloat16}` argument to the [`SFTConfig`].
```python
import torch
from trl import SFTConfig

training_args = SFTConfig(
    model_init_kwargs={"dtype": torch.bfloat16},
)
```
Note that all keyword arguments of [`~transformers.AutoModelForCausalLM.from_pretrained()`] are supported.
### Packing
[`SFTTrainer`] supports _example packing_, where multiple examples are packed in the same input sequence to increase training efficiency. To enable packing, simply pass `packing=True` to the [`SFTConfig`] constructor.
```python
training_args = SFTConfig(packing=True)
```
For more details on packing, see [Packing](reducing_memory_usage#packing).
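To give an intuition for what packing does, the sketch below groups sequences into fixed-size bins with a first-fit decreasing heuristic. This is an illustration of the general idea only; the strategy TRL actually uses may differ, and `pack_first_fit_decreasing` is a hypothetical name.

```python
def pack_first_fit_decreasing(lengths, max_length):
    """Group sequence indices into bins whose total length is <= max_length."""
    bins = []  # each bin: [remaining_capacity, [sequence indices]]
    # Place longer sequences first, then put each one in the first bin with room
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        if lengths[idx] > max_length:
            raise ValueError(f"sequence {idx} is longer than max_length")
        for b in bins:
            if b[0] >= lengths[idx]:
                b[0] -= lengths[idx]
                b[1].append(idx)
                break
        else:
            # No existing bin has room: open a new one
            bins.append([max_length - lengths[idx], [idx]])
    return [indices for _, indices in bins]

print(pack_first_fit_decreasing([6, 3, 5, 2, 4], max_length=8))  # [[0, 3], [2, 1], [4]]
```

Five sequences that would otherwise occupy five padded rows fit into three rows of length 8, which is where the efficiency gain comes from.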
### Train on assistant messages only
To train on assistant messages only, use a [conversational](dataset_formats#conversational) dataset and set `assistant_only_loss=True` in the [`SFTConfig`]. This setting ensures that loss is computed **only** on the assistant responses, ignoring user or system messages.
```python
training_args = SFTConfig(assistant_only_loss=True)
```
![train_on_assistant](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/train_on_assistant.png)
> [!WARNING]
> This functionality is only available for chat templates that support returning the assistant tokens mask via the `&#123;% generation %&#125;` and `&#123;% endgeneration %&#125;` keywords. For an example of such a template, see [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B/blob/main/chat_template.jinja#L76-L82).
### Train on completion only
To train on completion only, use a [prompt-completion](dataset_formats#prompt-completion) dataset. By default, the trainer computes the loss on the completion tokens only, ignoring the prompt tokens. If you want to train on the full sequence, set `completion_only_loss=False` in the [`SFTConfig`].
![train_on_completion](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/train_on_completion.png)
> [!TIP]
> Training on completion only is compatible with training on assistant messages only. In this case, use a [conversational](dataset_formats#conversational) [prompt-completion](dataset_formats#prompt-completion) dataset and set `assistant_only_loss=True` in the [`SFTConfig`].
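
Conceptually, completion-only training amounts to masking the prompt positions in the labels. The following sketch uses made-up token IDs and a hypothetical `completion_only_labels` function to show the idea; it is not the trainer's actual collator.

```python
IGNORE_INDEX = -100

def completion_only_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion; keep labels only on completion tokens."""
    input_ids = prompt_ids + completion_ids
    completion_mask = [0] * len(prompt_ids) + [1] * len(completion_ids)
    labels = [tok if m else IGNORE_INDEX for tok, m in zip(input_ids, completion_mask)]
    return input_ids, completion_mask, labels

input_ids, mask, labels = completion_only_labels([101, 7, 9], [42, 43, 2])
print(labels)  # [-100, -100, -100, 42, 43, 2]
```
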
### Train adapters with PEFT
We support tight integration with the 🤗 PEFT library, allowing any user to conveniently train adapters and share them on the Hub, rather than training the entire model.
```python
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
"Qwen/Qwen3-0.6B",
train_dataset=dataset,
peft_config=LoraConfig()
)
trainer.train()
```
You can also continue training your [`~peft.PeftModel`]. For that, first load a `PeftModel` outside [`SFTTrainer`] and pass it directly to the trainer without the `peft_config` argument being passed.
```python
from datasets import load_dataset
from trl import SFTTrainer
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained("trl-lib/Qwen3-4B-LoRA", is_trainable=True)
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
)
trainer.train()
```
> [!TIP]
> When training adapters, you typically use a higher learning rate (≈1e-4) since only new parameters are being learned.
>
> ```python
> SFTConfig(learning_rate=1e-4, ...)
> ```
### Train with Liger Kernel
Liger Kernel is a collection of Triton kernels for LLM training that boosts multi-GPU throughput by 20%, cuts memory use by 60% (enabling up to 4× longer context), and works seamlessly with tools like FlashAttention, PyTorch FSDP, and DeepSpeed. For more information, see [Liger Kernel Integration](liger_kernel_integration).
### Train with Unsloth
Unsloth is an open-source framework for fine-tuning and reinforcement learning that trains LLMs (like Llama, Mistral, Gemma, DeepSeek, and more) up to 2× faster with up to 70% less VRAM, while providing a streamlined, Hugging Face-compatible workflow for training, evaluation, and deployment. For more information, see [Unsloth Integration](unsloth_integration).
## Instruction tuning example
**Instruction tuning** teaches a base language model to follow user instructions and engage in conversations. This requires:
1. **Chat template**: Defines how to structure conversations into text sequences, including role markers (user/assistant), special tokens, and turn boundaries. Read more about chat templates in [Chat templates](https://huggingface.co/docs/transformers/chat_templating#templates).
2. **Conversational dataset**: Contains instruction-response pairs
This example shows how to transform the [Qwen 3 0.6B Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) model into an instruction-following model using the [Capybara dataset](https://huggingface.co/datasets/trl-lib/Capybara) and a chat template from [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B). The SFT Trainer automatically handles tokenizer updates and special token configuration.
```python
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
model="Qwen/Qwen3-0.6B-Base",
args=SFTConfig(
output_dir="Qwen3-0.6B-Instruct",
chat_template_path="HuggingFaceTB/SmolLM3-3B",
),
train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
```
> [!WARNING]
> Some base models, like those from Qwen, have a predefined chat template in the model's tokenizer. In these cases, it is not necessary to apply [`clone_chat_template()`], as the tokenizer already handles the formatting. However, it is necessary to align the EOS token with the chat template to ensure the model's responses terminate correctly. In these cases, specify `eos_token` in [`SFTConfig`]; for example, for `Qwen/Qwen2.5-1.5B`, one should set `eos_token="<|im_end|>"`.
Once trained, your model can now follow instructions and engage in conversations using its new chat template.
```python
>>> from transformers import pipeline
>>> pipe = pipeline("text-generation", model="Qwen3-0.6B-Instruct/checkpoint-5000")
>>> prompt = "<|im_start|>user\nWhat is the capital of France? Answer in one word.<|im_end|>\n<|im_start|>assistant\n"
>>> response = pipe(prompt)
>>> response[0]["generated_text"]
'<|im_start|>user\nWhat is the capital of France? Answer in one word.<|im_end|>\n<|im_start|>assistant\nThe capital of France is Paris.'
```
Alternatively, use the structured conversation format (recommended):
```python
>>> prompt = [{"role": "user", "content": "What is the capital of France? Answer in one word."}]
>>> response = pipe(prompt)
>>> response[0]["generated_text"]
[{'role': 'user', 'content': 'What is the capital of France? Answer in one word.'}, {'role': 'assistant', 'content': 'The capital of France is Paris.'}]
```
## Tool Calling with SFT
The [`SFTTrainer`] fully supports fine-tuning models with _tool calling_ capabilities. In this case, each dataset example should include:
* The conversation messages, including any tool calls (`tool_calls`) and tool responses (`tool` role messages)
* The list of available tools in the `tools` column, typically provided as JSON schemas
For details on the expected dataset structure, see the [Dataset Format — Tool Calling](dataset_formats#tool-calling) section.
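As a rough illustration of the two ingredients listed above, a single record might look like the sketch below. The `get_temperature` tool and all values are invented for this example, and the exact field layout is an assumption; the linked Dataset Format section remains the authoritative description.

```python
# Illustrative record only: the weather tool and its values are made up.
example = {
    "messages": [
        {"role": "user", "content": "What is the temperature in Paris?"},
        {
            "role": "assistant",
            "tool_calls": [
                {"type": "function", "function": {"name": "get_temperature", "arguments": {"city": "Paris"}}}
            ],
        },
        {"role": "tool", "name": "get_temperature", "content": "18.5"},
        {"role": "assistant", "content": "It is currently 18.5°C in Paris."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_temperature",
                "description": "Get the current temperature for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string", "description": "City name"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

# Every tool the assistant calls should be declared in the `tools` column
declared = {tool["function"]["name"] for tool in example["tools"]}
called = {
    call["function"]["name"]
    for message in example["messages"]
    for call in message.get("tool_calls", [])
}
print(called <= declared)  # True
```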
## Training Vision Language Models
[`SFTTrainer`] fully supports training Vision-Language Models (VLMs). To train a VLM, you need to provide a dataset with an additional `images` column containing the images to be processed. For more information on the expected dataset structure, see the [Dataset Format — Vision Dataset](dataset_formats#vision-dataset) section.
An example of such a dataset is the [LLaVA Instruct Mix](https://huggingface.co/datasets/trl-lib/llava-instruct-mix).
```python
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
model="Qwen/Qwen2.5-VL-3B-Instruct",
args=SFTConfig(max_length=None),
train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()
```
> [!TIP]
> For VLMs, truncating may remove image tokens, leading to errors during training. To avoid this, set `max_length=None` in the [`SFTConfig`]. This allows the model to process the full sequence length without truncating image tokens.
>
> ```python
> SFTConfig(max_length=None, ...)
> ```
>
> Only use `max_length` when you've verified that truncation won't remove image tokens for the entire dataset.
## SFTTrainer
[[autodoc]] SFTTrainer
- train
- save_model
- push_to_hub
## SFTConfig
[[autodoc]] SFTConfig
## DataCollatorForLanguageModeling
[[autodoc]] trainer.sft_trainer.DataCollatorForLanguageModeling
## DataCollatorForVisionLanguageModeling
[[autodoc]] trainer.sft_trainer.DataCollatorForVisionLanguageModeling


@ -0,0 +1,97 @@
# Speeding Up Training
> [!WARNING]
> Section under construction. Feel free to contribute!
## vLLM for fast generation in online methods
Online methods such as GRPO or Online DPO require the model to generate completions, which is often a slow process and can significantly impact training time.
To speed up generation, you can use [vLLM](https://github.com/vllm-project/vllm), a library that enables fast generation through, among other things, PagedAttention. TRL's online trainers support vLLM, greatly improving training speed.
To use [vLLM](https://github.com/vllm-project/vllm), first install it using:
```bash
pip install trl[vllm]
```
<hfoptions id="vllm examples">
<hfoption id="Online DPO">
Then, enable it by passing `use_vllm=True` in the training arguments.
```python
from trl import OnlineDPOConfig
training_args = OnlineDPOConfig(..., use_vllm=True)
```
</hfoption>
<hfoption id="GRPO">
First, start a vLLM server by running:
```bash
trl vllm-serve --model <model_name>
```
Then, run the training script and pass `use_vllm=True` in the training arguments.
```python
from trl import GRPOConfig
training_args = GRPOConfig(..., use_vllm=True)
```
You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
> [!WARNING]
> When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`.
>
> Set GPUs **0-3** for vLLM generation:
>
> ```sh
> CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name>
> ```
>
> And GPUs **4-7** for training:
>
> ```sh
> CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
> ```
</hfoption>
<hfoption id="RLOO">
First, start a vLLM server by running:
```bash
trl vllm-serve --model <model_name>
```
Then, run the training script and pass `use_vllm=True` in the training arguments.
```python
from trl import RLOOConfig
training_args = RLOOConfig(..., use_vllm=True)
```
You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
> [!WARNING]
> When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`.
>
> Set GPUs **0-3** for vLLM generation:
>
> ```sh
> CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model <model_name>
> ```
>
> And GPUs **4-7** for training:
>
> ```sh
> CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
> ```
</hfoption>
</hfoptions>


@ -1,30 +0,0 @@
# Summarization Example
The script in this example shows how to train a reward model for summarization, following the OpenAI Learning to Summarize from Human Feedback [paper](https://arxiv.org/abs/2009.01325). We've validated that the script can be used to train a small GPT2 to get slightly over 60% validation accuracy, which is aligned with results from the paper. The model is [here](https://huggingface.co/Tristan/gpt2_reward_summarization).
Here's an overview of the relevant files in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples):
| File | Description |
|---|---|
| `scripts/reward_summarization.py` | For tuning the reward model. |
| `scripts/ds3_reward_summarization_example_config.json` | Can be used with the reward model script to scale it up to arbitrarily big models that don't fit on a single GPU. |
## Installation
```bash
pip install trl
pip install evaluate
# optional: deepspeed
pip install deepspeed
```
```bash
# If you want your reward model to follow the Learning to Summarize from Human Feedback paper closely, then tune a GPT model on summarization and then instantiate the reward model
# with it. In other words, pass in the name of your summarization-finetuned gpt on the hub, instead of the name of the pretrained gpt2 like we do in the following examples of how
# to run this script.
# Example of running this script with the small size gpt2 on a 40GB A100 (A100's support bf16). Here, the global batch size will be 64:
python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16
# Example of running this script with the xl size gpt2 on 16 40GB A100's. Here the global batch size will still be 64:
python -m torch.distributed.launch --nproc_per_node=16 reward_summarization.py --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --gradient_accumulation_steps=4 --gpt_model_name=gpt2-xl --bf16 --deepspeed=ds3_reward_summarization_example_config.json
```
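The global batch size quoted in the comments above can be checked with a little arithmetic. The sketch below assumes the usual data-parallel relationship (processes × per-device batch size × gradient accumulation steps); the helper name `global_batch_size` is illustrative and not part of the script. It only covers the gpt2-xl command, since the single-GPU command relies on script defaults that are not shown here.

```python
def global_batch_size(num_processes, per_device_batch_size, gradient_accumulation_steps):
    """Number of samples contributing to one optimizer step across all processes."""
    return num_processes * per_device_batch_size * gradient_accumulation_steps


# gpt2-xl command above: 16 processes, per-device batch size 1, 4 accumulation steps
print(global_batch_size(16, 1, 4))  # 64
```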


@ -0,0 +1,67 @@
# Trackio Integration
[Trackio](https://huggingface.co/docs/trackio) is a lightweight, free experiment tracking library built on top of **🤗 Datasets** and **🤗 Spaces**. It is the **recommended tracking solution for TRL** and comes natively integrated with all trainers.
To enable logging, simply set `report_to="trackio"` in your training config:
```python
from trl import SFTConfig # works with any trainer config (e.g. DPOConfig, GRPOConfig, etc.)
training_args = SFTConfig(
...,
report_to="trackio", # enable Trackio logging
)
```
## Organizing Your Experiments with Run Names and Projects
By default, Trackio will generate a name to identify each run. However, we highly recommend setting a descriptive `run_name` to make it easier to organize experiments. For example:
```python
from trl import SFTConfig
training_args = SFTConfig(
...,
report_to="trackio",
run_name="sft_qwen3-4b_lr2e-5_bs128", # descriptive run name
)
```
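If you launch many runs, it can help to build the run name from the hyperparameters instead of typing it by hand. The helper below is a minimal sketch of that idea; `make_run_name` and its naming scheme are hypothetical and not part of TRL or Trackio.

```python
def make_run_name(method, model, learning_rate, batch_size):
    """Build a run name such as 'sft_qwen3-4b_lr2e-5_bs128' (hypothetical helper)."""
    # Compact learning-rate tag, rounded to one significant digit: 2e-05 -> "2e-5"
    mantissa, exponent = f"{learning_rate:.0e}".split("e")
    lr_tag = f"{mantissa}e{int(exponent)}"
    return f"{method}_{model}_lr{lr_tag}_bs{batch_size}"


run_name = make_run_name("sft", "qwen3-4b", 2e-5, 128)
print(run_name)  # sft_qwen3-4b_lr2e-5_bs128
```

The resulting string can then be passed as `run_name` in the training config.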
You can also group related experiments by project by setting the following environment variable:
```bash
export TRACKIO_PROJECT="my_project"
```
## Hosting Your Logs on 🤗 Spaces
Trackio has a local-first design, meaning your logs stay on your machine. If you'd like to host them and deploy a dashboard on **🤗 Spaces**, set:
```bash
export TRACKIO_SPACE_ID="username/space_id"
```
Running the following example:
```python
import os

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

os.environ["TRACKIO_SPACE_ID"] = "trl-lib/trackio"
os.environ["TRACKIO_PROJECT"] = "trl-documentation"

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    args=SFTConfig(
        report_to="trackio",
        run_name="sft_qwen3-0.6b_capybara",
    ),
)
trainer.train()
```
will give you a hosted dashboard at https://huggingface.co/spaces/trl-lib/trackio.
<iframe src="https://trl-lib-trackio.hf.space/?project=trl-documentation&sidebar=hidden&runs=sft_qwen3-0.6B_capybara" style="width: 100%; min-width: 300px; max-width: 800px;" height="830" frameBorder="0"></iframe>


@ -1,16 +0,0 @@
# Trainer
At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper "Fine-Tuning Language Models from Human Preferences" by D. Ziegler et al. [[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)].
The Trainer and model classes are largely inspired by the `transformers.Trainer` and `transformers.AutoModel` classes and adapted for RL.
## PPOConfig
[[autodoc]] PPOConfig
## PPOTrainer
[[autodoc]] PPOTrainer
## set_seed
[[autodoc]] set_seed


@ -0,0 +1,125 @@
# Unsloth Integration
Unsloth is an open-source framework for fine-tuning and reinforcement learning that trains LLMs (like Llama, OpenAI gpt-oss, Mistral, Gemma, DeepSeek, and more) up to 2× faster with up to 80% less VRAM. Unsloth supports [training](https://huggingface.co/docs/trl/en/unsloth_integration#Training), evaluation, inference, and [deployment](https://huggingface.co/docs/trl/en/unsloth_integration#Saving-the-model) to other inference engines such as llama.cpp, Ollama, and vLLM.
The library provides a streamlined, Hugging Face compatible workflow for training, evaluation, inference and deployment and is fully compatible with [`SFTTrainer`].
## Key Features
- Training support for all transformer compatible models: Text-to-speech (TTS), multimodal, BERT, RL and more
- Supports full fine-tuning, pretraining, LoRA, QLoRA, 8-bit training & more
- Works on Linux, Windows, Colab, and Kaggle with NVIDIA GPUs; support for AMD and Intel setups is coming soon
- Supports most features TRL supports, including RLHF (GSPO, GRPO, DPO etc.)
- Hand-written Triton kernels and a manual backprop engine ensure no accuracy degradation (0% approximation error)
## Installation
### pip install
Local Installation (Linux recommended):
```sh
pip install unsloth
```
You can also install `unsloth` according to the [official documentation](https://docs.unsloth.ai/get-started/installing-+-updating). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading [`~transformers.AutoModelForCausalLM`], you just need to load a `FastLanguageModel` as follows:
```python
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

max_length = 2048  # Supports automatic RoPE Scaling, so choose any number

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b",
    max_seq_length=max_length,
    dtype=None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,  # Use 4bit quantization to reduce memory usage. Can be False
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Dropout = 0 is currently optimized
    bias="none",  # Bias = "none" is currently optimized
    use_gradient_checkpointing=True,
    random_state=3407,
)

# Example dataset (an assumption for illustration); replace it with your own
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(output_dir="./output", max_length=max_length)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth).
### Docker Install
```sh
docker run -d -e JUPYTER_PASSWORD="mypassword" \
-p 8888:8888 -p 2222:22 \
-v $(pwd)/work:/workspace/work \
--gpus all \
unsloth/unsloth
```
Access Jupyter Lab at ```http://localhost:8888``` and start fine-tuning!
## Training
These are some core settings you can toggle before training:
- ```max_seq_length = 2048``` Controls context length. While Llama-3 supports 8192, we recommend 2048 for testing. Unsloth enables 4× longer context fine-tuning.
- ```dtype = None``` Defaults to None, which auto-detects the data type. You can also set torch.float16 (e.g., Tesla T4, V100) or torch.bfloat16 (Ampere and newer GPUs).
- ```load_in_4bit = True``` Enables 4-bit quantization, reducing memory use 4× for fine-tuning. Disabling it allows for LoRA 16-bit fine-tuning to be enabled.
- To enable full fine-tuning (FFT), set ```full_finetuning = True```. For 8-bit fine-tuning, set ```load_in_8bit = True```. Note: Only one training method can be set to True at a time.
For more information on configuring Unsloth's hyperparameters and features, read their [documentation guide here](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide).
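Since only one of the training-method flags above can be enabled at a time, it can be handy to validate a configuration before loading a model. The following is a minimal sketch in plain Python; `select_training_mode` and the labels it returns are hypothetical and not part of Unsloth's API.

```python
def select_training_mode(load_in_4bit=False, load_in_8bit=False, full_finetuning=False):
    """Return a label for the chosen training method, or raise if several flags are set."""
    flags = {
        "load_in_4bit": load_in_4bit,
        "load_in_8bit": load_in_8bit,
        "full_finetuning": full_finetuning,
    }
    enabled = [name for name, value in flags.items() if value]
    if len(enabled) > 1:
        raise ValueError(f"Only one training method can be enabled, got: {enabled}")
    if not enabled:
        # With no flag enabled, the settings above describe 16-bit LoRA fine-tuning
        return "lora_16bit"
    labels = {
        "load_in_4bit": "4bit",
        "load_in_8bit": "8bit",
        "full_finetuning": "full_finetuning",
    }
    return labels[enabled[0]]


print(select_training_mode(load_in_4bit=True))  # 4bit
```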
## Saving the model
Unsloth allows you to directly save the finetuned model as a small file called a LoRA adapter. You can instead push to the Hugging Face hub as well if you want to upload your model! Remember to get a [Hugging Face token](https://huggingface.co/settings/tokens) and add your token!
### Saving to GGUF
To save to GGUF, Unsloth uses llama.cpp. To save locally:
```python
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "f16")
```
To push to the hub:
```python
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q8_0")
```
### Saving to vLLM
To save to 16-bit for vLLM, use:
```python
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
```

58
docs/source/use_model.md Normal file

@ -0,0 +1,58 @@
# Use model after training
Once you have trained a model using either the SFTTrainer, PPOTrainer, or DPOTrainer, you will have a fine-tuned model that can be used for text generation. In this section, we'll walk through the process of loading the fine-tuned model and generating text. If you need to run an inference server with the trained model, you can explore libraries such as [`text-generation-inference`](https://github.com/huggingface/text-generation-inference).
## Load and Generate
If you have fine-tuned a model fully, meaning without the use of PEFT, you can simply load it like any other language model in transformers. For example, the value head that was trained during PPO training is no longer needed, and if you load the model with the original transformers class, it will be ignored:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name_or_path = "kashif/stack-llama-2" #path/to/your/model/or/name/on/hub
device = "cpu" # or "cuda" if you have a GPU
model = AutoModelForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
inputs = tokenizer.encode("This movie was really", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
Alternatively you can also use the pipeline:
```python
from transformers import pipeline
model_name_or_path = "kashif/stack-llama-2" #path/to/your/model/or/name/on/hub
pipe = pipeline("text-generation", model=model_name_or_path)
print(pipe("This movie was really")[0]["generated_text"])
```
## Use Adapters PEFT
```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model_name = "kashif/stack-llama-2" #path/to/your/model/or/name/on/hub
adapter_model_name = "path/to/my/adapter"
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
```
You can also merge the adapters into the base model so you can use the model like a normal transformers model, however the checkpoint will be significantly bigger:
```python
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)
model = model.merge_and_unload()
model.save_pretrained("merged_adapters")
```
Once the model is loaded, with the adapters either merged or kept separate on top of the base model, you can run generation as with a normal model, as outlined above.


@ -0,0 +1,499 @@
# vLLM Integration
This document will guide you through the process of using vLLM with TRL for faster generation in online methods like GRPO and Online DPO. We first summarize a tl;dr on how to use vLLM with TRL, and then we will go into the details of how it works under the hood.
> [!WARNING]
> TRL currently only supports vLLM version `0.10.2`. Please ensure you have this version installed to avoid compatibility issues.
> [!TIP]
> The following trainers currently support generation with vLLM:
>
> - [`GRPOTrainer`]
> - [`OnlineDPOTrainer`]
> - [`NashMDTrainer`]
> - [`XPOTrainer`]
> - [`RLOOTrainer`]
## 🚀 How can I use vLLM with TRL to speed up training?
💡 **Note**: Resources required for this specific example: a single node with 8 GPUs.
> [!WARNING]
> When using vLLM with TRL, the **vLLM server** and the **trainer** must run on **separate CUDA devices** to prevent conflicts.
> For guidance on configuring this properly, see [Modes of using vLLM during training](#modes-of-using-vllm-during-training).
First, install vLLM using the following command:
```bash
pip install "trl[vllm]"
```
Then run the server on specific GPUs (e.g., GPUs 0-3):
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2
```
Once the server is running, you can use it to generate completions for training. The examples below show each supported trainer using the vLLM server for generation. The `--tensor-parallel-size` and `--data-parallel-size` arguments control how the model and data are sharded across GPUs.
In this example, we are sharding two copies of the model across 4 GPUs. Increasing data parallelism increases throughput, while increasing tensor parallelism allows for serving larger models. Then, run the training script on different GPUs (e.g., GPUs 4-7) by passing `use_vllm=True` in the training arguments as follows:
Sample of a simple `train.py` script:
<hfoptions id="vllm examples">
<hfoption id="GRPO">
```python
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = GRPOConfig(
    output_dir="my_test",
    use_vllm=True,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)

trainer.train()
```
</hfoption>
<hfoption id="OnlineDPO">
```python
from datasets import load_dataset
from trl import OnlineDPOTrainer, OnlineDPOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = OnlineDPOConfig(
    output_dir="my_test",
    use_vllm=True,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = OnlineDPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)

trainer.train()
```
</hfoption>
<hfoption id="NashMD">
```python
from datasets import load_dataset
from trl import NashMDTrainer, NashMDConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = NashMDConfig(
    output_dir="my_test",
    use_vllm=True,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = NashMDTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)

trainer.train()
```
</hfoption>
<hfoption id="XPO">
```python
from datasets import load_dataset
from trl import XPOTrainer, XPOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = XPOConfig(
    output_dir="my_test",
    use_vllm=True,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = XPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)

trainer.train()
```
</hfoption>
<hfoption id="RLOO">
```python
from datasets import load_dataset
from trl import RLOOTrainer, RLOOConfig

dataset = load_dataset("trl-lib/tldr", split="train")


# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


training_args = RLOOConfig(
    output_dir="my_test",
    use_vllm=True,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = RLOOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=training_args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)

trainer.train()
```
</hfoption>
</hfoptions>
And the train command on separate GPUs from the server:
```sh
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
```
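Before launching a full run, it can be useful to sanity-check the reward function on its own. The sketch below calls the dummy reward function from the scripts above on a few hand-written strings; it assumes each completion is a plain string, and the sample inputs are made up for illustration.

```python
# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]


sample_completions = ["aaaa", "hello", "abc abc"]
print(reward_num_unique_chars(sample_completions))  # [1, 4, 4]
```

The function returns one reward per completion, in the same order as the input.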
## Why use vLLM?
### 🎬 Flashback: Why do we need to use vLLM in online methods?
Online methods like GRPO or Online DPO require the model to generate completions during training, which are then used to compute reward signals. However, generation can be extremely time-consuming, especially with large or reasoning models. In the default setup (without vLLM), completions are generated using the [(unwrapped) model's `generate` method](https://github.com/huggingface/trl/blob/f3e8c2304428ef16e9ae5de9e5741ed84d533b7b/trl/trainer/grpo_trainer.py#L965C39-L965C66). This approach quickly becomes a major bottleneck — generation is slow and inefficient, particularly for large batches or models. As a result, training times increase significantly, and overall efficiency drops. To address this, we turn to vLLM, which enables much faster and more scalable generation, helping eliminate this bottleneck in online methods.
### 🤔 How does vLLM solve the slow generation issue?
If you've ever done autoregressive decoder training, you know that all the input tokens to the LLM produce attention key and value tensors, which are kept in GPU memory so that subsequent tokens can be generated from them. These cached key and value tensors are often referred to as the KV cache. However, storing the KV cache occupies a lot of memory, so vLLM uses a technique called **PagedAttention** to solve this problem. PagedAttention, which is inspired by the OS's virtual memory concept, stores continuous keys and values in **non-contiguous memory space**, which is much more efficient. The details of this are beyond the scope of this document, but in short, it allows the model to store the keys and values in a more efficient way, reducing the memory footprint and speeding up the generation process. If you are interested, make sure to check out the [vLLM PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) blog post for more details.
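To make the paging idea concrete, here is a toy sketch of a block table in plain Python. It is not vLLM's implementation: the class `ToyPagedKVCache`, its methods, and the block sizes are all invented for illustration. It only shows how consecutive token positions of one sequence can map to physical blocks that are not adjacent in memory.

```python
class ToyPagedKVCache:
    """Toy block table: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}  # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a slot for one more token and return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # The last block is full (or the sequence is new): grab any free block
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=4, block_size=2)
cache.append_token("a")  # sequence a gets block 0
cache.append_token("a")
cache.append_token("b")  # sequence b gets block 1
cache.append_token("a")  # sequence a needs a new block and gets block 2
print(cache.block_tables)  # {'a': [0, 2], 'b': [1]}
```

Sequence `a` ends up in blocks 0 and 2, so its cache is logically continuous but physically non-contiguous, and memory is only reserved one block at a time.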
## How vLLM Works (Under the Hood) 🔍
### 🤔 What exactly happens when you run `trl vllm-serve --model <model_name>`?
When you run for example
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 1 --data-parallel-size 4
```
the following happens:
![vllm](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/vllm-doc.png)
1. vLLM first spawns multiple workers to handle incoming requests in parallel. The number of workers is determined by multiplying the `--tensor-parallel-size` and `--data-parallel-size` values. In this example, it spawns 4 workers (1 × 4).
Each worker operates independently and processes a chunk of the incoming requests — which are basically the prompts sent to the server for generation. A key point to understand is that these 4 workers are running in parallel, and each one is responsible for handling a subset of the total incoming load.
2. Once the incoming requests (prompts) are distributed across the workers, the model starts generating completions. Internally, the model's weights are split across multiple GPUs based on the `--tensor-parallel-size` argument; this is how tensor parallelism is handled. Meanwhile, data parallelism (controlled by `--data-parallel-size`) ensures that different sets of requests are processed independently across the workers. In short: tensor parallelism splits the model across GPUs, and data parallelism splits the batch of requests across different model replicas.
3. Although the GPUs process requests independently and in parallel, they still need to communicate with each other. Remember that each GPU handles only a slice of the incoming prompts (for example, with 4 GPUs and 8 prompts using `--data-parallel-size=4`, each GPU processes 2 prompts).
This GPU-to-GPU communication is managed efficiently by NVIDIA's NCCL library. The communication mainly ensures that each GPU gets its correct portion of the incoming requests; it's lightweight and doesn't interfere with generation itself.
Separately, the number of completions to generate per prompt is controlled by the `num_generations` setting in the GRPO config. For instance, if you set `num_generations=2` (like in the picture above), each prompt will have 2 completions. So, with 8 prompts and `num_generations=2`, you would end up with 16 completions total — regardless of the number of GPUs or parallelism settings.
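The numbers in the walkthrough above follow from simple arithmetic, sketched below. The helper `generation_layout` is hypothetical and only restates the relationships described in the text: workers = tensor parallel size × data parallel size, prompts are split across data-parallel replicas, and the total number of completions depends only on the number of prompts and `num_generations`.

```python
def generation_layout(num_prompts, tensor_parallel_size, data_parallel_size, num_generations):
    """Summarize how a batch of prompts maps onto the vLLM server setup."""
    return {
        "workers": tensor_parallel_size * data_parallel_size,
        # Ceiling division: prompts handled by the busiest model replica
        "prompts_per_replica": -(-num_prompts // data_parallel_size),
        "total_completions": num_prompts * num_generations,
    }


# 8 prompts, --tensor-parallel-size 1, --data-parallel-size 4, num_generations=2
print(generation_layout(8, 1, 4, 2))
# {'workers': 4, 'prompts_per_replica': 2, 'total_completions': 16}
```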
### 🥸 More detail on what happens under the hood when running the server
- The vLLM server starts by running the command: `trl vllm-serve --model Qwen/Qwen2.5-7B`.
- Once the server is running, the client (trainer) sends generation requests to it with `vllm_client.generate` (see [these lines](https://github.com/huggingface/trl/blob/cc044e35b285be7dc062764b3364e1e684db4c7c/trl/trainer/grpo_trainer.py#L1025-L1035)).
- The server generates the completions and returns them to the client (trainer).
- These completions are used to compute the reward signal.
- Based on the reward signal and the model's output, the loss is computed, and the backward pass is performed to update the model's weights.
- **Note**: The server only handles completion generation; it doesn't train the model. Therefore, the model's weights aren't updated on the server. Once the backward pass is complete, the client sends the updated weights to the server using `vllm_client.update_named_param(name, param.data)`.
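The order of operations in the list above can be sketched with a stand-in client. Everything here is a mock for illustration: `FakeVLLMClient` only borrows the method names `generate` and `update_named_param` mentioned above, its signatures are assumptions, and the "weight update" is a placeholder for the real loss and backward pass.

```python
class FakeVLLMClient:
    """Stand-in for the real client; keeps the server-side weights in a dict."""

    def __init__(self):
        self.server_weights = {}

    def generate(self, prompts, n):
        # Pretend the server returns n completions per prompt
        return [f"{prompt} [completion {i}]" for prompt in prompts for i in range(n)]

    def update_named_param(self, name, value):
        self.server_weights[name] = value


def training_step(client, prompts, weights, reward_fn, num_generations=2):
    completions = client.generate(prompts, n=num_generations)  # 1. request completions
    rewards = reward_fn(completions)  # 2. compute the reward signal
    mean_reward = sum(rewards) / len(rewards)
    # 3. placeholder for loss computation and backward pass on the trainer side
    updated = {name: value + 0.001 * mean_reward for name, value in weights.items()}
    # 4. push the updated weights so the server generates with the new policy
    for name, value in updated.items():
        client.update_named_param(name, value)
    return updated, completions


client = FakeVLLMClient()
weights = {"layer.weight": 0.0}
weights, completions = training_step(
    client, ["prompt 1", "prompt 2"], weights, lambda cs: [len(set(c)) for c in cs]
)
print(len(completions), client.server_weights == weights)  # 4 True
```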
When using vLLM, ensure the GPUs assigned for training and generation are separate to avoid NCCL communication conflicts. If you do not set the `CUDA_VISIBLE_DEVICES` environment variable, the training script will use all available GPUs by default, which may lead to device conflicts. Starting with the first TRL release after v0.19.1, the code automatically detects and prevents same-device usage, raising an error in the vLLM server process:
```log
RuntimeError: Attempting to use the same CUDA device for multiple distinct roles/ranks within the same communicator.
Ensure that trainer is using different devices than vLLM server.
```
For example, if you want to use GPUs 4-7 for training while the server runs on GPUs 0-3, set:
```sh
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
```
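A launcher script can catch an overlapping device assignment before either process starts. The check below is a minimal sketch; the helper names are hypothetical, and it assumes `CUDA_VISIBLE_DEVICES` holds comma-separated integer indices (it does not handle GPU UUIDs).

```python
def parse_visible_devices(value):
    """Turn a string such as '0,1,2,3' into a set of integer device indices."""
    return {int(item) for item in value.split(",") if item.strip()}


def check_disjoint_devices(server_devices, trainer_devices):
    """Raise if the vLLM server and the trainer would share a CUDA device."""
    overlap = parse_visible_devices(server_devices) & parse_visible_devices(trainer_devices)
    if overlap:
        raise ValueError(f"Server and trainer share devices: {sorted(overlap)}")
    return True


print(check_disjoint_devices("0,1,2,3", "4,5,6,7"))  # True
```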
## Advanced usage
### 🍷 More customization options with vLLM?
You can customize the server configuration by passing additional arguments.
```txt
$ trl vllm-serve --help
usage: trl vllm-serve [-h] --model MODEL [--revision REVISION] [--tensor_parallel_size TENSOR_PARALLEL_SIZE] [--data_parallel_size DATA_PARALLEL_SIZE] [--host HOST]
[--port PORT] [--gpu_memory_utilization GPU_MEMORY_UTILIZATION] [--dtype DTYPE] [--max_model_len MAX_MODEL_LEN]
[--enable_prefix_caching ENABLE_PREFIX_CACHING] [--enforce_eager [ENFORCE_EAGER]] [--kv_cache_dtype KV_CACHE_DTYPE]
[--trust_remote_code [TRUST_REMOTE_CODE]] [--log_level LOG_LEVEL] [--vllm_model_impl VLLM_MODEL_IMPL]
options:
-h, --help show this help message and exit
--model MODEL Model name or path to load the model from. (default: None)
--revision REVISION Revision to use for the model. If not specified, the default branch will be used. (default: None)
--tensor_parallel_size TENSOR_PARALLEL_SIZE, --tensor-parallel-size TENSOR_PARALLEL_SIZE
Number of tensor parallel workers to use. (default: 1)
--data_parallel_size DATA_PARALLEL_SIZE, --data-parallel-size DATA_PARALLEL_SIZE
Number of data parallel workers to use. (default: 1)
--host HOST Host address to run the server on. (default: 0.0.0.0)
--port PORT Port to run the server on. (default: 8000)
--gpu_memory_utilization GPU_MEMORY_UTILIZATION, --gpu-memory-utilization GPU_MEMORY_UTILIZATION
Ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache on the device dedicated to generation
powered by vLLM. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high,
it may cause out-of-memory (OOM) errors during initialization. (default: 0.9)
--dtype DTYPE Data type to use for vLLM generation. If set to 'auto', the data type will be automatically determined based on the model configuration.
Find the supported values in the vLLM documentation. (default: auto)
--max_model_len MAX_MODEL_LEN, --max-model-len MAX_MODEL_LEN
If set, the `max_model_len` to use for vLLM. This can be useful when running with reduced `vllm_gpu_memory_utilization`, leading to a
reduced KV cache size. If not set, vLLM will use the model context size, which might be much larger than the KV cache, leading to
inefficiencies. (default: None)
--enable_prefix_caching ENABLE_PREFIX_CACHING, --enable-prefix-caching ENABLE_PREFIX_CACHING
Whether to enable prefix caching in vLLM. If set to `True`, ensure that the model and the hardware support this feature. (default: None)
--enforce_eager [ENFORCE_EAGER], --enforce-eager [ENFORCE_EAGER]
Whether to enforce eager execution. If set to `True`, we will disable CUDA graph and always execute the model in eager mode. If `False`
(default behavior), we will use CUDA graph and eager execution in hybrid. (default: False)
--kv_cache_dtype KV_CACHE_DTYPE, --kv-cache-dtype KV_CACHE_DTYPE
Data type to use for KV cache. If set to 'auto', the dtype will default to the model data type. (default: auto)
--trust_remote_code [TRUST_REMOTE_CODE], --trust-remote-code [TRUST_REMOTE_CODE]
Whether to trust remote code when loading models. Set to True to allow executing code from model repositories. This is required for some
custom models but introduces security risks. (default: False)
--log_level LOG_LEVEL, --log-level LOG_LEVEL
Log level for uvicorn. Possible choices: 'critical', 'error', 'warning', 'info', 'debug', 'trace'. (default: info)
--vllm_model_impl VLLM_MODEL_IMPL, --vllm-model-impl VLLM_MODEL_IMPL
Model implementation to use for vLLM. Must be one of `transformers` or `vllm`. `transformers`: Use the `transformers` backend for model
implementation. `vllm`: Use the `vllm` library for model implementation. (default: vllm)
```
### 💆🏻‍♀️ What's the best distributed setup?
![tp dp throughput 8 gpus](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tp_dp_throughput_8_gpus.png)
![tp dp throughput 4 gpus](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tp_dp_throughput_4_gpus.png)
First and foremost, always remember that the optimal setup depends on:
- The model size
- The number of GPUs you have
- The GPU memory size
- The batch size you are using
- The number of requests you are sending to the server (prompts)
- The `max_model_len` you are using (this is the max length of the input sequence that the model can process, a.k.a. the context window size)
- The number of completions you are generating for each request (`num_generations`)
Given these factors, our experiments on the Qwen model family (3B, 7B, 14B, 32B) using 8 H100 GPUs show that:
- For reasonable-sized models (3B-14B) and a moderate context window (`max_len < 8k`), using full capacity for data parallelism gives better throughput. The setup `(tp=1, dp=8)` yields the best results.
- For larger models (32B) and longer context windows (`max_len > 8k`), a smaller DP size combined with some model-side parallelism performs better. For example, `(tp=2, dp=4)` is a good setup for 32B models with a larger context window.
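When deciding which setups to benchmark, it helps to list every `(tp, dp)` pair that uses all available GPUs. The sketch below simply enumerates the factor pairs of the GPU count; `parallel_layouts` is a hypothetical helper, and it does not check any model-specific constraint on the tensor parallel size.

```python
def parallel_layouts(num_gpus):
    """List all (tensor_parallel_size, data_parallel_size) pairs with tp * dp == num_gpus."""
    return [(tp, num_gpus // tp) for tp in range(1, num_gpus + 1) if num_gpus % tp == 0]


print(parallel_layouts(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Both setups mentioned above, `(tp=1, dp=8)` and `(tp=2, dp=4)`, appear in this list for 8 GPUs.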
### vLLM with Transformers Backend
vLLM can use the **Transformers backend** for model implementations, which works for both LLMs and VLMs.
To enable this, set `vllm_model_impl="transformers"` in your configuration or pass it via the command-line argument.
For more details, check out [vLLM Transformers Backend](https://blog.vllm.ai/2025/04/11/transformers-backend.html).
Example:
```sh
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-VL-3B-Instruct --tensor-parallel-size 1 --port 8000 --enforce_eager --vllm_model_impl transformers
```
### Modes of Using vLLM During Training
TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**.
#### Server Mode
In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
This setup is ideal if you have GPUs dedicated to inference.
Example configuration:
<hfoptions id="vllm examples">
<hfoption id="GRPO">
```python
from trl import GRPOConfig
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
)
```
</hfoption>
<hfoption id="OnlineDPO">
```python
from trl import OnlineDPOConfig
training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
)
```
</hfoption>
<hfoption id="NashMD">
```python
from trl import NashMDConfig
training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
)
```
</hfoption>
<hfoption id="XPO">
```python
from trl import XPOConfig
training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
)
```
</hfoption>
<hfoption id="RLOO">
```python
from trl import RLOOConfig
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
)
```
</hfoption>
</hfoptions>
#### Colocate Mode
In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
Example configuration:
<hfoptions id="vllm examples">
<hfoption id="GRPO">
```python
from trl import GRPOConfig
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```
</hfoption>
<hfoption id="OnlineDPO">
```python
from trl import OnlineDPOConfig
training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```
</hfoption>
<hfoption id="NashMD">
```python
from trl import NashMDConfig
training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```
</hfoption>
<hfoption id="XPO">
```python
from trl import XPOConfig
training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```
</hfoption>
<hfoption id="RLOO">
```python
from trl import RLOOConfig
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```
</hfoption>
</hfoptions>
> [!WARNING]
> Check the documentation of the trainer you are using for specific details on vLLM usage and parameters.
> [!WARNING]
> To reduce GPU memory usage when running vLLM, consider [enabling vLLM sleep mode](reducing_memory_usage#vllm-sleep-mode).

160
docs/source/xpo_trainer.md Normal file

@ -0,0 +1,160 @@
# XPO Trainer
[![model badge](https://img.shields.io/badge/All_models-XPO-blue)](https://huggingface.co/models?other=xpo,trl)
## Overview
Exploratory Preference Optimization (XPO) was proposed in the paper [Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF](https://huggingface.co/papers/2405.21046) by Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, [Corby Rosset](https://huggingface.co/corbyrosset), [Ahmed Awadallah](https://huggingface.co/AhmedAwadallah), and Alexander Rakhlin. It is a simple online preference tuning method based on the DPO loss together with a reward model (RM). XPO augments the DPO objective with an exploration bonus allowing the method to explore outside the support of the initial model and human feedback data.
The abstract from the paper is the following:
> Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical -- a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) -- yet enjoys the strongest known provable guarantees and promising empirical performance. XPO augments the DPO objective with a novel and principled exploration bonus, empowering the algorithm to explore outside the support of the initial model and human feedback data. In theory, we show that XPO is provably sample-efficient and converges to a near-optimal language model policy under natural exploration conditions, irrespective of whether the initial model has good coverage. Our analysis, which builds on the observation that DPO implicitly performs a form of Q*-approximation (or, Bellman error minimization), combines previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the perspective of KL-regularized Markov decision processes. Empirically, we find that XPO is more sample-efficient than non-exploratory DPO variants in a preliminary evaluation.
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Quentin Gallouédec](https://huggingface.co/qgallouedec) and [Lewis Tunstall](https://huggingface.co/lewtun).
## Quick start
This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
<iframe
src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model:
```python
# train_xpo.py
from datasets import load_dataset
from trl import PairRMJudge, XPOConfig, XPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
training_args = XPOConfig(output_dir="Qwen2-0.5B-XPO")
trainer = XPOTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
```
Execute the script using the following command:
```bash
accelerate launch train_xpo.py
```
Distributed across 8 GPUs, the training takes approximately 1 hour.
To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-XPO) performs, you can use the [Transformers Chat CLI](https://huggingface.co/docs/transformers/quicktour#chat-with-text-generation-models).
<pre><code>$ transformers chat trl-lib/Qwen2-0.5B-XPO
<strong><span style="color: red;">&lt;quentin_gallouedec&gt;:</span></strong>
What is the best programming language?
<strong><span style="color: blue;">&lt;trl-lib/Qwen2-0.5B-XPO&gt;:</span></strong>
The best programming language depends on individual preferences and familiarity with coding concepts. Some popular languages include Python, Java, C++, and JavaScript.
</code></pre>
## Expected dataset type
XPO requires a [prompt-only dataset](dataset_formats#prompt-only). The [`XPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
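As a rough illustration of the two accepted layouts, the sketch below builds one prompt-only example in each format and tells them apart. The helper `is_conversational_prompt` and the sample prompts are hypothetical and are not part of TRL; they only show the shape of the data.

```python
# Hypothetical prompt-only examples; the prompt text is made up for illustration.
standard_example = {"prompt": "The sky is"}
conversational_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}


def is_conversational_prompt(example: dict) -> bool:
    """Return True if the prompt is a list of {"role", "content"} messages."""
    prompt = example["prompt"]
    return isinstance(prompt, list) and all(
        isinstance(message, dict) and {"role", "content"} <= set(message)
        for message in prompt
    )


print(is_conversational_prompt(standard_example))
print(is_conversational_prompt(conversational_example))
```

Only the conversational layout carries role information, which is why the chat template is applied to it and not to the standard layout.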
## Usage tips
### Use a reward model
Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
```diff
- from trl import PairRMJudge
+ from transformers import AutoModelForSequenceClassification
- judge = PairRMJudge()
+ reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
  trainer = XPOTrainer(
      ...
-     judge=judge,
+     reward_funcs=reward_model,
  )
```
> [!WARNING]
> Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training.
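One way to catch a mismatch early is to compare the two tokenizers before training. The sketch below is a minimal check, assuming both objects expose a `chat_template` attribute and a `get_vocab()` method, as Transformers tokenizers do. The function name `same_chat_setup` is hypothetical, and the stand-in objects only keep the example runnable without downloading models.

```python
from types import SimpleNamespace


def same_chat_setup(policy_tokenizer, reward_tokenizer) -> bool:
    """Check that two tokenizers share a chat template and a vocabulary."""
    same_template = policy_tokenizer.chat_template == reward_tokenizer.chat_template
    same_vocab = policy_tokenizer.get_vocab() == reward_tokenizer.get_vocab()
    return same_template and same_vocab


# Stand-ins for real tokenizers (hypothetical values, for illustration only).
policy_tok = SimpleNamespace(chat_template="{{ messages }}", get_vocab=lambda: {"a": 0, "b": 1})
reward_tok = SimpleNamespace(chat_template="{{ messages }}", get_vocab=lambda: {"a": 0, "b": 1})
other_tok = SimpleNamespace(chat_template="<|im_start|>", get_vocab=lambda: {"a": 0, "b": 1})

print(same_chat_setup(policy_tok, reward_tok))
print(same_chat_setup(policy_tok, other_tok))
```

With real models, the two arguments would be the tokenizers loaded for the SFT model and for the reward model.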
### Encourage EOS token generation
When using a reward model, we may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`XPOConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`XPOConfig`]:
```python
training_args = XPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
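To make the effect of the penalty concrete, here is a minimal sketch in plain Python, assuming the penalty is simply subtracted from the score of every completion that does not contain the EOS token. The function name, token IDs, and scores are hypothetical; the trainer's actual implementation operates on tensors and may differ in detail.

```python
def apply_missing_eos_penalty(scores, completions, eos_token_id, penalty):
    """Subtract `penalty` from the score of every completion lacking EOS."""
    return [
        score if eos_token_id in tokens else score - penalty
        for score, tokens in zip(scores, completions)
    ]


# Two completions as token-id lists; id 2 is assumed to be the EOS token here.
scores = [0.8, 0.5]
completions = [[11, 42, 2], [11, 42, 7]]

penalized = apply_missing_eos_penalty(scores, completions, eos_token_id=2, penalty=1.0)
print(penalized)  # [0.8, -0.5]
```

The first completion ends with EOS and keeps its score; the second was cut off at the length limit and is penalized.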
### Logging Completions
To better understand your model's behavior during training, you can log sample completions periodically using the [`LogCompletionsCallback`].
```python
from trl import LogCompletionsCallback
trainer = XPOTrainer(..., eval_dataset=eval_dataset)
completions_callback = LogCompletionsCallback(trainer, num_prompts=8)
trainer.add_callback(completions_callback)
```
This callback logs the model's generated completions directly to Weights & Biases.
![Logged Completions](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/wandb_completions.png)
## Example script
We provide an example script to train a model using the XPO method. The script is available in [`examples/scripts/xpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/xpo.py).
To test the XPO script with the [Qwen2.5 0.5B model](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), run the following command:
```bash
python examples/scripts/xpo.py \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--judge pair_rm \
--dataset_name trl-lib/ultrafeedback-prompt \
--learning_rate 5.0e-7 \
--output_dir Qwen2.5-0.5B-XPO-PairRM \
--warmup_ratio 0.1 \
--push_to_hub
```
## Logged metrics
While training and evaluating, we record the following metrics:
* `loss/xpo`: The mean XPO part of the full loss.
* `loss/dpo`: The mean DPO part of the full loss.
* `objective/kl`: The mean KL divergence between the model and reference data.
* `objective/entropy`: The mean entropy of the model and reference data.
* `objective/model_scores`: The mean scores (according to the reward model) of the model completions.
* `objective/ref_scores`: The mean scores (according to the reward model) of the reference completions.
* `objective/scores_margin`: The mean score margin (according to the external reward model) between the chosen and rejected completions.
* `rewards/chosen`: The mean reward (according to XPO's DPO implicit reward model) of the chosen completions.
* `rewards/rejected`: The mean reward (according to XPO's DPO implicit reward model) of the rejected completions.
* `rewards/accuracies`: The accuracy of XPO's DPO implicit reward model.
* `rewards/margins`: The mean reward margin (according to XPO's DPO implicit reward model) between the chosen and rejected completions.
* `logps/chosen`: The mean log probabilities of the chosen completions.
* `logps/rejected`: The mean log probabilities of the rejected completions.
* `val/model_contain_eos_token`: The number of times the model's output contains the EOS token.
* `val/ref_contain_eos_token`: The number of times the reference's output contains the EOS token.
* `alpha`: The weight of the XPO loss term. Typically fixed, but can be made dynamic by passing a list to [`XPOConfig`].
* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but can be made dynamic by passing a list to [`XPOConfig`].
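The `rewards/*` metrics above can be illustrated with a short sketch. It assumes the standard DPO implicit reward, `beta * (log pi(y|x) - log pi_ref(y|x))`, computed from summed log probabilities of each completion. The function name and the numbers are hypothetical, and the trainer's own computation works on tensors rather than Python lists.

```python
def implicit_reward_metrics(
    policy_logps_chosen, policy_logps_rejected, ref_logps_chosen, ref_logps_rejected, beta
):
    """Compute DPO-style implicit rewards and summary metrics for a batch."""
    # Implicit reward: beta times the log-ratio between policy and reference.
    chosen = [beta * (p - r) for p, r in zip(policy_logps_chosen, ref_logps_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_logps_rejected, ref_logps_rejected)]
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(margins)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(margins) / n,
        # Fraction of pairs where the chosen completion gets the higher reward.
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }


metrics = implicit_reward_metrics(
    policy_logps_chosen=[-10.0, -12.0],
    policy_logps_rejected=[-14.0, -9.0],
    ref_logps_chosen=[-11.0, -12.0],
    ref_logps_rejected=[-13.0, -10.0],
    beta=0.5,
)
print(metrics)
```

A rising `rewards/margins` together with `rewards/accuracies` above 0.5 indicates that the policy assigns relatively more probability to chosen completions than the reference does.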
## XPOTrainer
[[autodoc]] XPOTrainer
- train
- save_model
- push_to_hub
## XPOConfig
[[autodoc]] XPOConfig


@ -1,66 +1,3 @@
# Sentiment Examples
# Examples
The notebooks and scripts in these examples show how to fine-tune a model with a sentiment classifier (such as `lvwerra/distilbert-imdb`).
Here's an overview of the notebooks and scripts:
| File | Description |
|---|---|
| `notebooks/gpt2-sentiment.ipynb` | Fine-tune GPT2 to generate positive movie reviews. |
| `notebooks/gpt2-sentiment-control.ipynb` | Fine-tune GPT2 to generate movie reviews with controlled sentiment. |
| `scripts/gpt2-sentiment.py` | Same as the notebook, but easier to use in a multi-GPU setup. |
| `scripts/t5-sentiment.py` | Same as GPT2 script, but for a Seq2Seq model (T5). |
## Installation
```bash
pip install trl
# optional: wandb
pip install wandb
```
Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
## Launch scripts
The `trl` library is powered by `accelerate`. As such it is best to configure and launch trainings with the following commands:
```bash
accelerate config # will prompt you to define the training configuration
accelerate launch scripts/gpt2-sentiment.py # launches training
```
# Summarization Example
The script in this example shows how to train a reward model for summarization, following the OpenAI Learning to Summarize from Human Feedback [paper](https://arxiv.org/abs/2009.01325). We've validated that the script can be used to train a small GPT2 to get slightly over 60% validation accuracy, which is aligned with results from the paper. The model is [here](https://huggingface.co/Tristan/gpt2_reward_summarization).
Here's an overview of the files:
| File | Description |
|---|---|
| `scripts/reward_summarization.py` | For tuning the reward model. |
| `scripts/ds3_reward_summarization_example_config.json` | Can be used with the reward model script to scale it up to arbitrarily big models that don't fit on a single GPU. |
## Installation
```bash
pip install trl
pip install evaluate
# optional: deepspeed
pip install deepspeed
```
```bash
# If you want your reward model to follow the Learning to Summarize from Human Feedback paper closely, then tune a GPT model on summarization and then instantiate the reward model
# with it. In other words, pass in the name of your summarization-finetuned gpt on the hub, instead of the name of the pretrained gpt2 like we do in the following examples of how
# to run this script.
# Example of running this script with the small size gpt2 on a 40GB A100 (A100's support bf16). Here, the global batch size will be 64:
python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16
# Example of running this script with the xl size gpt2 on 16 40GB A100's. Here the global batch size will still be 64:
python -m torch.distributed.launch --nproc_per_node=16 reward_summarization.py --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --gradient_accumulation_steps=4 --gpt_model_name=gpt2-xl --bf16 --deepspeed=ds3_reward_summarization_example_config.json
```
Please check out https://huggingface.co/docs/trl/example_overview for documentation on our examples.


@ -0,0 +1,30 @@
# Context Parallelism with FSDP for 2 GPUs
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true # Enable activation checkpointing for memory efficiency
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2 # Number of GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2 # Context parallel size


@ -0,0 +1,20 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


@ -0,0 +1,21 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


@ -0,0 +1,22 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


@ -0,0 +1,28 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_version: 1
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


@ -0,0 +1,25 @@
# Requires accelerate 1.7.0 or higher
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


@ -0,0 +1,16 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


@ -0,0 +1,16 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: "NO"
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false


@ -0,0 +1,18 @@
# This is an example configuration file of TRL CLI, you can use it for
# SFT like that: `trl sft --config config.yaml --output_dir test-sft`
# The YAML file supports environment variables by adding an `env` field
# as below
# env:
#     CUDA_VISIBLE_DEVICES: 0
model_name_or_path:
    Qwen/Qwen2.5-0.5B
dataset_name:
    stanfordnlp/imdb
report_to:
    none
learning_rate:
    0.0001
lr_scheduler_type:
    cosine


@ -0,0 +1,133 @@
# Copyright 2020-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from dataclasses import dataclass, field
from typing import Optional
from datasets import load_dataset
from huggingface_hub import ModelCard
from transformers import HfArgumentParser
@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/hh-rlhf-helpful-base"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`int`, *optional*):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = field(
        default=False,
        metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
    )
    repo_id: str = field(
        default="trl-lib/hh-rlhf-helpful-base", metadata={"help": "Hugging Face repository ID to push the dataset to."}
    )
    dataset_num_proc: Optional[int] = field(
        default=None, metadata={"help": "Number of workers to use for dataset processing."}
    )
def common_start(str1: str, str2: str) -> str:
    # Zip the two strings and iterate over them together
    common_chars = []
    for c1, c2 in zip(str1, str2):
        if c1 == c2:
            common_chars.append(c1)
        else:
            break
    # Join the common characters and return as a string
    return "".join(common_chars)
def extract_dialogue(example: dict[str, str]) -> dict[str, list[dict[str, str]]]:
    # Extract the prompt, which corresponds to the common start of the chosen and rejected dialogues
    prompt_text = common_start(example["chosen"], example["rejected"])
    # The chosen and rejected may share a common start, so we need to remove the common part
    if not prompt_text.endswith("\n\nAssistant: "):
        prompt_text = prompt_text[: prompt_text.rfind("\n\nAssistant: ")] + "\n\nAssistant: "
    # Extract the chosen and rejected lines
    chosen_line = example["chosen"][len(prompt_text) :]
    rejected_line = example["rejected"][len(prompt_text) :]
    # Remove the generation prompt ("\n\nAssistant: ") from the prompt
    prompt_text = prompt_text[: -len("\n\nAssistant: ")]
    # Split the string at every occurrence of "Human: " or "Assistant: "
    prompt_lines = re.split(r"(\n\nAssistant: |\n\nHuman: )", prompt_text)
    # Remove the first element as it's empty
    prompt_lines = prompt_lines[1:]
    prompt = []
    for idx in range(0, len(prompt_lines), 2):
        role = "user" if prompt_lines[idx] == "\n\nHuman: " else "assistant"
        content = prompt_lines[idx + 1]
        prompt.append({"role": role, "content": content})
    # Remove the prompt from the chosen and rejected dialogues
    chosen = [{"role": "assistant", "content": chosen_line}]
    rejected = [{"role": "assistant", "content": rejected_line}]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
model_card = ModelCard("""
---
tags: [trl]
---
# HH-RLHF-Helpful-Base Dataset
## Summary
The HH-RLHF-Helpful-Base dataset is a processed version of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset, specifically curated to train models using the [TRL library](https://github.com/huggingface/trl) for preference learning and alignment tasks. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the helpfulness of the responses. This dataset enables models to learn human preferences in generating helpful responses, enhancing their ability to assist users effectively.
## Data Structure
- **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)
Columns:
- `"prompt"`: The user query.
- `"chosen"`: A response deemed helpful by human evaluators.
- `"rejected"`: A response considered less helpful or unhelpful.
This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in helpfulness.
## Generation script
The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/hh-rlhf-helpful-base.py).
""")
if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]
    dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")
    dataset = dataset.map(extract_dialogue, num_proc=script_args.dataset_num_proc)
    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")

Some files were not shown because too many files have changed in this diff.