* Refactor reward processing in OnlineDPOTrainer
* Refactor completion decoding and reward processing
* remove strip
* remove warning
* Add reward_tokenizer to training script
* Add reward_tokenizer and reward_processing_class to OnlineDPOTrainer test
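  A rough sketch of the wiring the two entries above describe: passing a separate tokenizer for the reward model into the trainer. Checkpoint and dataset names are placeholders, and the keyword names (`processing_class`, `reward_processing_class`) may differ between TRL versions:

  ```python
  from datasets import load_dataset
  from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
  from trl import OnlineDPOConfig, OnlineDPOTrainer

  model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder policy model
  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
  reward_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # may differ from the policy tokenizer

  trainer = OnlineDPOTrainer(
      model=model,
      reward_model=reward_model,
      args=OnlineDPOConfig(output_dir="online-dpo"),
      train_dataset=load_dataset("trl-lib/ultrafeedback-prompt", split="train"),
      processing_class=tokenizer,
      reward_processing_class=reward_tokenizer,
  )
  ```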
* propagate to xpo and nash
* style
* reduce memory requirement with inference_mode
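  A minimal sketch of the memory optimization above: scoring completions under `torch.inference_mode()` so no autograd state is kept. The helper name is illustrative, not the trainer's actual method:

  ```python
  import torch

  def score_completions(reward_model, input_ids, attention_mask):
      # inference_mode() skips autograd bookkeeping entirely, so no activations
      # are retained and peak memory drops during reward scoring.
      with torch.inference_mode():
          logits = reward_model(input_ids=input_ids, attention_mask=attention_mask).logits
      # For a sequence-classification reward model with a single label,
      # the scalar reward is the lone logit per sequence.
      return logits.squeeze(-1)
  ```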
* fix tests
* PairRM judge via llm-blender

* setUpClass(cls)
* Add setUpClass method to TestJudges class
* truncation left for reward tokenizer
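  The left-truncation setting above matters because the reward model must always see the completion it is scoring. A sketch, with a stand-in checkpoint name:

  ```python
  from transformers import AutoTokenizer

  reward_tokenizer = AutoTokenizer.from_pretrained("gpt2")
  reward_tokenizer.pad_token = reward_tokenizer.eos_token
  # With truncation_side="left", overly long prompt+completion pairs lose the
  # oldest prompt tokens first, so the completion being scored is never cut off.
  reward_tokenizer.truncation_side = "left"

  texts = ["<a very long prompt> ... <the generated completion>"]
  batch = reward_tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
  ```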
* don't log completions without eval dataset
* only eval when possible
* use the pairwise judges
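  For reference, the pairwise-judge path replaces explicit reward scores with preference decisions. A sketch of how `PairRMJudge` is typically called (requires the optional `llm-blender` dependency; exact behaviour may vary across TRL versions):

  ```python
  from trl import PairRMJudge

  judge = PairRMJudge()  # wraps llm-blender's PairRM ranker
  prompts = ["What is the capital of France?"]
  completions = [["Paris is the capital of France.", "I am not sure."]]
  # For each prompt, the judge returns the index (0 or 1) of the preferred completion.
  best_indices = judge.judge(prompts, completions)
  ```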
* add test
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* decode and skip special tokens
* initial nash
* return tensors
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* add back the logging
* use batch_decode
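  A small sketch of the decoding change above: one `batch_decode` call over the whole batch, with special tokens stripped, instead of decoding sequences in a Python loop. The token ids are arbitrary toy values:

  ```python
  import torch
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the policy tokenizer
  completion_ids = torch.tensor([[464, 3139, 286, 4881, 318, 6342, 13, 50256]])  # toy ids; 50256 is GPT-2's EOS
  # skip_special_tokens drops EOS/padding so the judge or reward model sees clean text.
  completions = tokenizer.batch_decode(completion_ids, skip_special_tokens=True)
  ```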
* add judges api to XPO trainer
* Update tests/test_online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* judge in examples
* judge in config
* add back logs when using reward model
* typo
* add back model_scores logging when using reward model
* log scores for reward model only
* better condition on what to log
* same for rlhf reward
* Update trl/trainer/online_dpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* use decode_and_strip_padding
* error if both reward and judge or none are set
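  The check above makes the reward source unambiguous. A hedged sketch of the validation, with argument names mirroring the trainer's `reward_model` / `judge`:

  ```python
  def validate_reward_source(reward_model=None, judge=None):
      # Exactly one of the two must be provided: a scalar reward model or a pairwise judge.
      if reward_model is not None and judge is not None:
          raise ValueError("Provide either `reward_model` or `judge`, not both.")
      if reward_model is None and judge is None:
          raise ValueError("Either `reward_model` or `judge` must be provided.")
  ```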
* remove unused check
* Uniform way to pass conversation into judge
* heading -> leading
* LogCompletionsCallback compat with online method
* Update Online DPO doc
* check if data is conversational for judges
* update example
* remove comment
* use zip
* fix stats xpo
* Replace judge with PairRMJudge and import AutoModelForSequenceClassification
* update xpo documentation
* Remove doc duplication
* update nash doc
* XPO trl chat
* nash md doc
* HfPairwiseJudge
---------
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
* `DPOScriptArguments` to `ScriptArguments`
* use dataset_train_split
* Use scriptarguments
* dataset names in command lines
* use `ScriptArguments` everywhere
* ignore bias buffer to end
* remove in v0.13
* rm comment
* update test commands
* Update docs/source/rloo_trainer.md
* Update tests/test_rloo_trainer.py
* Added dataset_train_split argument to ppo.py and rloo.py
* update scripts with dataset_train_split
* initial xpo trainer
* compute rewards and ref log probs in smaller batches
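  A generic sketch of the sub-batching above (not the trainer's exact code): running the reward and reference forward passes chunk by chunk, so peak memory scales with the chunk size rather than the full generation batch:

  ```python
  import torch

  def forward_logits_in_chunks(model, input_ids, attention_mask, chunk_size=8):
      outputs = []
      for start in range(0, input_ids.size(0), chunk_size):
          with torch.no_grad():
              chunk_logits = model(
                  input_ids=input_ids[start : start + chunk_size],
                  attention_mask=attention_mask[start : start + chunk_size],
              ).logits
          outputs.append(chunk_logits)
      return torch.cat(outputs, dim=0)
  ```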
* add logging
* initial log docs
* fix global_step increment
* fix metric descriptions
* use messages API
* use training_step API
* fix logs
* add test
* add back max_new_tokens
* use max_new_tokens
* refactor
* top_k is an int
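  The two generation fixes above (bounding by `max_new_tokens` and passing `top_k` as an integer) map onto a standard `transformers` generation config, for example:

  ```python
  from transformers import GenerationConfig

  generation_config = GenerationConfig(
      max_new_tokens=64,   # bound the completion length, independent of prompt length
      do_sample=True,
      temperature=0.9,
      top_k=50,            # must be an int, not a float
  )
  ```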
* fix formatting
* fix the loss
* fix logging
* fix logging
* fix logging
* fix loss
* calculate pi_log_ratio once
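  A deliberately generic illustration of the optimization above, not the actual XPO loss: the policy/reference log-ratio is computed once and reused by every loss term instead of being recomputed:

  ```python
  import torch
  import torch.nn.functional as F

  policy_logprobs = torch.randn(4)  # toy per-sample log-probs
  ref_logprobs = torch.randn(4)

  pi_log_ratio = policy_logprobs - ref_logprobs   # computed a single time
  preference_term = -F.logsigmoid(pi_log_ratio)   # reused here ...
  regularization_term = pi_log_ratio              # ... and here
  loss = (preference_term + 0.1 * regularization_term).mean()
  ```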
* fix stats
* fix loss
* do not log loss again
* fix docs
* add disable_dropout_in_model via flag
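  A sketch of the flag above, assuming the helper is the `disable_dropout_in_model` utility used by the other TRL trainers and that the config field is named `disable_dropout`:

  ```python
  import torch.nn as nn
  from trl.trainer.utils import disable_dropout_in_model

  model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.1))  # toy stand-in for the policy

  disable_dropout = True  # the new config flag (name assumed)
  if disable_dropout:
      # Sets p=0 on every nn.Dropout module, making train-mode forward passes deterministic.
      disable_dropout_in_model(model)
  ```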
* comments
* revert doc change
* rm empty cache in online dpo
* improve doc xpo config
* some comment
* fix logging stats
* fix docs
* save the model
* fix model and reward model
* Update trl/trainer/xpo_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
---------
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>