* initial skeleton
* tokenize fn
* adding bos and eos to tokenization fn
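A minimal sketch of what the BOS/EOS handling amounts to, assuming a standard Hugging Face tokenizer (the checkpoint and `text` here are illustrative, not the exact commit):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # illustrative model
text = "Step 1: 2 + 2 = 4."

input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
if tokenizer.bos_token_id is not None:  # prepend BOS when the tokenizer defines one
    input_ids = [tokenizer.bos_token_id] + input_ids
if tokenizer.eos_token_id is not None:  # append EOS likewise
    input_ids = input_ids + [tokenizer.eos_token_id]
```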
* add PRMTrainer
* fixing small typo in tokenize
* typo in input_ids and labels construction
* fix numpy dimension
* introduce the stepwise reward trainer
* update markdown files
* let user decide post step separator in config
* doc post_step_separator
* do not add post-step tokens to the last step of the reasoning process
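The separator behavior these commits describe falls out of `str.join`, which never emits a trailing separator; a minimal sketch with illustrative values:

```python
steps = ["Step 1: 2 + 2 = 4.", "Step 2: Therefore the answer is 4."]
step_separator = "\n"  # user-configurable via the trainer config

# join() places the separator only *between* steps, so no post-step token
# is appended after the final step of the reasoning process.
completion = step_separator.join(steps)
```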
* renaming prm to stepwisereward
* formatting
* fix tokenize kwargs
* adapt test to the new post_token args
* adding example script
* fix small typo
* add create_model_card and renaming
* fixing booleans
* Adding the new stepwise_preference instead of placeholders for datasets
* formatting
* Update docs/source/_toctree.yml
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update examples/scripts/stepwise_reward_modeling.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/stepwise_reward_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/stepwise_reward_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* update push to hub
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* step_separator can't be None
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
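A plausible shape for that guard (the error wording is illustrative, not the exact code):

```python
def validate(step_separator):
    # Illustrative guard; the real check lives in the trainer/config.
    if step_separator is None:
        raise ValueError("step_separator cannot be None; pass a string such as '\\n'.")
```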
* fix suggested typos
* add citation
* reformat doc
* reordering init
* push to hub prm800k
* changing dataset in example
* change dataset format to align with the sky is blue example
* fix tokenization column names
* fix num labels in openai example
* add support for conversational dataset
* remove trailing whitespace
* replace tokenizer with processing class
* Update docs/source/dataset_formats.mdx
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* remove openai_prm800k
* Update trl/trainer/stepwise_reward_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update trl/trainer/stepwise_reward_trainer.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update docs/source/stepwise_reward_trainer.mdx
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update docs/source/stepwise_reward_trainer.mdx
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* renaming
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* renaming
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* minor renamings in docs
* using prm800k instead of openai_prm800k
* update num labels to 2 following the new format
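With the two-label format, the reward model is an ordinary token classifier; a minimal sketch, with an illustrative checkpoint:

```python
from transformers import AutoModelForTokenClassification

# num_labels=2: one class for "step is incorrect", one for "step is correct".
model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
```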
* changing doc examples to math examples
* change reference to dataset_formats.mdx
* changing dataset config in test
* remove conversational dataset support
* remove conv dataset support
* fix bos token
* fix scriptarguments in example
* completion to completions
* remove ValueError for step_separator inside steps
* run precommit
* remove conv dataset support
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* renaming zen dataset
* remove unused printing
* unknown label column
* introduce the train on last step arg
* add train_on_last_step support to _tokenize
* incorporate train_on_last_step to tests
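A minimal sketch of the `train_on_last_step` masking, under the assumption that -100 marks positions the loss ignores (the label values are illustrative):

```python
labels = [0, 1, 1]  # per-step labels for a three-step completion
train_on_last_step = True

if train_on_last_step:
    # Keep only the final step's label; mask the rest so the loss skips them.
    labels = [-100] * (len(labels) - 1) + [labels[-1]]

assert labels == [-100, -100, 1]
```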
* formatting
* remove comments in trainer
* Refactor `tokenize_row`
* Update max_completion_length parameter in StepwiseRewardConfig
* Collator
* Update comment
* Update type hint
* fix table
* Remove collator
* don't need pad token id
* add error back
* max length args
* use tokenizer arg
* Update doc
* label -> labels
* fixing tokenization issues in tokenize_row
* correct labels for token classification
* adding max_length to tokenize_row
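A sketch of the `tokenize_row` logic these commits converge on: each step's label sits on the separator token that closes the step, every other position is -100, and the result is truncated to `max_length`. The checkpoint, steps, and labels here are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # illustrative
step_separator = "\n"
completions = ["2 + 2 = 4.", "So the answer is 4."]
step_labels = [True, True]
max_length = 512

separator_ids = tokenizer.encode(step_separator, add_special_tokens=False)
input_ids, labels = [], []
for step, step_label in zip(completions, step_labels):
    step_ids = tokenizer.encode(step, add_special_tokens=False) + separator_ids
    input_ids += step_ids
    # Only the token closing each step carries that step's label.
    labels += [-100] * (len(step_ids) - 1) + [int(step_label)]

if max_length is not None:  # truncate to the configured maximum
    input_ids, labels = input_ids[:max_length], labels[:max_length]
```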
* reformat tests
* adding tests for tokenize row
* fixing typos in comments
* update doc
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* Add math_shepherd.py script for dataset processing
* split the dataset
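A minimal sketch of the processing flow, assuming the Math-Shepherd dataset on the Hub and an illustrative split ratio:

```python
from datasets import load_dataset

dataset = load_dataset("peiyi9979/Math-Shepherd", split="train")
# Carve a held-out split out of the single train split.
dataset = dataset.train_test_split(test_size=0.05, seed=42)
```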
* formatting
* same evaluation method for the two training methods
* adding filtering to example script
* formatting
* Add features to avoid casting labels to bool in dataset tokenization
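A minimal sketch of the idea, with a stub dataset and a stand-in `tokenize_row`: pinning `features` on `map` keeps `labels` as int64 rather than letting `datasets` infer or keep a bool dtype:

```python
from datasets import Dataset, Features, Sequence, Value

dataset = Dataset.from_dict({"labels": [[True, False]], "text": ["2 + 2 = 4.\nSo 5."]})

def tokenize_row(example):
    # Stub standing in for the real tokenize_row from the commits above.
    return {"input_ids": [1, 2, 3], "labels": [int(l) for l in example["labels"]]}

# Pinning the output schema stops the bool dtype from propagating to labels.
features = Features({"input_ids": Sequence(Value("int64")),
                     "labels": Sequence(Value("int64"))})
dataset = dataset.map(tokenize_row, remove_columns=dataset.column_names, features=features)
```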
* Update docs/source/stepwise_reward_trainer.mdx [ci skip]
* Add learning_rate parameter to StepwiseRewardConfig class
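A minimal sketch of what adding the parameter looks like, assuming the usual TRL pattern of a dataclass config subclassing `TrainingArguments` (the default value here is illustrative):

```python
from dataclasses import dataclass
from transformers import TrainingArguments

@dataclass
class StepwiseRewardConfig(TrainingArguments):
    # The trainer ships its own learning-rate default instead of
    # inheriting the generic 5e-5 from TrainingArguments.
    learning_rate: float = 1e-5
```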
* update doc
* Remove unused setup_chat_format function
* Fix warning message in stepwise_reward_modeling.py
* Update logging steps in stepwise_reward_trainer.mdx
* little doc change [ci skip]
* Fix copyrights
* fix space after copyrights
* Update dataset loading in stepwise_reward_modeling.py
* refine compute_accuracy and proper test
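A minimal sketch of a `compute_accuracy` refined for token classification: argmax over the class dimension, with -100 positions excluded (shapes and names are assumptions, not the exact code):

```python
import numpy as np

def compute_accuracy(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    mask = labels != -100  # ignore masked (non-step) positions
    return {"accuracy": float((predictions[mask] == labels[mask]).mean())}
```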
* fix tests
* style
* renamings
* renaming in init
* doc renaming
* fix sorting and tag
* experimental [ci skip]
* trigger CI
* other doc fix
---------
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>