Mirror of https://github.com/huggingface/trl.git, synced 2025-10-20 18:43:52 +08:00
Compare commits
769 Commits
67  .github/ISSUE_TEMPLATE/bug-report.yml  (vendored, new file)
@@ -0,0 +1,67 @@
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve TRL
labels: [ "bug" ]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this bug report! 🤗

        Before you submit your bug report:

        - If it is your first time submitting, be sure to check our [bug report guidelines](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#did-you-find-a-bug)

  - type: textarea
    id: system-info
    attributes:
      label: System Info
      description: Please share your system info with us. You can run the command `transformers-cli env` and copy-paste its output below.
      placeholder: trl version, transformers version, platform, python version, ...
    validations:
      required: true

  - type: checkboxes
    id: information-scripts-examples
    attributes:
      label: Information
      description: 'The problem arises when using:'
      options:
        - label: "The official example scripts"
        - label: "My own modified scripts"

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Tasks
      description: "The tasks I am working on are:"
      options:
        - label: "An officially supported task in the `examples` folder"
        - label: "My own task or dataset (give details below)"

  - type: textarea
    id: reproduction
    validations:
      required: true
    attributes:
      label: Reproduction
      description: |
        Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
        If you have code snippets, error messages, stack traces please provide them here as well.
        Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
        Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.

      placeholder: |
        Steps to reproduce the behavior:

          1.
          2.
          3.

  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: "A clear and concise description of what you would expect to happen."
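The template above asks reporters for the output of `transformers-cli env`; a quick way to gather everything the System Info field wants, assuming `trl` and `transformers` are installed in the current environment:

```bash
# Print platform, Python, PyTorch and transformers versions for the bug report
transformers-cli env

# The template also asks for the trl version, which pip can report
pip show trl | head -n 2
```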
31  .github/ISSUE_TEMPLATE/feature-request.yml  (vendored, new file)
@@ -0,0 +1,31 @@
name: "\U0001F680 Feature request"
description: Submit a proposal/request for a new TRL feature
labels: [ "Feature request" ]
body:
  - type: textarea
    id: feature-request
    validations:
      required: true
    attributes:
      label: Feature request
      description: |
        A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist.

  - type: textarea
    id: motivation
    validations:
      required: true
    attributes:
      label: Motivation
      description: |
        Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.

  - type: textarea
    id: contribution
    validations:
      required: true
    attributes:
      label: Your contribution
      description: |
        Is there any way that you could help, e.g. by submitting a PR? Make sure to read the CONTRIBUTING.MD [readme](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md)
32  .github/ISSUE_TEMPLATE/new-trainer-addition.yml  (vendored, new file)
@@ -0,0 +1,32 @@
name: "\U0001F31F New trainer addition"
description: Submit a proposal/request to implement a new trainer for a post-training method
labels: [ "New trainer" ]

body:
  - type: textarea
    id: description-request
    validations:
      required: true
    attributes:
      label: Method description
      description: |
        Put any and all important information relative to the method

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Open source status
      description: |
        Please note that if the method implementation isn't available or model weights with training datasets aren't available, we are less likely to implement it in `trl`.
      options:
        - label: "The method implementation is available"
        - label: "The model weights are available"
        - label: "The training datasets are available"

  - type: textarea
    id: additional-info
    attributes:
      label: Provide useful links for the implementation
      description: |
        Please provide information regarding the implementation, the weights, and the authors.
        Please mention the authors by @gh-username if you're aware of their usernames.
32  .github/PULL_REQUEST_TEMPLATE.md  (vendored, new file)
@@ -0,0 +1,32 @@
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet though.

Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.

Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.

Once you're done, someone will review your PR shortly. They may suggest changes to make the code even better.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#create-a-pull-request),
      Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the
      [documentation guidelines](https://github.com/huggingface/trl/tree/main/docs).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
4  .github/workflows/build_documentation.yml  (vendored)
@@ -13,7 +13,7 @@ jobs:
     with:
       commit_sha: ${{ github.sha }}
       package: trl
-      repo_owner: lvwerra
+      version_tag_suffix: ""
       custom_container: huggingface/transformers-doc-builder
     secrets:
-      token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
4  .github/workflows/build_pr_documentation.yml  (vendored)
@@ -14,5 +14,5 @@ jobs:
       commit_sha: ${{ github.event.pull_request.head.sha }}
       pr_number: ${{ github.event.number }}
       package: trl
-      repo_owner: lvwerra
-      version_tag_suffix: ""
+      version_tag_suffix: ""
+      custom_container: huggingface/transformers-doc-builder
33  .github/workflows/clear_cache.yml  (vendored, new file)
@@ -0,0 +1,33 @@
name: "Cleanup Cache"

on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *"

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Cleanup
        run: |
          gh extension install actions/gh-actions-cache

          REPO=${{ github.repository }}

          echo "Fetching list of cache key"
          cacheKeysForPR=$(gh actions-cache list -R $REPO | cut -f 1 )

          ## Setting this to not fail the workflow while deleting cache keys.
          set +e
          echo "Deleting caches..."
          for cacheKey in $cacheKeysForPR
          do
              gh actions-cache delete $cacheKey -R $REPO --confirm
          done
          echo "Done"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
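Because the workflow declares `workflow_dispatch`, the same cleanup can also be triggered or rehearsed by hand with the GitHub CLI; a minimal sketch, assuming `gh` is authenticated with access to the repository:

```bash
# Kick off the scheduled cleanup on demand
gh workflow run clear_cache.yml -R huggingface/trl

# Or inspect and delete caches manually with the same extension the job installs
gh extension install actions/gh-actions-cache
gh actions-cache list -R huggingface/trl
```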
13  .github/workflows/delete_doc_comment.yml  (vendored, deleted file)
@@ -1,13 +0,0 @@
-name: Delete dev documentation
-
-on:
-  pull_request:
-    types: [ closed ]
-
-
-jobs:
-  delete:
-    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
-    with:
-      pr_number: ${{ github.event.number }}
-      package: trl
95  .github/workflows/docker-build.yml  (vendored, new file)
@@ -0,0 +1,95 @@
name: Build Docker images (scheduled)

on:
  workflow_dispatch:
  workflow_call:
  schedule:
    - cron: "0 1 * * *"

concurrency:
  group: docker-image-builds
  cancel-in-progress: false

env:
  CI_SLACK_CHANNEL: ${{ secrets.CI_DOCKER_CHANNEL }}

jobs:
  trl-latest:
    name: "Latest TRL GPU"
    runs-on: ubuntu-latest
    steps:
      - name: Cleanup disk
        run: |
          sudo ls -l /usr/local/lib/
          sudo ls -l /usr/share/
          sudo du -sh /usr/local/lib/
          sudo du -sh /usr/share/
          sudo rm -rf /usr/local/lib/android
          sudo rm -rf /usr/share/dotnet
          sudo du -sh /usr/local/lib/
          sudo du -sh /usr/share/
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      - name: Check out code
        uses: actions/checkout@v4
      - name: Login to DockerHub
        uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and Push GPU
        uses: docker/build-push-action@v4
        with:
          context: ./docker/trl-latest-gpu
          push: true
          tags: huggingface/trl-latest-gpu

      - name: Post to Slack
        if: always()
        uses: huggingface/hf-workflows/.github/actions/post-slack@main
        with:
          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
          title: 🤗 Results of the trl-latest-gpu Docker Image build
          status: ${{ job.status }}
          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}

  trl-source:
    name: "Latest TRL + HF ecosystem from source"
    runs-on: ubuntu-latest
    steps:
      - name: Cleanup disk
        run: |
          sudo ls -l /usr/local/lib/
          sudo ls -l /usr/share/
          sudo du -sh /usr/local/lib/
          sudo du -sh /usr/share/
          sudo rm -rf /usr/local/lib/android
          sudo rm -rf /usr/share/dotnet
          sudo du -sh /usr/local/lib/
          sudo du -sh /usr/share/
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1
      - name: Check out code
        uses: actions/checkout@v4
      - name: Login to DockerHub
        uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and Push GPU
        uses: docker/build-push-action@v4
        with:
          context: ./docker/trl-source-gpu
          push: true
          tags: huggingface/trl-source-gpu

      - name: Post to Slack
        if: always()
        uses: huggingface/hf-workflows/.github/actions/post-slack@main
        with:
          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
          title: 🤗 Results of the trl-source-gpu Docker Image build
          status: ${{ job.status }}
          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
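The two images this workflow pushes are public on Docker Hub, so the CI environment can be reproduced locally; a sketch, assuming a machine with the NVIDIA container runtime (the slow-test jobs below activate a conda env named `trl` inside these images):

```bash
# Pull the nightly image and open a shell in the same environment CI uses
docker pull huggingface/trl-latest-gpu:latest
docker run --gpus all -it --rm huggingface/trl-latest-gpu:latest bash

# Inside the container, activate the TRL conda environment
source activate trl
python -c "import trl; print(trl.__version__)"
```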
98  .github/workflows/slow-tests.yml  (vendored, new file)
@@ -0,0 +1,98 @@
name: Slow tests (on push)

on:
  push:
    branches: [ main ]
    paths:
      # Run only when python files are modified
      - "trl/**.py"
      - "examples/**.py"
env:
  RUN_SLOW: "yes"
  IS_GITHUB_CI: "1"
  SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}


jobs:
  run_all_tests_single_gpu:
    strategy:
      fail-fast: false
      matrix:
        docker-image-name: ["huggingface/trl-latest-gpu:latest", "huggingface/trl-source-gpu:latest"]
    runs-on:
      group: aws-g4dn-2xlarge
    env:
      CUDA_VISIBLE_DEVICES: "0"
      TEST_TYPE: "single_gpu_${{ matrix.docker-image-name }}"
    container:
      image: ${{ matrix.docker-image-name }}
      options: --gpus all --shm-size "16gb" -e NVIDIA_DISABLE_REQUIRE=true
    defaults:
      run:
        shell: bash
    steps:
      - uses: actions/checkout@v4
      - name: Pip install
        run: |
          source activate trl
          pip install -e ".[test]" --no-deps
          pip install pytest-reportlog parameterized

      - name: Run slow SFT tests on single GPU
        if: always()
        run: |
          source activate trl
          make slow_tests

      - name: Generate Report
        if: always()
        run: |
          pip install slack_sdk tabulate
          python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY


  run_all_tests_multi_gpu:
    strategy:
      fail-fast: false
      matrix:
        docker-image-name: ["huggingface/trl-latest-gpu:latest", "huggingface/trl-source-gpu:latest"]
    runs-on:
      group: aws-g4dn-2xlarge
    env:
      CUDA_VISIBLE_DEVICES: "0,1"
      TEST_TYPE: "multi_gpu_${{ matrix.docker-image-name }}"
    container:
      image: ${{ matrix.docker-image-name }}
      options: --gpus all --shm-size "16gb" -e NVIDIA_DISABLE_REQUIRE=true
    defaults:
      run:
        shell: bash
    steps:
      - uses: actions/checkout@v4
      - name: Pip install
        run: |
          source activate trl
          pip install -e ".[test]" --no-deps
          pip install pytest-reportlog parameterized

      - name: Run slow SFT tests on Multi GPU
        if: always()
        run: |
          source activate trl
          make slow_tests

      - name: Run end-to-end examples tests on multi GPU
        if: always()
        run: |
          source activate trl
          pip install deepspeed
          make test_examples

      - name: Generate Reports
        if: always()
        run: |
          pip install slack_sdk tabulate
          python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY
          python scripts/log_example_reports.py --text_file_name temp_results_sft_tests.txt >> $GITHUB_STEP_SUMMARY
          python scripts/log_example_reports.py --text_file_name temp_results_dpo_tests.txt >> $GITHUB_STEP_SUMMARY
          rm *.txt
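Outside CI, the single-GPU job boils down to a handful of commands; a sketch, assuming a CUDA machine with the repository checked out and an environment that already holds TRL's dependencies:

```bash
# Mirror the single-GPU slow-test job locally
export RUN_SLOW=yes            # the env flag the workflow sets
export CUDA_VISIBLE_DEVICES=0  # pin to one GPU like the single-GPU job
pip install -e ".[test]" --no-deps
pip install pytest-reportlog parameterized
make slow_tests
```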
27  .github/workflows/stale.yml  (vendored, new file)
@@ -0,0 +1,27 @@
name: Stale Bot

on:
  schedule:
    - cron: "0 15 * * *"

jobs:
  close_stale_issues:
    name: Close Stale Issues
    if: github.repository == 'huggingface/trl'
    runs-on: ubuntu-latest
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.8

      - name: Install requirements
        run: |
          pip install PyGithub
      - name: Close stale issues
        run: |
          python scripts/stale.py
46  .github/workflows/tests-main.yml  (vendored, new file)
@@ -0,0 +1,46 @@
name: tests on transformers PEFT main

on:
  push:
    branches: [ main ]

env:
  CI_SLACK_CHANNEL: ${{ secrets.CI_PUSH_MAIN_CHANNEL }}

jobs:
  tests:
    strategy:
      matrix:
        python-version: ['3.9', '3.10', '3.11']
        os: ['ubuntu-latest', 'windows-latest']
      fail-fast: false
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"
          cache-dependency-path: |
            setup.py
            requirements.txt
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          # install PEFT & transformers from source
          pip install -U git+https://github.com/huggingface/peft.git
          pip install -U git+https://github.com/huggingface/transformers.git
          # cpu version of pytorch
          pip install ".[test, diffusers]"
      - name: Test with pytest
        run: |
          make test
      - name: Post to Slack
        if: always()
        uses: huggingface/hf-workflows/.github/actions/post-slack@main
        with:
          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
          title: 🤗 Results of the TRL CI on transformers/PEFT main
          status: ${{ job.status }}
          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
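To reproduce this against-main matrix locally before pushing, the workflow's install sequence can be run verbatim from a TRL checkout:

```bash
# Same install sequence as the workflow: PEFT and transformers from source
python -m pip install --upgrade pip
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/transformers.git
pip install ".[test, diffusers]"
make test
```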
73  .github/workflows/tests.yml  (vendored)
@@ -5,38 +5,79 @@ on:
     branches: [ main ]
   pull_request:
     branches: [ main ]
+    paths:
+      # Run only when relevant files are modified
+      - "trl/**.py"
+      - "examples/**.py"
+      - "scripts/**.py"
+      - ".github/**.yml"
+      - "tests/**.py"
+
+env:
+  TQDM_DISABLE: 1
 
 jobs:
 
   check_code_quality:
     runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.9]
+
     steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: "3.8"
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install .[dev]
-      - name: Check quality
-        run: |
-          make quality
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          submodules: recursive
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - uses: pre-commit/action@v3.0.1
+        with:
+          extra_args: --all-files
 
   tests:
     needs: check_code_quality
     strategy:
       matrix:
-        python-version: [3.7, 3.8, 3.9]
-        os: ['ubuntu-latest', 'macos-latest', 'windows-latest']
+        python-version: ['3.9', '3.10', '3.11']
+        os: ['ubuntu-latest', 'windows-latest']
     runs-on: ${{ matrix.os }}
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: "pip"
+          cache-dependency-path: |
+            setup.py
+            requirements.txt
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
+          # install PEFT & transformers from source
+          pip install -U git+https://github.com/huggingface/peft.git
+          pip install -U git+https://github.com/huggingface/transformers.git
+          # cpu version of pytorch
           pip install ".[test, diffusers]"
       - name: Test with pytest
         run: |
           make test
+
+  tests_no_optional_dep:
+    needs: check_code_quality
+    runs-on: 'ubuntu-latest'
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python 3.9
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.9'
+          cache: "pip"
+          cache-dependency-path: |
+            setup.py
+            requirements.txt
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
15  .github/workflows/trufflehog.yml  (vendored, new file)
@@ -0,0 +1,15 @@
on:
  push:

name: Secret Leaks

jobs:
  trufflehog:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Secret Scanning
        uses: trufflesecurity/trufflehog@main
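The same scan can be run before pushing; a minimal sketch, assuming the TruffleHog v3 binary is installed locally:

```bash
# Scan the full git history of a local checkout for leaked credentials
trufflehog git file://. --only-verified
```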
16  .github/workflows/upload_pr_documentation.yml  (vendored, new file)
@@ -0,0 +1,16 @@
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: trl
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
6  .gitignore  (vendored)
@@ -1,3 +1,4 @@
 benchmark/trl
+*.bak
 .gitattributes
 .last_checked
@@ -142,4 +143,7 @@ checklink/cookies.txt
 # wandb files
 nbs/wandb/
 examples/notebooks/wandb/
-wandb/
+wandb/
+
+# cli scripts that are symlinked from `examples/scripts`
+trl/commands/scripts/
17  .pre-commit-config.yaml  (new file)
@@ -0,0 +1,17 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.3
    hooks:
      - id: ruff
        types_or: [ python, pyi ]
        args: [ --fix ]
      - id: ruff-format
        types_or: [ python, pyi ]

#  - repo: https://github.com/codespell-project/codespell
#    rev: v2.1.0
#    hooks:
#      - id: codespell
#        args:
#          - --ignore-words-list=nd,reacher,thist,ths,magent,ba
#          - --skip=docs/css/termynal.css,docs/js/termynal.js
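With this config in place, the hooks can be wired into every local commit, or run across the tree in one pass the way the Makefile's `precommit` target does:

```bash
pip install pre-commit
pre-commit install            # run ruff automatically on each `git commit`
pre-commit run --all-files    # one-shot pass over the whole repository
```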
2  CITATION.cff
@@ -17,7 +17,7 @@ authors:
     family-names: Thrush
   - given-names: Nathan
     family-names: Lambert
-repository-code: 'https://github.com/lvwerra/trl'
+repository-code: 'https://github.com/huggingface/trl'
 abstract: "With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by \U0001F917 Hugging Face. Therefore, pre-trained language models can be directly loaded via transformers. At this point, most decoder and encoder-decoder architectures are supported."
 keywords:
   - rlhf
133  CODE_OF_CONDUCT.md  (new file)
@@ -0,0 +1,133 @@

# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
  community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or advances of
  any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
  without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
feedback@huggingface.co.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series of
actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the
community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].

For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].

[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations
262
CONTRIBUTING.md
262
CONTRIBUTING.md
@ -1,48 +1,258 @@
|
||||
# How to contribute
|
||||
# How to contribute to TRL?
|
||||
|
||||
## How to get started
|
||||
Everyone is welcome to contribute, and we value everybody's contribution. Code
|
||||
contributions are not the only way to help the community. Answering questions, helping
|
||||
others, and improving the documentation are also immensely valuable.
|
||||
|
||||
Before you start contributing make sure you installed all the dev tools:
|
||||
It also helps us if you spread the word! Reference the library in blog posts
|
||||
about the awesome projects it made possible, shout out on Twitter every time it has
|
||||
helped you, or simply ⭐️ the repository to say thank you.
|
||||
|
||||
However you choose to contribute, please be mindful and respect our
|
||||
[code of conduct](https://github.com/huggingface/trl/blob/main/CODE_OF_CONDUCT.md).
|
||||
|
||||
**This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**
|
||||
|
||||
## Ways to contribute
|
||||
|
||||
There are several ways you can contribute to TRL:
|
||||
|
||||
* Fix outstanding issues with the existing code.
|
||||
* Submit issues related to bugs or desired new features.
|
||||
* Implement trainers for new post-training algorithms.
|
||||
* Contribute to the examples or to the documentation.
|
||||
|
||||
If you don't know where to start, there is a special [Good First
|
||||
Issue](https://github.com/huggingface/trl/contribute) listing. It will give you a list of
|
||||
open issues that are beginner-friendly and help you start contributing to open-source. The best way to do that is to open a Pull Request and link it to the issue that you'd like to work on. We try to give priority to opened PRs as we can easily track the progress of the fix, and if the contributor does not have time anymore, someone else can take the PR over.
|
||||
|
||||
For something slightly more challenging, you can also take a look at the [Good Second Issue](https://github.com/huggingface/trl/labels/Good%20Second%20Issue) list. In general though, if you feel like you know what you're doing, go for it and we'll help you get there! 🚀
|
||||
|
||||
> All contributions are equally valuable to the community. 🥰
|
||||
|
||||
Before you start contributing make sure you have installed all the dev tools:
|
||||
|
||||
```bash
|
||||
pip install -e ".[dev]"
|
||||
make dev
|
||||
```
|
||||
|
||||
## Did you find a bug?
|
||||
## Fixing outstanding issues
|
||||
|
||||
* Ensure the bug was not already reported by searching on GitHub under Issues.
|
||||
* If you're unable to find an open issue addressing the problem, open a new one. Be sure to include a title and clear description, as much relevant information as possible, and a code sample or an executable test case demonstrating the expected behavior that is not occurring.
|
||||
* Be sure to add the complete error messages.
|
||||
If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](#create-a-pull-request) and open a Pull Request!
|
||||
|
||||
#### Did you write a patch that fixes a bug?
|
||||
## Submitting a bug-related issue or feature request
|
||||
|
||||
* Open a new GitHub pull request with the patch.
|
||||
* Ensure that your PR includes a test that fails without your patch, and pass with it.
|
||||
* Ensure the PR description clearly describes the problem and solution. Include the relevant issue number if applicable.
|
||||
Do your best to follow these guidelines when submitting a bug-related issue or a feature request. It will make it easier for us to come back to you quickly and with good feedback.
|
||||
|
||||
## PR submission guidelines
|
||||
### Did you find a bug?
|
||||
|
||||
* Keep each PR focused. While it's more convenient, do not combine several unrelated fixes together. Create as many branches as needing to keep each PR focused.
|
||||
* Do not mix style changes/fixes with "functional" changes. It's very difficult to review such PRs and it most likely get rejected.
|
||||
* Do not add/remove vertical whitespace. Preserve the original style of the file you edit as much as you can.
|
||||
* Do not turn an already submitted PR into your development playground. If after you submitted PR, you discovered that more work is needed - close the PR, do the required work and then submit a new PR. Otherwise each of your commits requires attention from maintainers of the project.
|
||||
* If, however, you submitted a PR and received a request for changes, you should proceed with commits inside that PR, so that the maintainer can see the incremental fixes and won't need to review the whole PR again. In the exception case where you realize it'll take many many commits to complete the requests, then it's probably best to close the PR, do the work and then submit it again. Use common sense where you'd choose one way over another.
|
||||
The TRL library is robust and reliable thanks to users who report the problems they encounter.
|
||||
|
||||
### Before you submit a PR
|
||||
Before you report an issue, we would really appreciate it if you could **make sure the bug was not
|
||||
already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the library itself, and not your code.
|
||||
|
||||
First you want to make sure that all the tests pass:
|
||||
Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so we can quickly resolve it:
|
||||
|
||||
* Your **OS type and version**, **Python**, **PyTorch**, **TRL** and **Transformers** versions.
|
||||
* A short, self-contained, code snippet that allows us to reproduce the bug in
|
||||
less than 30s.
|
||||
* The *full* traceback if an exception is raised.
|
||||
* Attach any other additional information, like screenshots, you think may help.
|
||||
|
||||
To get the OS and software versions automatically, run the following command:
|
||||
|
||||
```bash
|
||||
make test
|
||||
transformers-cli env
|
||||
```
|
||||
|
||||
Then before submitting your PR make sure the code quality follows the standards. You can run the following command to format and test:
|
||||
### Do you want a new feature?
|
||||
|
||||
If there is a new feature you'd like to see in TRL, please open an issue and describe:
|
||||
|
||||
1. What is the *motivation* behind this feature? Is it related to a problem or frustration with the library? Is it a feature related to something you need for a project? Is it something you worked on and think it could benefit the community?
|
||||
|
||||
Whatever it is, we'd love to hear about it!
|
||||
|
||||
2. Describe your requested feature in as much detail as possible. The more you can tell us about it, the better we'll be able to help you.
|
||||
3. Provide a *code snippet* that demonstrates the features usage.
|
||||
4. If the feature is related to a paper, please include a link.
|
||||
|
||||
If your issue is well written we're already 80% of the way there by the time you create it.
|
||||
|
||||
## Do you want to implement a new trainer?
|
||||
|
||||
New post-training methods are published on a frequent basis and those which satisfy the following criteria are good candidates to be integrated in TRL:
|
||||
|
||||
* **Simplicity:** does the new method achieve similar performance as prior methods, but with less complexity? A good example is [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) (DPO), which provided a simpler and compelling alternative to RLHF methods.
|
||||
* **Efficiency:** does the new method provide a significant improvement in training efficiency? A good example is [Odds Ratio Preference Optimization](https://arxiv.org/abs/2403.07691v2), which utilises a similar objective as DPO, but requires half the GPU VRAM.
|
||||
|
||||
Methods which only provide incremental improvements at the expense of added complexity or compute costs are unlikely to be included in TRL.
|
||||
|
||||
If you want to implement a trainer for a new post-training method, first open an issue and provide the following information:
|
||||
|
||||
* A short description of the method and a link to the paper.
|
||||
* Link to the implementation if it is open-sourced.
|
||||
* Link to model weights trained with the method if they are available.
|
||||
|
||||
Based on the community and maintainer feedback, the next step will be to implement the trainer and config classes. See the following examples for inspiration:
|
||||
|
||||
* Paired preference optimisation: [`dpo_trainer.py`](./trl/trainer/dpo_trainer.py) and [`dpo_config.py`](./trl/trainer/dpo_config.py)
|
||||
* RL-based optimisation: [`rloo_trainer.py](./trl/trainer/rloo_trainer.py) and [`rloo_config.py](./trl/trainer/rloo_config.py)
|
||||
* Online optimisation: [`online_dpo_trainer.py`](./trl/trainer/online_dpo_trainer.py) and [`online_dpo_config.py`](./trl/trainer/online_dpo_config.py)
|
||||
|
||||
## Do you want to add documentation?
|
||||
|
||||
We're always looking for improvements to the documentation that make it more clear and accurate. Please let us know how the documentation can be improved, such as typos, dead links and any missing, unclear or inaccurate content.. We'll be happy to make the changes or help you make a contribution if you're interested!
|
||||
|
||||
## Submitting a pull request (PR)
|
||||
|
||||
Before writing code, we strongly advise you to search through the existing PRs or
|
||||
issues to make sure that nobody is already working on the same thing. If you are
|
||||
unsure, it is always a good idea to open an issue to get some feedback.
|
||||
|
||||
You will need basic `git` proficiency to be able to contribute to
|
||||
TRL. `git` is not the easiest tool to use but it has the greatest
|
||||
manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
|
||||
Git](https://git-scm.com/book/en/v2) is a very good reference.
|
||||
|
||||
Follow these steps to start contributing:
|
||||
|
||||
1. Fork the [repository](https://github.com/huggingface/trl) by
|
||||
clicking on the 'Fork' button on the repository's page. This creates a copy of the code
|
||||
under your GitHub user account.
|
||||
|
||||
2. Clone your fork to your local disk, and add the base repository as a remote. The following command
|
||||
assumes you have your public SSH key uploaded to GitHub. See the following guide for more
|
||||
[information](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository).
|
||||
|
||||
```bash
|
||||
$ git clone git@github.com:<your Github handle>/trl.git
|
||||
$ cd trl
|
||||
$ git remote add upstream https://github.com/huggingface/trl.git
|
||||
```
|
||||
|
||||
3. Create a new branch to hold your development changes, and do this for every new PR you work on.
|
||||
|
||||
Start by synchronizing your `main` branch with the `upstream/main` branch (ore details in the [GitHub Docs](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork)):
|
||||
|
||||
```bash
|
||||
$ git checkout main
|
||||
$ git fetch upstream
|
||||
$ git merge upstream/main
|
||||
```
|
||||
|
||||
Once your `main` branch is synchronized, create a new branch from it:
|
||||
|
||||
```bash
|
||||
$ git checkout -b a-descriptive-name-for-my-changes
|
||||
```
|
||||
|
||||
**Do not** work on the `main` branch.
|
||||
|
||||
4. Set up a development environment by running the following command in a conda or a virtual environment you've created for working on this library:
|
||||
|
||||
```bash
|
||||
$ make dev
|
||||
```
|
||||
|
||||
(If TRL was already installed in the virtual environment, remove
|
||||
it with `pip uninstall trl` before reinstalling it.)
|
||||
|
||||
Alternatively, if you are using [Visual Studio Code](https://code.visualstudio.com/Download), the fastest way to get set up is by using
|
||||
the provided Dev Container. Documentation on how to get started with dev containers is available [here](https://code.visualstudio.com/docs/remote/containers).
|
||||
|
||||
5. Develop the features on your branch.
|
||||
|
||||
As you work on the features, you should make sure that the test suite
|
||||
passes. You should run the tests impacted by your changes like this (see
|
||||
below an explanation regarding the environment variable):
|
||||
|
||||
```bash
|
||||
$ pytest tests/<TEST_TO_RUN>.py
|
||||
```
|
||||
|
||||
> For the following commands leveraging the `make` utility, we recommend using the WSL system when running on
|
||||
> Windows. More information [here](https://docs.microsoft.com/en-us/windows/wsl/about).
|
||||
|
||||
You can also run the full suite with the following command.
|
||||
|
||||
```bash
|
||||
$ make test
|
||||
```
|
||||
|
||||
TRL relies on `ruff` to format its source code
|
||||
consistently. After you make changes, apply automatic style corrections and code verifications
|
||||
that can't be automated in one go with:
|
||||
|
||||
This target is also optimized to only work with files modified by the PR you're working on.
|
||||
|
||||
If you prefer to run the checks one after the other, the following command apply the
|
||||
style corrections:
|
||||
|
||||
```bash
|
||||
$ make precommit
|
||||
```
|
||||
|

   Once you're happy with your changes, add the changed files using `git add` and
   make a commit with `git commit` to record your changes locally:

   ```bash
   $ git add modified_file.py
   $ git commit
   ```

   Please write [good commit messages](https://chris.beams.io/posts/git-commit/).

   It is a good idea to sync your copy of the code with the original
   repository regularly. This way you can quickly account for changes:

   ```bash
   $ git fetch upstream
   $ git rebase upstream/main
   ```

   Push the changes to your account using:

   ```bash
   $ git push -u origin a-descriptive-name-for-my-changes
   ```

6. Once you are satisfied (**and the checklist below is happy too**), go to the
   webpage of your fork on GitHub. Click on 'Pull request' to send your changes
   to the project maintainers for review.

7. It's OK if maintainers ask you for changes. It happens to core contributors
   too! So everyone can see the changes in the pull request, work in your local
   branch and push the changes to your fork. They will automatically appear in
   the pull request.

### Checklist

1. The title of your pull request should be a summary of its contribution;
2. If your pull request addresses an issue, please mention the issue number in
   the pull request description to make sure they are linked (and people
   consulting the issue know you are working on it);
3. To indicate a work in progress please prefix the title with `[WIP]`, or mark
   the PR as a draft PR. These are useful to avoid duplicated work, and to differentiate
   it from PRs ready to be merged;
4. Make sure existing tests pass;
5. Add high-coverage tests. No quality testing = no merge.

### Tests

An extensive test suite is included to test the library behavior and several examples. Library tests can be found in
the [tests folder](https://github.com/huggingface/trl/tree/main/tests).

We use `pytest` to run the tests. From the root of the
repository, here's how to run the full library test suite with `pytest`:

```bash
$ python -m pytest -sv ./tests
```

In fact, that's how `make test` is implemented (sans the `pip install` line)!

You can specify a smaller set of tests in order to test only the feature
you're working on.

## Do you want to contribute to the documentation?

* Docs are in the `docs/` folder and can be updated there.

MANIFEST.in
@@ -2,4 +2,4 @@ include settings.ini
include LICENSE
include CONTRIBUTING.md
include README.md
recursive-exclude * __pycache__

Makefile
@@ -1,13 +1,44 @@
```make
.PHONY: test precommit benchmark_core benchmark_aux common_tests slow_tests test_examples tests_gpu

ACCELERATE_CONFIG_PATH = `pwd`/examples/accelerate_configs
COMMAND_FILES_PATH = `pwd`/commands


dev:
	[ -L "$(pwd)/trl/commands/scripts" ] && unlink "$(pwd)/trl/commands/scripts" || true
	pip install -e ".[dev]"
	ln -s `pwd`/examples/scripts/ `pwd`/trl/commands

test:
	python -m pytest -n auto --dist=loadfile -s -v --reruns 5 --reruns-delay 1 --only-rerun '(OSError|Timeout|HTTPError.*502|HTTPError.*504||not less than or equal to 0.01)' ./tests/

precommit:
	pre-commit run --all-files
	python scripts/add_copyrights.py

benchmark_core:
	bash ./benchmark/benchmark_core.sh

benchmark_aux:
	bash ./benchmark/benchmark_aux.sh

tests_gpu:
	python -m pytest tests/test_* $(if $(IS_GITHUB_CI),--report-log "common_tests.log",)

slow_tests:
	python -m pytest tests/slow/test_* $(if $(IS_GITHUB_CI),--report-log "slow_tests.log",)

test_examples:
	touch temp_results_sft_tests.txt
	for file in $(ACCELERATE_CONFIG_PATH)/*.yaml; do \
		TRL_ACCELERATE_CONFIG=$${file} bash $(COMMAND_FILES_PATH)/run_sft.sh; \
		echo $$?','$${file} >> temp_results_sft_tests.txt; \
	done

	touch temp_results_dpo_tests.txt
	for file in $(ACCELERATE_CONFIG_PATH)/*.yaml; do \
		TRL_ACCELERATE_CONFIG=$${file} bash $(COMMAND_FILES_PATH)/run_dpo.sh; \
		echo $$?','$${file} >> temp_results_dpo_tests.txt; \
	done
```
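
A note on `test_examples`: each loop iteration records the command's exit status next to the accelerate config that produced it, so a quick way to audit a run is (illustrative):

```bash
# After `make test_examples`, each line is "<exit code>,<config path>"
cat temp_results_sft_tests.txt
cat temp_results_dpo_tests.txt
```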

README.md
@@ -3,57 +3,140 @@
</div>

# TRL - Transformer Reinforcement Learning
> Full stack library to fine-tune and align large language models.

<p align="center">
    <a href="https://github.com/huggingface/trl/blob/main/LICENSE">
        <img alt="License" src="https://img.shields.io/github/license/huggingface/trl.svg?color=blue">
    </a>
    <a href="https://huggingface.co/docs/trl/index">
        <img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/trl/index.svg?down_color=red&down_message=offline&up_message=online">
    </a>
    <a href="https://github.com/huggingface/trl/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/trl.svg">
    </a>
</p>


## What is it?

The `trl` library is a full stack tool to fine-tune and align transformer language and diffusion models using methods such as Supervised Fine-tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).

The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library and thus allows you to use any model architecture available there.

## Highlights

- **`Efficient and scalable`**:
    - [`accelerate`](https://github.com/huggingface/accelerate) is the backbone of `trl`, which allows you to scale model training from a single GPU to a large-scale multi-node cluster with methods such as DDP and DeepSpeed.
    - [`PEFT`](https://github.com/huggingface/peft) is fully integrated and allows you to train even the largest models on modest hardware with quantisation and methods such as LoRA or QLoRA.
    - [`unsloth`](https://github.com/unslothai/unsloth) is also integrated and allows you to significantly speed up training with dedicated kernels.
- **`CLI`**: With the [CLI](https://huggingface.co/docs/trl/clis) you can fine-tune and chat with LLMs without writing any code, using a single command and a flexible config system.
- **`Trainers`**: The trainer classes are an abstraction to apply many fine-tuning methods with ease, such as the [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`DPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.DPOTrainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer), [`PPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.PPOTrainer), [`CPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.CPOTrainer), and [`ORPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.ORPOTrainer).
- **`AutoModels`**: The [`AutoModelForCausalLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead) & [`AutoModelForSeq2SeqLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForSeq2SeqLMWithValueHead) classes add a value head to the model, which makes it possible to train them with RL algorithms such as PPO.
- **`Examples`**: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier, do full RLHF using adapters only, train GPT-j to be less toxic, follow the [StackLlama example](https://huggingface.co/blog/stackllama), etc., following the [examples](https://github.com/huggingface/trl/tree/main/examples).

## Installation

### Python package
Install the library with `pip`:
```bash
pip install trl
```

### From source
If you want to use the latest features before an official release you can install from source:
```bash
pip install git+https://github.com/huggingface/trl.git
```

### Repository
If you want to use the examples you can clone the repository with the following command:
```bash
git clone https://github.com/huggingface/trl.git
```

## Command Line Interface (CLI)

You can use the TRL Command Line Interface (CLI) to quickly get started with Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), and to test your aligned model with the chat CLI:

**SFT:**

```bash
trl sft --model_name_or_path facebook/opt-125m --dataset_name stanfordnlp/imdb --output_dir opt-sft-imdb
```

**DPO:**

```bash
trl dpo --model_name_or_path facebook/opt-125m --dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style --output_dir opt-sft-hh-rlhf
```

**Chat:**

```bash
trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
```

Read more about the CLI in the [relevant documentation section](https://huggingface.co/docs/trl/main/en/clis) or use `--help` for more details.

## How to use

For more flexibility and control over the training, you can use the dedicated trainer classes to fine-tune the model in Python.

### `SFTTrainer`

This is a basic example of how to use the `SFTTrainer` from the library. The `SFTTrainer` is a light wrapper around the `transformers` Trainer to easily fine-tune language models or adapters on a custom dataset.

```python
# imports
from datasets import load_dataset
from trl import SFTTrainer

# get dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")

# get trainer
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

# train
trainer.train()
```

### `RewardTrainer`

This is a basic example of how to use the `RewardTrainer` from the library. The `RewardTrainer` is a wrapper around the `transformers` Trainer to easily fine-tune reward models or adapters on a custom preference dataset.

```python
# imports
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer

# load model and dataset - dataset needs to be in a specific format
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

...

# load trainer
trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
)

# train
trainer.train()
```
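
The `...` above elides dataset preparation. As a hedged sketch of what "a specific format" means for reward modeling (the dataset name and column layout below are illustrative assumptions, not a guaranteed API; check the `RewardTrainer` docs for the exact expected columns):

```python
from datasets import load_dataset

raw = load_dataset("Anthropic/hh-rlhf", split="train")  # any chosen/rejected pair dataset

def tokenize_pair(example):
    # Reward modeling compares a preferred ("chosen") and a rejected completion.
    chosen = tokenizer(example["chosen"], truncation=True)
    rejected = tokenizer(example["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = raw.map(tokenize_pair)
```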

### `PPOTrainer`

This is a basic example of how to use the `PPOTrainer` from the library. Based on a query, the language model creates a response which is then evaluated. The evaluation could be a human in the loop or another model's output.

```python
# imports
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch

# get models
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = create_reference_model(model)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# initialize trainer
ppo_config = PPOConfig(batch_size=1, mini_batch_size=1)

# encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")

# get model response
response_tensor = respond_to_batch(model, query_tensor)

# create a ppo trainer
ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer)

# define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0)]

# train model for one step with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```
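
The snippet above performs a single PPO step. In practice you would repeat rollout, evaluation, and optimization in a loop; a minimal illustrative sketch, reusing the objects defined above (the constant reward is a placeholder you would replace with a real reward signal):

```python
for _ in range(4):
    response_tensor = respond_to_batch(model, query_tensor)
    reward = [torch.tensor(1.0)]  # placeholder: use a reward model or human feedback here
    train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```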

### `DPOTrainer`

`DPOTrainer` is a trainer that uses the [Direct Preference Optimization algorithm](https://huggingface.co/papers/2305.18290). This is a basic example of how to use the `DPOTrainer` from the library. The `DPOTrainer` is a wrapper around the `transformers` Trainer to easily fine-tune language models or adapters on a custom preference dataset.

```python
# imports
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

# load model and dataset - dataset needs to be in a specific format
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

...

# load trainer
trainer = DPOTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
)

# train
trainer.train()
```
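
Again, `...` elides dataset preparation. A hedged sketch of the expected preference format (toy data; DPO-style trainers generally expect `prompt`/`chosen`/`rejected` text columns, but verify against the `DPOTrainer` docs for your version):

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The movie was "],
    "chosen": ["a well-paced, genuinely moving drama."],  # preferred completion
    "rejected": ["a movie."],                             # dispreferred completion
})
```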

## Development

If you want to contribute to `trl` or customize it to your needs make sure to read the [contribution guide](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md) and make sure you do a dev install:

```bash
git clone https://github.com/huggingface/trl.git
cd trl/
make dev
```
## References
|
||||
|
||||
### Proximal Policy Optimisation
|
||||
The PPO implementation largely follows the structure introduced in the paper **"Fine-Tuning Language Models from Human Preferences"** by D. Ziegler et al. \[[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)].
|
||||
The PPO implementation largely follows the structure introduced in the paper **"Fine-Tuning Language Models from Human Preferences"** by D. Ziegler et al. \[[paper](https://huggingface.co/papers/1909.08593), [code](https://github.com/openai/lm-human-preferences)].
|
||||
|
||||
### Direct Preference Optimization
|
||||
DPO is based on the original implementation of **"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"** by E. Mitchell et al. \[[paper](https://huggingface.co/papers/2305.18290), [code](https://github.com/eric-mitchell/direct-preference-optimization)]
|
||||
|
||||
### Language models
|
||||
The language models utilize the `transformers` library by 🤗 Hugging Face.
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
|
||||
@misc{vonwerra2022trl,
|
||||
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert},
|
||||
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang},
|
||||
title = {TRL: Transformer Reinforcement Learning},
|
||||
year = {2020},
|
||||
publisher = {GitHub},
|
||||
journal = {GitHub repository},
|
||||
howpublished = {\url{https://github.com/lvwerra/trl}}
|
||||
howpublished = {\url{https://github.com/huggingface/trl}}
|
||||
}
|
||||
```
|
||||
```
|
||||
|

benchmark/benchmark.py (new file)
```python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import math
import os
import shlex
import subprocess
import uuid
from distutils.util import strtobool

import requests


def parse_args():
    # fmt: off
    parser = argparse.ArgumentParser()
    parser.add_argument("--command", type=str, default="",
        help="the command to run")
    parser.add_argument("--num-seeds", type=int, default=3,
        help="the number of random seeds")
    parser.add_argument("--start-seed", type=int, default=1,
        help="the number of the starting seed")
    parser.add_argument("--workers", type=int, default=0,
        help="the number of workers to run benchmark experiments")
    parser.add_argument("--auto-tag", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
        help="if toggled, the runs will be tagged with git tags, commit, and pull request number if possible")
    parser.add_argument("--slurm-template-path", type=str, default=None,
        help="the path to the slurm template file (see docs for more details)")
    parser.add_argument("--slurm-gpus-per-task", type=int, default=1,
        help="the number of gpus per task to use for slurm jobs")
    parser.add_argument("--slurm-total-cpus", type=int, default=50,
        help="the total number of cpus to use for slurm jobs")
    parser.add_argument("--slurm-ntasks", type=int, default=1,
        help="the number of tasks to use for slurm jobs")
    parser.add_argument("--slurm-nodes", type=int, default=None,
        help="the number of nodes to use for slurm jobs")
    args = parser.parse_args()
    # fmt: on
    return args


def run_experiment(command: str):
    command_list = shlex.split(command)
    print(f"running {command}")

    # Use subprocess.PIPE to capture the output
    fd = subprocess.Popen(command_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, errors = fd.communicate()

    return_code = fd.returncode
    assert return_code == 0, f"Command failed with error: {errors.decode('utf-8')}"

    # Convert bytes to string and strip leading/trailing whitespaces
    return output.decode("utf-8").strip()


def autotag() -> str:
    wandb_tag = ""
    print("autotag feature is enabled")
    git_tag = ""
    try:
        git_tag = subprocess.check_output(["git", "describe", "--tags"]).decode("ascii").strip()
        print(f"identified git tag: {git_tag}")
    except subprocess.CalledProcessError as e:
        print(e)
    if len(git_tag) == 0:
        try:
            count = int(subprocess.check_output(["git", "rev-list", "--count", "HEAD"]).decode("ascii").strip())
            hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("ascii").strip()
            git_tag = f"no-tag-{count}-g{hash}"
            print(f"identified git tag: {git_tag}")
        except subprocess.CalledProcessError as e:
            print(e)
    wandb_tag = f"{git_tag}"

    git_commit = subprocess.check_output(["git", "rev-parse", "--verify", "HEAD"]).decode("ascii").strip()
    try:
        # try finding the pull request number on github
        prs = requests.get(f"https://api.github.com/search/issues?q=repo:huggingface/trl+is:pr+{git_commit}")
        if prs.status_code == 200:
            prs = prs.json()
            if len(prs["items"]) > 0:
                pr = prs["items"][0]
                pr_number = pr["number"]
                wandb_tag += f",pr-{pr_number}"
                print(f"identified github pull request: {pr_number}")
    except Exception as e:
        print(e)

    return wandb_tag


if __name__ == "__main__":
    args = parse_args()
    if args.auto_tag:
        existing_wandb_tag = os.environ.get("WANDB_TAGS", "")
        wandb_tag = autotag()
        if len(wandb_tag) > 0:
            if len(existing_wandb_tag) > 0:
                os.environ["WANDB_TAGS"] = ",".join([existing_wandb_tag, wandb_tag])
            else:
                os.environ["WANDB_TAGS"] = wandb_tag
        print("WANDB_TAGS: ", os.environ.get("WANDB_TAGS", ""))
    commands = []
    for seed in range(0, args.num_seeds):
        commands += [" ".join([args.command, "--seed", str(args.start_seed + seed)])]

    print("======= commands to run:")
    for command in commands:
        print(command)

    if args.workers > 0 and args.slurm_template_path is None:
        from concurrent.futures import ThreadPoolExecutor

        executor = ThreadPoolExecutor(max_workers=args.workers, thread_name_prefix="cleanrl-benchmark-worker-")
        for command in commands:
            executor.submit(run_experiment, command)
        executor.shutdown(wait=True)
    else:
        print("not running the experiments because --workers is set to 0; just printing the commands to run")

    # SLURM logic
    if args.slurm_template_path is not None:
        if not os.path.exists("slurm"):
            os.makedirs("slurm")
        if not os.path.exists("slurm/logs"):
            os.makedirs("slurm/logs")
        print("======= slurm commands to run:")
        with open(args.slurm_template_path) as f:
            slurm_template = f.read()
        slurm_template = slurm_template.replace("{{array}}", f"0-{len(commands) - 1}%{args.workers}")
        slurm_template = slurm_template.replace(
            "{{seeds}}", f"({' '.join([str(args.start_seed + int(seed)) for seed in range(args.num_seeds)])})"
        )
        slurm_template = slurm_template.replace("{{len_seeds}}", f"{args.num_seeds}")
        slurm_template = slurm_template.replace("{{command}}", args.command)
        slurm_template = slurm_template.replace("{{gpus_per_task}}", f"{args.slurm_gpus_per_task}")
        total_gpus = args.slurm_gpus_per_task * args.slurm_ntasks
        slurm_cpus_per_gpu = math.ceil(args.slurm_total_cpus / total_gpus)
        slurm_template = slurm_template.replace("{{cpus_per_gpu}}", f"{slurm_cpus_per_gpu}")
        slurm_template = slurm_template.replace("{{ntasks}}", f"{args.slurm_ntasks}")
        if args.slurm_nodes is not None:
            slurm_template = slurm_template.replace("{{nodes}}", f"#SBATCH --nodes={args.slurm_nodes}")
        else:
            slurm_template = slurm_template.replace("{{nodes}}", "")
        filename = str(uuid.uuid4())
        open(os.path.join("slurm", f"{filename}.slurm"), "w").write(slurm_template)
        slurm_path = os.path.join("slurm", f"{filename}.slurm")
        print(f"saving command in {slurm_path}")
        if args.workers > 0:
            job_id = run_experiment(f"sbatch --parsable {slurm_path}")
            print(f"Job ID: {job_id}")
```
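
A quick way to sanity-check this script locally: with `--workers 0` and no slurm template it only prints the seeded commands it would run. Illustrative invocation:

```bash
python benchmark/benchmark.py \
    --command "python examples/scripts/ppo.py --log_with wandb" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 0
```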

benchmark/benchmark_and_report.sh (new file)
```bash
export WANDB_ENTITY=huggingface
export WANDB_PROJECT=trl
bash $BENCHMARK_SCRIPT > output.txt

# Extract Job IDs into an array
job_ids=($(grep "Job ID:" output.txt | awk '{print $3}'))

# Extract WANDB_TAGS into an array
WANDB_TAGS=($(grep "WANDB_TAGS:" output.txt | awk '{print $2}'))
WANDB_TAGS=($(echo $WANDB_TAGS | tr "," "\n"))

# Print to verify
echo "Job IDs: ${job_ids[@]}"
echo "WANDB_TAGS: ${WANDB_TAGS[@]}"

TAGS_STRING="?tag=${WANDB_TAGS[0]}"
FOLDER_STRING="${WANDB_TAGS[0]}"
for tag in "${WANDB_TAGS[@]:1}"; do
    TAGS_STRING+="&tag=$tag"
    FOLDER_STRING+="_$tag"
done

echo "TAGS_STRING: $TAGS_STRING"
echo "FOLDER_STRING: $FOLDER_STRING"

TAGS_STRING=$TAGS_STRING FOLDER_STRING=$FOLDER_STRING BENCHMARK_PLOT_SCRIPT=$BENCHMARK_PLOT_SCRIPT sbatch --dependency=afterany:$job_ids benchmark/post_github_comment.sbatch
```

benchmark/benchmark_level1.sh (new file)
```bash
# hello world experiment
python benchmark/benchmark.py \
    --command "python examples/scripts/ppo.py --log_with wandb" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template

python benchmark/benchmark.py \
    --command "python examples/scripts/dpo.py --model_name_or_path=gpt2 --per_device_train_batch_size 4 --max_steps 1000 --learning_rate 1e-3 --gradient_accumulation_steps 1 --logging_steps 10 --eval_steps 500 --output_dir="dpo_anthropic_hh" --optim adamw_torch --warmup_steps 150 --report_to wandb --bf16 --logging_first_step --no_remove_unused_columns" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template

python benchmark/benchmark.py \
    --command "python examples/scripts/sft.py --model_name_or_path="facebook/opt-350m" --report_to="wandb" --learning_rate=1.41e-5 --per_device_train_batch_size=64 --gradient_accumulation_steps=16 --output_dir="sft_openassistant-guanaco" --logging_steps=1 --num_train_epochs=3 --max_steps=-1 --push_to_hub --gradient_checkpointing" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template

python benchmark/benchmark.py \
    --command "python examples/scripts/reward_modeling.py --model_name_or_path=facebook/opt-350m --output_dir="reward_modeling_anthropic_hh" --per_device_train_batch_size=64 --num_train_epochs=1 --gradient_accumulation_steps=16 --gradient_checkpointing=True --learning_rate=1.41e-5 --report_to="wandb" --remove_unused_columns=False --optim="adamw_torch" --logging_steps=10 --eval_strategy="steps" --max_length=512" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```

benchmark/benchmark_level1_plot.sh (new file)
```bash
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
echo "we deal with $TAGS_STRING"

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/ppo \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=output_dir&cen=_name_or_path&metrics=train/rewards/accuracies&metrics=train/loss' \
        "gpt2$TAGS_STRING" \
    --env-ids dpo_anthropic_hh \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/dpo \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=output_dir&cen=_name_or_path&metrics=train/loss&metrics=eval/accuracy&metrics=eval/loss' \
        "facebook/opt-350m$TAGS_STRING" \
    --env-ids reward_modeling_anthropic_hh \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/reward_modeling \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=output_dir&cen=_name_or_path&metrics=train/loss' \
        "facebook/opt-350m$TAGS_STRING" \
    --env-ids sft_openassistant-guanaco \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/sft \
    --scan-history

python benchmark/upload_benchmark.py \
    --folder_path="benchmark/trl/$FOLDER_STRING" \
    --path_in_repo="images/benchmark/$FOLDER_STRING" \
    --repo_id="trl-internal-testing/example-images" \
    --repo_type="dataset"
```

benchmark/benchmark_level2.sh (new file)
```bash
# compound experiments: gpt2xl + grad_accu
python benchmark/benchmark.py \
    --command "python examples/scripts/ppo.py --exp_name ppo_gpt2xl_grad_accu --model_name gpt2-xl --mini_batch_size 16 --gradient_accumulation_steps 8 --log_with wandb" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template

# compound experiments: Cerebras-GPT-6.7B + deepspeed zero2 + grad_accu
python benchmark/benchmark.py \
    --command "accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml examples/scripts/ppo.py --exp_name ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2 --batch_size 32 --mini_batch_size 32 --log_with wandb --model_name cerebras/Cerebras-GPT-6.7B --reward_model sentiment-analysis:cerebras/Cerebras-GPT-6.7B" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 8 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 90 \
    --slurm-template-path benchmark/trl.slurm_template
```

benchmark/benchmark_level2_plot.sh (new file)
```bash
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
echo "we deal with $TAGS_STRING"

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2xl_grad_accu$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/different_models \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2$TAGS_STRING" \
    --env-ids sentiment-analysis:cerebras/Cerebras-GPT-6.7B \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/deepspeed \
    --scan-history

python benchmark/upload_benchmark.py \
    --folder_path="benchmark/trl/$FOLDER_STRING" \
    --path_in_repo="images/benchmark/$FOLDER_STRING" \
    --repo_id="trl-internal-testing/example-images" \
    --repo_type="dataset"
```

benchmark/benchmark_level3.sh (new file)
```bash
## w/ and w/o gradient accumulation
python benchmark/benchmark.py \
    --command "python examples/scripts/ppo.py --exp_name ppo_step_grad_accu --mini_batch_size 1 --gradient_accumulation_steps 128 --log_with wandb" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template

## w/ different models (gpt2, gpt2-xl, falcon, llama2)
python benchmark/benchmark.py \
    --command "python examples/scripts/ppo.py --exp_name ppo_gpt2 --log_with wandb" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
python benchmark/benchmark.py \
    --command "python examples/scripts/ppo.py --exp_name ppo_falcon_rw_1b --model_name tiiuae/falcon-rw-1b --log_with wandb" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template


## w/ and w/o PEFT
python benchmark/benchmark.py \
    --command "python examples/scripts/ppo.py --exp_name ppo_peft --use_peft --log_with wandb" \
    --num-seeds 3 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```

benchmark/plot.sh (new file)
```bash
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
BASELINE_PR_TAG=v0.4.7-55-g110e672
BASELINE_PR_NAME=PR-662

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$BASELINE_PR_TAG/sentiment \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
        "sentiment_tuning_step_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb gradient accumulation ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$BASELINE_PR_TAG/gradient_accu \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
        "sentiment_tuning_gpt2?tag=$BASELINE_PR_TAG&cl=sentiment gpt2 ($BASELINE_PR_NAME)" \
        "sentiment_tuning_falcon_rw_1b?tag=$BASELINE_PR_TAG&cl=sentiment tiiuae/falcon-rw-1b ($BASELINE_PR_NAME)" \
        "sentiment_tuning_gpt2xl_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment gpt2xl ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$BASELINE_PR_TAG/different_models \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
        "sentiment_tuning_peft?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb w/ peft ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$BASELINE_PR_TAG/peft \
    --scan-history


python benchmark/upload_benchmark.py \
    --folder_path="benchmark/trl/$BASELINE_PR_TAG" \
    --path_in_repo="images/benchmark/$BASELINE_PR_TAG" \
    --repo_id="trl-internal-testing/example-images" \
    --repo_type="dataset"
```

benchmark/post_github_comment.py (new file)
```python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os

from ghapi.all import GhApi


FOLDER_STRING = os.environ.get("FOLDER_STRING", "")
folder = f"benchmark/trl/{FOLDER_STRING}"
host_url = f"https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/{FOLDER_STRING}"

# Create a GitHub API instance
github_context = json.loads(os.environ["GITHUB_CONTEXT"])
token = os.environ["PERSONAL_ACCESS_TOKEN_GITHUB"]  # this needs to be refreshed every 12 months
status_message = "**[COSTA BENCHMARK BOT]**: Here are the results"
body = status_message
repo = github_context["repository"]
owner, repo = repo.split("/")
api = GhApi(owner=owner, repo=repo, token=token)

# for each `.png` file in the folder, add it to the comment as a markdown image
# (the image-link line was garbled in transit; this reconstruction follows the
# comment above and the `host_url` defined earlier)
for file in os.listdir(folder):
    if file.endswith(".png"):
        body += f"\n![{file}]({host_url}/{file})"

# Create a comment on the issue
api.issues.create_comment(issue_number=github_context["event"]["issue"]["number"], body=body)
```

benchmark/post_github_comment.sbatch (new file)
```bash
#!/bin/bash
#SBATCH --job-name=trl
#SBATCH --partition=hopper-cpu
#SBATCH --ntasks=1
#SBATCH --output=slurm/logs/%x_%j.out

sleep 2m
bash $BENCHMARK_PLOT_SCRIPT
srun python benchmark/post_github_comment.py
```

benchmark/regression_test.sh (new file)
```bash
BENCHMARK_SCRIPT="benchmark/benchmark_level1.sh" \
    BENCHMARK_PLOT_SCRIPT="benchmark/benchmark_level1_plot.sh" \
    bash benchmark/benchmark_and_report.sh
```

benchmark/trl.slurm_template (new file)
```bash
#!/bin/bash
#SBATCH --job-name=trl
#SBATCH --partition=hopper-prod
#SBATCH --gpus-per-task={{gpus_per_task}}
#SBATCH --cpus-per-gpu={{cpus_per_gpu}}
#SBATCH --ntasks={{ntasks}}
#SBATCH --output=slurm/logs/%x_%j.out
#SBATCH --array={{array}}
##SBATCH --exclude=ip-26-0-149-199

module load cuda/12.1

{{nodes}}

seeds={{seeds}}
seed=${seeds[$SLURM_ARRAY_TASK_ID % {{len_seeds}}]}

echo "Running task $SLURM_ARRAY_TASK_ID with seed: $seed"
srun {{command}} --seed $seed
```
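
To make the placeholder mechanics concrete, here is an illustrative sketch of what `benchmark.py` would produce from this template for `--num-seeds 3 --workers 10 --slurm-gpus-per-task 1 --slurm-ntasks 1 --slurm-total-cpus 12` (values derived from the substitution logic above, not an actual generated file; `{{nodes}}` is empty when `--slurm-nodes` is not passed):

```bash
#!/bin/bash
#SBATCH --job-name=trl
#SBATCH --partition=hopper-prod
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-gpu=12
#SBATCH --ntasks=1
#SBATCH --output=slurm/logs/%x_%j.out
#SBATCH --array=0-2%10

module load cuda/12.1

seeds=(1 2 3)
seed=${seeds[$SLURM_ARRAY_TASK_ID % 3]}

echo "Running task $SLURM_ARRAY_TASK_ID with seed: $seed"
srun python examples/scripts/ppo.py --log_with wandb --seed $seed
```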

benchmark/upload_benchmark.py (new file)
```python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass

import tyro
from huggingface_hub import HfApi


@dataclass
class Args:
    folder_path: str = "benchmark/trl"
    path_in_repo: str = "images/benchmark"
    repo_id: str = "trl-internal-testing/example-images"
    repo_type: str = "dataset"


args = tyro.cli(Args)
api = HfApi()

api.upload_folder(
    folder_path=args.folder_path,
    path_in_repo=args.path_in_repo,
    repo_id=args.repo_id,
    repo_type=args.repo_type,
)
```

commands/run_dpo.sh (new file)
```bash
#!/bin/bash
# This script runs a DPO example end-to-end on a tiny model using different possible configurations
# but defaults to QLoRA + PEFT
OUTPUT_DIR="test_dpo/"
MODEL_NAME="trl-internal-testing/tiny-random-LlamaForCausalLM"
DATASET_NAME="trl-internal-testing/hh-rlhf-helpful-base-trl-style"
MAX_STEPS=5
BATCH_SIZE=2
SEQ_LEN=128

# Handle extra arguments in case one passes accelerate configs.
EXTRA_ACCELERATE_ARGS=""
EXTRA_TRAINING_ARGS="""--use_peft \
    --load_in_4bit
"""

# Set your number of GPUs here
NUM_GPUS=2

if [[ "${TRL_ACCELERATE_CONFIG}" == "" ]]; then
  EXTRA_ACCELERATE_ARGS=""
else
  EXTRA_ACCELERATE_ARGS="--config_file $TRL_ACCELERATE_CONFIG"
  # For DeepSpeed configs we need to set the `--fp16` flag to comply with our configs exposed
  # on `examples/accelerate_configs`; our runners do not support bf16 mixed precision training.
  if [[ $TRL_ACCELERATE_CONFIG == *"deepspeed"* ]]; then
    EXTRA_TRAINING_ARGS="--fp16"
  else
    echo "Keeping QLoRA + PEFT"
  fi
fi


CMD="""
accelerate launch $EXTRA_ACCELERATE_ARGS \
    --num_processes $NUM_GPUS \
    --mixed_precision 'fp16' \
    `pwd`/examples/scripts/dpo.py \
    --model_name_or_path $MODEL_NAME \
    --dataset_name $DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --max_steps $MAX_STEPS \
    --per_device_train_batch_size $BATCH_SIZE \
    --max_length $SEQ_LEN \
    $EXTRA_TRAINING_ARGS
"""

echo "Starting program..."

{ # try
    echo $CMD
    eval "$CMD"
} || { # catch
    # save log for exception
    echo "Operation Failed!"
    exit 1
}
exit 0
```
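
These command scripts are what the Makefile's `test_examples` target drives: it exports `TRL_ACCELERATE_CONFIG` for each config under `examples/accelerate_configs`. You can do the same by hand, e.g.:

```bash
# Run the DPO smoke test with a specific accelerate config
TRL_ACCELERATE_CONFIG=examples/accelerate_configs/deepspeed_zero2.yaml bash commands/run_dpo.sh
```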

commands/run_sft.sh (new file)
```bash
#!/bin/bash
# This script runs an SFT example end-to-end on a tiny model using different possible configurations
# but defaults to QLoRA + PEFT
OUTPUT_DIR="test_sft/"
MODEL_NAME="trl-internal-testing/tiny-random-LlamaForCausalLM"
DATASET_NAME="stanfordnlp/imdb"
MAX_STEPS=5
BATCH_SIZE=2
SEQ_LEN=128


# Handle extra arguments in case one passes accelerate configs.
EXTRA_ACCELERATE_ARGS=""
EXTRA_TRAINING_ARGS="""--use_peft \
    --load_in_4bit
"""

# Set your number of GPUs here
NUM_GPUS=2

if [[ "${TRL_ACCELERATE_CONFIG}" == "" ]]; then
  EXTRA_ACCELERATE_ARGS=""
else
  EXTRA_ACCELERATE_ARGS="--config_file $TRL_ACCELERATE_CONFIG"
  # For DeepSpeed configs we need to set the `--fp16` flag to comply with our configs exposed
  # on `examples/accelerate_configs`; our runners do not support bf16 mixed precision training.
  if [[ $TRL_ACCELERATE_CONFIG == *"deepspeed"* ]]; then
    EXTRA_TRAINING_ARGS="--fp16"
  else
    echo "Keeping QLoRA + PEFT"
  fi
fi


CMD="""
accelerate launch $EXTRA_ACCELERATE_ARGS \
    --num_processes $NUM_GPUS \
    --mixed_precision 'fp16' \
    `pwd`/examples/scripts/sft.py \
    --model_name $MODEL_NAME \
    --dataset_name $DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --max_steps $MAX_STEPS \
    --dataset_text_field 'text' \
    --per_device_train_batch_size $BATCH_SIZE \
    --max_seq_length $SEQ_LEN \
    $EXTRA_TRAINING_ARGS
"""

echo "Starting program..."

{ # try
    echo $CMD
    eval "$CMD"
} || { # catch
    # save log for exception
    echo "Operation Failed!"
    exit 1
}
exit 0
```

docker/trl-latest-gpu/Dockerfile (new file)
```dockerfile
# Builds GPU docker image of PyTorch
# Uses multi-staged approach to reduce size
# Stage 1
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.10
# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN apt-get update && \
    apt-get install -y curl git wget software-properties-common git-lfs && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists*

# Install audio-related libraries
RUN apt-get update && \
    apt install -y ffmpeg

RUN apt install -y libsndfile1-dev
RUN git lfs install

# Create our conda env - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN conda create --name trl python=${PYTHON_VERSION} ipython jupyter pip
RUN python3 -m pip install --no-cache-dir --upgrade pip

# Below is copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
# We don't install pytorch here yet since CUDA isn't available
# instead we use the direct torch wheel
ENV PATH /opt/conda/envs/trl/bin:$PATH
# Activate our bash shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]

# Stage 2
FROM nvidia/cuda:12.2.2-devel-ubuntu22.04 AS build-image
COPY --from=compile-image /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH

RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
RUN source activate trl && \
    python3 -m pip install --no-cache-dir bitsandbytes optimum auto-gptq

# Install apt libs
RUN apt-get update && \
    apt-get install -y curl git wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists*

# Activate the conda env and install the latest released transformers, accelerate, peft and trl
RUN source activate trl && \
    python3 -m pip install -U --no-cache-dir \
    librosa \
    "soundfile>=0.12.1" \
    scipy \
    transformers \
    accelerate \
    peft \
    trl[test]@git+https://github.com/huggingface/trl

RUN source activate trl && \
    pip freeze | grep trl

RUN echo "source activate trl" >> ~/.profile

# Activate the virtualenv
CMD ["/bin/bash"]
```
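
A hedged usage sketch for either image (the tag name is arbitrary, and `--gpus all` assumes the NVIDIA Container Toolkit is installed on the host):

```bash
docker build -t trl-latest-gpu docker/trl-latest-gpu
docker run --gpus all -it trl-latest-gpu
```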

docker/trl-source-gpu/Dockerfile (new file)
```dockerfile
# Builds GPU docker image of PyTorch
# Uses multi-staged approach to reduce size
# Stage 1
# Use base conda image to reduce time
FROM continuumio/miniconda3:latest AS compile-image
# Specify py version
ENV PYTHON_VERSION=3.10
# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN apt-get update && \
    apt-get install -y curl git wget software-properties-common git-lfs && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists*

# Install audio-related libraries
RUN apt-get update && \
    apt install -y ffmpeg

RUN apt install -y libsndfile1-dev
RUN git lfs install

# Create our conda env - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
RUN conda create --name trl python=${PYTHON_VERSION} ipython jupyter pip
RUN python3 -m pip install --no-cache-dir --upgrade pip

# Below is copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
# We don't install pytorch here yet since CUDA isn't available
# instead we use the direct torch wheel
ENV PATH /opt/conda/envs/trl/bin:$PATH
# Activate our bash shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]

# Stage 2
FROM nvidia/cuda:12.2.2-devel-ubuntu22.04 AS build-image
COPY --from=compile-image /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH

RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]
RUN source activate trl && \
    python3 -m pip install --no-cache-dir bitsandbytes optimum auto-gptq

# Install apt libs
RUN apt-get update && \
    apt-get install -y curl git wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists*

# Activate the conda env and install transformers + accelerate from source
RUN source activate trl && \
    python3 -m pip install -U --no-cache-dir \
    librosa \
    "soundfile>=0.12.1" \
    scipy \
    git+https://github.com/huggingface/transformers \
    git+https://github.com/huggingface/accelerate \
    git+https://github.com/huggingface/peft \
    trl[test]@git+https://github.com/huggingface/trl

RUN source activate trl && \
    pip freeze | grep transformers

RUN echo "source activate trl" >> ~/.profile

# Activate the virtualenv
CMD ["/bin/bash"]
```

docs/source/_toctree.yml
@@ -1,24 +1,86 @@
```yaml
- sections:
  - local: index
    title: TRL
  - local: installation
    title: Installation
  - local: quickstart
    title: Quickstart
  - local: clis
    title: Get started with Command Line Interfaces (CLIs)
  - local: dataset_formats
    title: Dataset Formats
  - local: how_to_train
    title: PPO Training FAQ
  - local: use_model
    title: Use Trained Models
  - local: customization
    title: Customize the Training
  - local: logging
    title: Understanding Logs
  title: Get started
- sections:
  - sections:  # Sort alphabetically
    - local: alignprop_trainer
      title: AlignProp
    - local: bco_trainer
      title: BCO
    - local: cpo_trainer
      title: CPO
    - local: ddpo_trainer
      title: DDPO
    - local: dpo_trainer
      title: DPO
    - local: online_dpo_trainer
      title: Online DPO
    - local: gkd_trainer
      title: GKD
    - local: kto_trainer
      title: KTO
    - local: nash_md_trainer
      title: Nash-MD
    - local: orpo_trainer
      title: ORPO
    - local: ppo_trainer
      title: PPO
    - local: ppov2_trainer
      title: PPOv2
    - local: reward_trainer
      title: Reward
    - local: rloo_trainer
      title: RLOO
    - local: sft_trainer
      title: SFT
    - local: iterative_sft_trainer
      title: Iterative SFT
    - local: xpo_trainer
      title: XPO
    title: Trainers
  - local: models
    title: Model Classes
  - local: trainer
    title: Trainer Classes
  - local: best_of_n
    title: Best of N Sampling
  - local: judges
    title: Judges
  - local: callbacks
    title: Callbacks
  - local: data_utils
    title: Data Utilities
  - local: text_environments
    title: Text Environments
  title: API
- sections:
  - local: example_overview
    title: Example Overview
  - local: sentiment_tuning
    title: Sentiment Tuning
  - local: summarization_reward_tuning
    title: Summarization Reward Tuning
  - local: lora_tuning_peft
    title: Training with PEFT
  - local: detoxifying_a_lm
    title: Detoxifying a Language Model
  - local: using_llama_models
    title: Training StackLlama
  - local: learning_tools
    title: Learning to Use Tools
  - local: multi_adapter_rl
    title: Multi Adapter RLHF
  title: Examples
```
docs/source/alignprop_trainer.mdx (new file, 91 lines)
@@ -0,0 +1,91 @@
# Aligning Text-to-Image Diffusion Models with Reward Backpropagation

## The why

If your reward function is differentiable, directly backpropagating gradients from the reward model to the diffusion model is significantly more sample- and compute-efficient (~25x) than a policy gradient algorithm like DDPO.
AlignProp does full backpropagation through time, which allows updating the earlier denoising steps via reward backpropagation.

<div style="text-align: center"><img src="https://align-prop.github.io/reward_tuning.png"/></div>

## Getting started with `examples/scripts/alignprop.py`

The `alignprop.py` script is a working example of using the `AlignProp` trainer to finetune a Stable Diffusion model. This example explicitly configures a small subset of the overall parameters associated with the config object (`AlignPropConfig`).

**Note:** one A100 GPU is recommended to get this running. For lower-memory settings, consider setting `truncated_backprop_rand` to `False`. With default settings this will do truncated backpropagation with K=1.

Almost every configuration parameter has a default; only one command-line flag is required to get things up and running. The user is expected to have a [huggingface user access token](https://huggingface.co/docs/hub/security-tokens) that will be used to upload the model to the HuggingFace Hub after finetuning. The following bash command gets things running:

```bash
python alignprop.py --hf_user_access_token <token>
```

To obtain the documentation of `alignprop.py`, please run `python alignprop.py --help`

The following are things to keep in mind (the code checks this for you as well) in general while configuring the trainer, beyond the use case of the example script; a config sketch follows the list:

- For the configurable randomized truncation range (`--alignprop_config.truncated_rand_backprop_minmax=(0,50)`), the first number should be greater than or equal to 0, while the second number should be less than or equal to the number of diffusion timesteps (`sample_num_steps`).
- The configurable truncation backprop absolute step (`--alignprop_config.truncated_backprop_timestep=49`) should be less than the number of diffusion timesteps (`sample_num_steps`); it only matters when `truncated_backprop_rand` is set to `False`.
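To make these constraints concrete, here is a minimal sketch of the relevant fields on `AlignPropConfig` (the values are illustrative; the field names are taken from the flags above):

```python
from trl import AlignPropConfig

# Illustrative values only; they must satisfy the checks described above.
config = AlignPropConfig(
    sample_num_steps=50,                     # number of diffusion timesteps
    truncated_backprop_rand=True,            # randomized truncated backprop (the default)
    truncated_rand_backprop_minmax=(0, 50),  # must lie within [0, sample_num_steps]
    truncated_backprop_timestep=49,          # only used when truncated_backprop_rand=False
)
```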
## Setting up the image logging hook function

Expect the function to be given a dictionary with keys
```python
['image', 'prompt', 'prompt_metadata', 'rewards']
```
and `image`, `prompt`, `prompt_metadata`, `rewards` are batched.
You are free to log however you want; the use of `wandb` or `tensorboard` is recommended.

### Key terms

- `rewards` : The reward/score is a numerical value associated with the generated image and is key to steering the RL process
- `prompt` : The prompt is the text that is used to generate the image
- `prompt_metadata` : The prompt metadata is the metadata associated with the prompt. A situation where this will not be empty is when the reward model comprises a [`FLAVA`](https://huggingface.co/docs/transformers/model_doc/flava) setup, where questions and ground answers (linked to the generated image) are expected with the generated image (see here: https://github.com/kvablack/ddpo-pytorch/blob/main/ddpo_pytorch/rewards.py#L45)
- `image` : The image generated by the Stable Diffusion model

Example code for logging sampled images with `wandb` is given below.

```python
# for logging these images to wandb
import numpy as np
from PIL import Image


def image_outputs_hook(image_data, global_step, accelerate_logger):
    # For the sake of this example, we log every image in the batch:
    # convert each image tensor to a resized PIL image, keyed by its
    # (truncated) prompt and reward score
    result = {}
    images, prompts, rewards = [image_data["images"], image_data["prompts"], image_data["rewards"]]
    for i, image in enumerate(images):
        pil = Image.fromarray(
            (image.cpu().numpy().transpose(1, 2, 0) * 255).astype(np.uint8)
        )
        pil = pil.resize((256, 256))
        result[f"{prompts[i]:.25} | {rewards[i]:.2f}"] = [pil]
    accelerate_logger.log_images(
        result,
        step=global_step,
    )
```
### Using the finetuned model

Assuming you're done with all the epochs and have pushed your model up to the hub, you can use the finetuned model as follows

```python
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")

pipeline.load_lora_weights('mihirpd/alignprop-trl-aesthetics')

prompts = ["squirrel", "crab", "starfish", "whale", "sponge", "plankton"]
results = pipeline(prompts)

for prompt, image in zip(prompts, results.images):
    image.save(f"dump/{prompt}.png")
```

## Credits

This work is heavily influenced by the repo [here](https://github.com/mihirp1998/AlignProp/) and the associated paper [Aligning Text-to-Image Diffusion Models with Reward Backpropagation by Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki](https://huggingface.co/papers/2310.03739).
docs/source/bco_trainer.mdx (new file, 139 lines)
@@ -0,0 +1,139 @@
# BCO Trainer

TRL supports Binary Classifier Optimization (BCO).
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward, so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
For a full example have a look at [`examples/scripts/bco.py`].

## Expected dataset format

The BCO trainer expects a very specific format for the dataset, as it does not require pairwise preferences. Since the model will be trained to directly optimize examples that consist of a prompt, a model completion, and a label indicating whether the completion is "good" or "bad", we expect a dataset with the following columns:

- `prompt`
- `completion`
- `label`

for example:

```python
bco_dataset_dict = {
    "prompt": [
        "Hey, hello",
        "How are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "completion": [
        "hi nice to meet you",
        "leave me alone",
        "I don't have a name",
        "My name is Mary",
        "Python",
        "C++",
        "Java",
    ],
    "label": [
        True,
        False,
        False,
        True,
        True,
        False,
        False,
    ],
}
```

where the `prompt` contains the context inputs, `completion` contains the corresponding responses, and `label` contains the corresponding flag that indicates if the generated completion is desired (`True`) or undesired (`False`).
A prompt can have multiple responses, and this is reflected in the entries being repeated in the dictionary's value arrays. The dataset must contain at least one desirable and one undesirable completion.
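To pass such a dictionary to the trainer, you would typically wrap it in a 🤗 Datasets object first; a minimal sketch:

```python
from datasets import Dataset

# Wrap the raw dictionary above so it can be passed as `train_dataset`
train_dataset = Dataset.from_dict(bco_dataset_dict)
```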
## Expected model format
The BCO trainer expects a model of `AutoModelForCausalLM`, compared to PPO, which expects `AutoModelForCausalLMWithValueHead` for the value function.

## Using the `BCOTrainer`

For a detailed example have a look at the `examples/scripts/bco.py` script. At a high level we need to initialize the `BCOTrainer` with a `model` we wish to train and a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected responses.

The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e. decoder-only or encoder-decoder).

```py
training_args = BCOConfig(
    beta=0.1,
)

bco_trainer = BCOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```
After this one can then call:

```py
bco_trainer.train()
```
## Underlying Distribution matching (UDM)

In practical scenarios, the thumbs-up and thumbs-down datasets are likely to have divergent underlying distributions of prompts.
Consider an LLM deployed for user feedback: if the model excels in writing tasks but underperforms in coding, the thumbs-up dataset will be dominated by writing-related prompts, while the thumbs-down dataset will contain mostly coding-related prompts.
If the prompts in your desired and undesired datasets differ a lot, it is useful to enable UDM.

Choose an embedding model and tokenizer:

```py
from functools import partial

from accelerate import Accelerator
from transformers import AutoModel, AutoTokenizer

embedding_model = AutoModel.from_pretrained(your_model_id)
embedding_tokenizer = AutoTokenizer.from_pretrained(your_model_id)

# customize this function depending on your embedding model
def embed_prompt(input_ids, attention_mask, model):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.last_hidden_state.mean(dim=1)

embedding_model = Accelerator().prepare_model(embedding_model)
embedding_func = partial(embed_prompt, model=embedding_model)
```
Set `prompt_sample_size` to define how many prompts are selected to train the UDM classifier, and start the training with the provided embedding function:

```py
training_args = BCOConfig(
    beta=0.1,
    prompt_sample_size=512,
)

bco_trainer = BCOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    embedding_func=embedding_func,
    embedding_tokenizer=embedding_tokenizer,
)

bco_trainer.train()
```

### For Mixture of Experts Models: Enabling the auxiliary loss

MOEs are most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.

This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
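As an illustration, a minimal sketch of loading an MoE policy with these options (the checkpoint name is a placeholder, not taken from this document; `from_pretrained` forwards such keyword arguments to the model config):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # placeholder MoE checkpoint
    output_router_logits=True,      # include the load-balancing auxiliary loss
    router_aux_loss_coef=0.001,     # weight of the auxiliary loss (the default)
)
```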
## BCOTrainer

[[autodoc]] BCOTrainer

## BCOConfig

[[autodoc]] BCOConfig
docs/source/best_of_n.mdx (new file, 72 lines)
@@ -0,0 +1,72 @@
# Best of N sampling: Alternative ways to get better model output without RL based fine-tuning

Within the extras module is the `best-of-n` sampler class that serves as an alternative method of generating better model output.
As to how it fares against the RL-based fine-tuning, please look in the `examples` directory for a comparison example.

## Usage

To get started quickly, instantiate the class with a model, a length sampler, a tokenizer, and a callable that serves as a proxy reward pipeline that outputs reward scores for input queries

```python
from transformers import pipeline, AutoTokenizer

from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler

ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model_name)
reward_pipe = pipeline("sentiment-analysis", model=reward_model, device=device)
tokenizer = AutoTokenizer.from_pretrained(ref_model_name)
tokenizer.pad_token = tokenizer.eos_token


# callable that takes a list of raw text and returns a list of corresponding reward scores
def queries_to_scores(list_of_strings):
    return [output["score"] for output in reward_pipe(list_of_strings)]


best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler)
```
And assuming you have a list/tensor of tokenized queries, you can generate better output by calling the `generate` method

```python
best_of_n.generate(query_tensors, device=device, **gen_kwargs)
```

The default sample size is 4, but you can change it at the time of instance initialization like so

```python
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, sample_size=8)
```

The default output is the result of taking the top scored output for each query, but you can change it to top 2 and so on by passing the `n_candidates` argument at the time of instance initialization

```python
best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, n_candidates=2)
```

There is the option of setting the generation settings (like `temperature`, `pad_token_id`) at the time of instance creation as opposed to when calling the `generate` method.
This is done by passing a `GenerationConfig` from the `transformers` library at the time of initialization

```python
from transformers import GenerationConfig

generation_config = GenerationConfig(min_length=-1, top_k=0.0, top_p=1.0, do_sample=True, pad_token_id=tokenizer.eos_token_id)

best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, generation_config=generation_config)

best_of_n.generate(query_tensors, device=device)
```

Furthermore, at the time of initialization you can set the seed to control the repeatability of the generation process and the number of samples to generate for each query.
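The snippet corresponding to that last sentence is cut off in this dump; a minimal sketch of what it describes, assuming `seed` and `sample_size` keyword arguments on the sampler:

```python
best_of_n = BestOfNSampler(
    model,
    tokenizer,
    queries_to_scores,
    length_sampler=output_length_sampler,
    seed=42,        # control repeatability of the generation process
    sample_size=4,  # number of samples generated for each query
)
```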
docs/source/callbacks.mdx (new file, 17 lines)
@@ -0,0 +1,17 @@
# Callbacks

## SyncRefModelCallback

[[autodoc]] SyncRefModelCallback

## RichProgressCallback

[[autodoc]] RichProgressCallback

## WinRateCallback

[[autodoc]] WinRateCallback

## LogCompletionsCallback

[[autodoc]] LogCompletionsCallback
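These are regular `transformers` trainer callbacks, so they can be passed via the standard `callbacks` argument of a TRL trainer; a minimal sketch (assuming `RichProgressCallback` is exported at the top level, with placeholder trainer arguments):

```python
from trl import RichProgressCallback, SFTTrainer

# Replace the default progress reporting with a rich-based display
trainer = SFTTrainer(
    model=model,                   # placeholder: any causal LM
    train_dataset=train_dataset,   # placeholder: any SFT-compatible dataset
    callbacks=[RichProgressCallback()],
)
trainer.train()
```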
docs/source/clis.mdx (new file, 119 lines)
@@ -0,0 +1,119 @@
# Command Line Interfaces (CLIs)

You can use TRL to fine-tune your Language Model with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), or even chat with your model, using the TRL CLIs.

Currently supported CLIs are:

- `trl sft`: fine-tune an LLM on a text/instruction dataset
- `trl dpo`: fine-tune an LLM with DPO on a preference dataset
- `trl chat`: quickly spin up an LLM fine-tuned for chatting

## Fine-tuning with the CLI

Before getting started, pick a Language Model from the Hugging Face Hub. Supported models can be found with the "text-generation" filter within models. Also make sure to pick a relevant dataset for your task.

Before using the `sft` or `dpo` commands, make sure to run:
```bash
accelerate config
```
and pick the right configuration for your training setup (single / multi-GPU, DeepSpeed, etc.). Make sure to complete all steps of `accelerate config` before running any CLI command.
We also recommend passing a YAML config file to configure your training protocol. Below is a simple example of a YAML file that you can use for training your models with the `trl sft` command.

```yaml
model_name_or_path: trl-internal-testing/tiny-random-LlamaForCausalLM
dataset_name: stanfordnlp/imdb
dataset_text_field: text
report_to: none
learning_rate: 0.0001
lr_scheduler_type: cosine
```

Save that config in a `.yaml` file and get started immediately! An example CLI config is available as `examples/cli_configs/example_config.yaml`. Note that you can overwrite the arguments from the config file by explicitly passing them to the CLI, e.g. from the root folder:

```bash
trl sft --config examples/cli_configs/example_config.yaml --output_dir test-trl-cli --lr_scheduler_type cosine_with_restarts
```

This will force-use `cosine_with_restarts` for `lr_scheduler_type`.
### Supported Arguments

We support all arguments from `transformers.TrainingArguments`; for loading your model, we support all arguments from `~trl.ModelConfig`:

[[autodoc]] ModelConfig

You can pass any of these arguments either to the CLI or the YAML file.

### Supervised Fine-tuning (SFT)

Follow the basic instructions above and run `trl sft --output_dir <output_dir> <*args>`:

```bash
trl sft --model_name_or_path facebook/opt-125m --dataset_name stanfordnlp/imdb --output_dir opt-sft-imdb
```

The SFT CLI is based on the `examples/scripts/sft.py` script.

### Direct Preference Optimization (DPO)

To use the DPO CLI, you need to have a dataset in the TRL format such as

* TRL's Anthropic HH dataset: https://huggingface.co/datasets/trl-internal-testing/hh-rlhf-helpful-base-trl-style
* TRL's OpenAI TL;DR summarization dataset: https://huggingface.co/datasets/trl-internal-testing/tldr-preference-trl-style

These datasets always have at least three columns, `prompt`, `chosen`, and `rejected`:

* `prompt` is a list of strings.
* `chosen` is the chosen response in [chat format](https://huggingface.co/docs/transformers/main/en/chat_templating)
* `rejected` is the rejected response in [chat format](https://huggingface.co/docs/transformers/main/en/chat_templating)

For a quick start, you can run the following command:

```bash
trl dpo --model_name_or_path facebook/opt-125m --output_dir trl-hh-rlhf --dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style
```

The DPO CLI is based on the `examples/scripts/dpo.py` script.
#### Custom preference dataset

Format the dataset into TRL format (you can adapt `examples/datasets/anthropic_hh.py`):

```bash
python examples/datasets/anthropic_hh.py --push_to_hub --hf_entity your-hf-org
```
## Chat interface

The chat CLI lets you quickly load the model and talk to it. Simply run the following:

```bash
trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
```

> [!TIP]
> To use the chat CLI with the developer installation, you must run `make dev`

Note that the chat interface relies on the tokenizer's [chat template](https://huggingface.co/docs/transformers/chat_templating) to format the inputs for the model. Make sure your tokenizer has a chat template defined.

Besides talking to the model, there are a few commands you can use:

- **clear**: clears the current conversation and starts a new one
- **example {NAME}**: loads the example named `{NAME}` from the config and uses it as the user input
- **set {SETTING_NAME}={SETTING_VALUE};**: changes the system prompt or generation settings (multiple settings are separated by a ';')
- **reset**: same as clear, but also resets the generation configs to defaults if they have been changed by **set**
- **save {SAVE_NAME} (optional)**: saves the current chat and settings to file, by default to `./chat_history/{MODEL_NAME}/chat_{DATETIME}.yaml`, or to `{SAVE_NAME}` if provided
- **exit**: closes the interface

The default examples are defined in `examples/scripts/config/default_chat_config.yaml`, but you can pass your own with `--config CONFIG_FILE`, where you can also specify the default generation parameters.
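For instance, following the `set` syntax above, a mid-session tweak might look like this (the setting names are standard generation parameters, used here purely illustratively):

```
set temperature=0.7; max_new_tokens=256;
```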
docs/source/cpo_trainer.mdx (new file, 113 lines)
@@ -0,0 +1,113 @@
# CPO Trainer

Contrastive Preference Optimization (CPO) was introduced in the paper [Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation](https://huggingface.co/papers/2401.08417) by Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. At a high level, CPO trains models to avoid generating adequate, but not perfect, translations in Machine Translation (MT) tasks. However, CPO is a general approximation to the DPO loss and can be applied to other domains like chat.

CPO aims to mitigate two fundamental shortcomings of SFT. First, SFT's methodology of minimizing the discrepancy between predicted outputs and gold-standard references inherently caps model performance at the quality level of the training data. Secondly, SFT lacks a mechanism to prevent the model from rejecting mistakes in translations. The CPO objective is derived from the DPO objective.

## SimPO
The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the `CPOTrainer`. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, simply turn on `loss_type="simpo"` and `cpo_alpha=0` in the `CPOConfig`.

## CPO-SimPO
We also offer the combined use of CPO and SimPO, which enables more stable training and improved performance. Learn more details at [CPO-SimPO GitHub](https://github.com/fe1ixxu/CPO_SIMPO). To use this method, simply enable SimPO by setting `loss_type="simpo"` and a non-zero `cpo_alpha` in the `CPOConfig`.

## Expected dataset format

The CPO trainer expects a format identical to the DPO trainer, which should include three entries. These entries should be named as follows:

- `prompt`
- `chosen`
- `rejected`

for example:

```py
cpo_dataset_dict = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}
```
where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses, and `rejected` contains the corresponding negative (rejected) responses. As can be seen, a prompt can have multiple responses, and this is reflected in the entries being repeated in the dictionary's value arrays.
## Expected model format
The CPO trainer expects a model of `AutoModelForCausalLM`, compared to PPO, which expects `AutoModelForCausalLMWithValueHead` for the value function.

## Using the `CPOTrainer`
For a detailed example have a look at the `examples/scripts/cpo.py` script. At a high level we need to initialize the `CPOTrainer` with a `model` we wish to train. **Note that CPOTrainer eliminates the need to use the reference model, simplifying the optimization process.** The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above.

```py
cpo_config = CPOConfig(
    beta=0.1,
)

cpo_trainer = CPOTrainer(
    model,
    args=cpo_config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```
After this one can then call:

```py
cpo_trainer.train()
```
## Loss functions

Given the preference data, the `CPOTrainer` uses the sigmoid loss on the normalized likelihood via `logsigmoid` to fit a logistic regression.

The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. The `CPOTrainer` can be switched to this loss via the `loss_type="hinge"` argument, and the `beta` in this case is the reciprocal of the margin.

The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the CPO algorithms, identify an issue with overfitting, and propose an alternative loss which can be used via the `loss_type="ipo"` argument to the trainer. Note that the `beta` parameter is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta` the larger this gap is. As per the paper, the loss is averaged over the log-likelihoods of the completion (unlike CPO, which only sums). A configuration sketch for switching losses follows.
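For example, switching the trainer to the hinge loss amounts to a one-line config change (values illustrative; `loss_type` and `beta` are the arguments named above, and `"sigmoid"` is assumed to be the default):

```py
cpo_config = CPOConfig(
    beta=0.1,           # here: reciprocal of the hinge margin
    loss_type="hinge",  # alternatives discussed above: "sigmoid" (default), "ipo", "simpo"
)
```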
### For Mixture of Experts Models: Enabling the auxiliary loss

MOEs are most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.

This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).

## Logging

While training and evaluating we record the following reward metrics:

* `rewards/chosen`: the mean log probabilities of the policy model for the chosen responses scaled by beta
* `rewards/rejected`: the mean log probabilities of the policy model for the rejected responses scaled by beta
* `rewards/accuracies`: mean of how often the chosen rewards are > than the corresponding rejected rewards
* `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
* `nll_loss`: the mean negative log likelihood loss of the policy model for the chosen responses

## CPOTrainer

[[autodoc]] CPOTrainer

## CPOConfig

[[autodoc]] CPOConfig
@@ -1,6 +1,50 @@
# Training customization

-At `trl` we provide the possibility to give enough modularity to users to be able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques.
+TRL is designed with modularity in mind so that users are able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques.

## Train on multiple GPUs / nodes

The trainers in TRL use 🤗 Accelerate to enable distributed training across multiple GPUs or nodes. To do so, first create an 🤗 Accelerate config file by running

```bash
accelerate config
```

and answering the questions according to your multi-GPU / multi-node setup. You can then launch distributed training by running:

```bash
accelerate launch your_script.py
```

We also provide config files in the [examples folder](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs) that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.:

```shell
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```

Refer to the [examples page](https://github.com/huggingface/trl/tree/main/examples) for more details.

### Distributed training with DeepSpeed

All of the trainers in TRL can be run on multiple GPUs together with DeepSpeed ZeRO-{1,2,3} for efficient sharding of the optimizer states, gradients, and model weights. To do so, run:

```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{1,2,3}.yaml --num_processes {NUM_GPUS} path_to_your_script.py --all_arguments_of_the_script
```

Note that for ZeRO-3, a small tweak is needed to initialize your reward model on the correct device via the `zero3_init_context_manager()` context manager. In particular, this is needed to avoid DeepSpeed hanging after a fixed number of training steps. Here is a snippet of what is involved from the [`sentiment_tuning`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) example:

```python
ds_plugin = ppo_trainer.accelerator.state.deepspeed_plugin
if ds_plugin is not None and ds_plugin.is_zero3_init_enabled():
    with ds_plugin.zero3_init_context_manager(enable=False):
        sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)
else:
    sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)
```

Consult the 🤗 Accelerate [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more information about the DeepSpeed plugin.

## Use different optimizers
@@ -12,7 +56,7 @@ from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
-model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
+ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
@@ -25,7 +69,7 @@ optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)

# 3. initialize trainer
-ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
+ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```

For memory efficient fine-tuning, you can also pass the `Adam8bit` optimizer from `bitsandbytes`:

@@ -39,7 +83,7 @@ from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
-model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
+ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
@@ -51,19 +95,19 @@ config = PPOConfig(**ppo_config)
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=config.learning_rate)

# 3. initialize trainer
-ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
+ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```

### Use LION optimizer

-You can use the new [LION optimizer from Google](https://arxiv.org/abs/2302.06675) as well, first take the source code of the optimizer definition [here](https://github.com/lucidrains/lion-pytorch/blob/main/lion_pytorch/lion_pytorch.py), and copy it so that you can import the optimizer. Make sure to initialize the optimizer by considering the trainable parameters only for a more memory efficient training:
+You can use the new [LION optimizer from Google](https://huggingface.co/papers/2302.06675) as well; first take the source code of the optimizer definition [here](https://github.com/lucidrains/lion-pytorch/blob/main/lion_pytorch/lion_pytorch.py) and copy it so that you can import the optimizer. Make sure to initialize the optimizer by considering the trainable parameters only, for more memory-efficient training:
```python
optimizer = Lion(filter(lambda p: p.requires_grad, model.parameters()), lr=config.learning_rate)

...
-ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
+ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer)
```
-We advice you to use the learning rate that you would use for `Adam` divided by 3 as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke?workspace=user-younesbelkada)):
+We advise you to use the learning rate that you would use for `Adam` divided by 3, as pointed out [here](https://github.com/lucidrains/lion-pytorch#lion---pytorch). We observed an improvement when using this optimizer compared to classic Adam (check the full logs [here](https://wandb.ai/distill-bloom/trl/runs/lj4bheke?workspace=user-younesbelkada)):

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-lion.png">
@@ -80,7 +124,7 @@ from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
-model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
+ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
@@ -90,10 +134,10 @@ config = PPOConfig(**ppo_config)

# 2. Create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
-lr_scheduler = lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
+lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

# 3. initialize trainer
-ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)
+ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)
```

## Memory efficient fine-tuning by sharing layers

@@ -106,13 +150,13 @@ from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
-model_ref = create_reference_model(model, num_shared_layers=6)
+ref_model = create_reference_model(model, num_shared_layers=6)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
-ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
+ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
```

## Pass 8-bit reference models

@@ -134,11 +178,39 @@ from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
-model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m', device_map="auto", load_in_8bit=True)
+ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m', device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
-ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
+ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
```

## Use the CUDA cache optimizer

When training large models, it is best to handle the CUDA cache by iteratively clearing it. To do so, simply pass `optimize_cuda_cache=True` to `PPOConfig`:

```python
config = PPOConfig(..., optimize_cuda_cache=True)
```

## Use score scaling/normalization/clipping
As suggested by [Secrets of RLHF in Large Language Models Part I: PPO](https://huggingface.co/papers/2307.04964), we support score (aka reward) scaling/normalization/clipping to improve training stability via `PPOConfig`:
```python
from trl import PPOConfig

ppo_config = {
    "use_score_scaling": True,
    "use_score_norm": True,
    "score_clip": 0.5,
}
config = PPOConfig(**ppo_config)
```

To run `ppo.py`, you can use the following command:
```
python examples/scripts/ppo.py --log_with wandb --use_score_scaling --use_score_norm --score_clip 0.5
```
docs/source/data_utils.mdx (new file, 15 lines)
@@ -0,0 +1,15 @@
## Data Utilities

[[autodoc]] is_conversational

[[autodoc]] apply_chat_template

[[autodoc]] maybe_apply_chat_template

[[autodoc]] extract_prompt

[[autodoc]] maybe_extract_prompt

[[autodoc]] unpair_preference_dataset

[[autodoc]] maybe_unpair_preference_dataset
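As a quick illustration of how these utilities compose, here is a minimal sketch (assuming a paired preference dataset with implicit prompts; the toy data is illustrative):

```python
from datasets import Dataset
from trl import maybe_extract_prompt, unpair_preference_dataset

dataset = Dataset.from_dict({
    "chosen": ["The sky is blue.", "The sun is bright."],
    "rejected": ["The sky is green.", "The sun is dark."],
})

# Split the shared prefix of chosen/rejected out into an explicit "prompt" column...
dataset = dataset.map(maybe_extract_prompt)
# ...then flatten each pair into two (prompt, completion, label) rows.
dataset = unpair_preference_dataset(dataset)
```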
docs/source/dataset_formats.mdx (new file, 712 lines)
@@ -0,0 +1,712 @@
# Dataset formats

This guide provides an overview of the dataset formats supported by each trainer in TRL. Since conversational datasets are very common, we also provide a guide on how to use them, and how to convert them into a standard dataset format for TRL trainers.

## Overview of the dataset formats and types

The *format* of a dataset refers to how the data is structured, typically categorized as either *standard* or *conversational*. The *type* is associated with the specific task the dataset is designed for, such as *prompt-only* or *preference*. Each type is characterized by its columns, which vary according to the task, as shown in the table.
<table>
  <tr>
    <th>Type \ Format</th>
    <th>Standard</th>
    <th>Conversational</th>
  </tr>
  <tr>
    <td>Language modeling</td>
    <td>
      <pre><code>{"text": "The sky is blue."}</code></pre>
    </td>
    <td>
      <pre><code>{"messages": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is blue."}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Prompt-only</td>
    <td>
      <pre><code>{"prompt": "The sky is"}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Prompt-completion</td>
    <td>
      <pre><code>{"prompt": "The sky is",
 "completion": " blue."}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "completion": [{"role": "assistant", "content": "It is blue."}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Preference</td>
    <td>
      <pre><code>{"prompt": "The sky is",
 "chosen": " blue.",
 "rejected": " green."}</code></pre>
      or, with implicit prompt:
      <pre><code>{"chosen": "The sky is blue.",
 "rejected": "The sky is green."}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "chosen": [{"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "assistant", "content": "It is green."}]}</code></pre>
      or, with implicit prompt:
      <pre><code>{"chosen": [{"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is green."}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Unpaired preference</td>
    <td>
      <pre><code>{"prompt": "The sky is",
 "completion": " blue.",
 "label": True}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "completion": [{"role": "assistant", "content": "It is green."}],
 "label": False}</code></pre>
    </td>
  </tr>
</table>
### Standard dataset format

The standard dataset format typically consists of plain text strings. The columns in the dataset vary depending on the task. This is the format expected by TRL trainers. Below are examples of standard dataset formats for different tasks:

```python
# Language modeling
example = {"text": "The sky is blue."}
# Preference
example = {"chosen": "The sky is blue.", "rejected": "The sky is green."}
```

### Conversational dataset format

Conversational datasets are used for tasks involving dialogues or chat interactions between users and assistants. Unlike standard dataset formats, these contain sequences of messages where each message has a `role` (e.g., `"user"` or `"assistant"`) and `content` (the message text).

```python
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```

Just like standard datasets, the columns in conversational datasets vary depending on the task. For instance, a preference dataset would include columns like `"chosen"` and `"rejected"` to compare responses:

```python
example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}
```

Conversational datasets are useful for training chat models, but must be converted into a standard format before being used with TRL trainers. This is typically done using chat templates specific to the model being used. For more information, refer to the [Working with conversational datasets in TRL](#working-with-conversational-datasets-in-trl) section.
### Language modeling

A language modeling dataset consists of a column `"text"` (or `"messages"` for conversational datasets) containing a full sequence of text.

```python
language_modeling_example = {"text": "The sky is blue."}
```

### Prompt-only

In a prompt-only dataset, only the initial prompt (the question or partial sentence) is provided under the key `"prompt"`. The training typically involves generating the completion based on this prompt, where the model learns to continue or complete the given input.

```python
prompt_only_example = {"prompt": "The sky is"}
```

<Tip>

While both the prompt-only and language modeling formats are similar, they differ in how the input is handled. In the prompt-only format, the prompt represents a partial input that expects the model to complete or continue, while in the language modeling format, the input is treated as a complete sentence or sequence. These two formats are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each format:

```python
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# Example for prompt-only format
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(prompt_only_example, tokenizer)
# Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'}

# Example for language modeling format
lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(lm_example, tokenizer)
# Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'}
```

- The prompt-only output includes a `'<|assistant|>\n'`, indicating the beginning of the assistant's turn and expecting the model to generate a completion.
- In contrast, the language modeling output treats the input as a complete sequence and terminates it with `'<|endoftext|>'`, signaling the end of the text and not expecting any additional content.

</Tip>
### Prompt-completion

A prompt-completion dataset includes a `"prompt"` and a `"completion"`.

```python
prompt_completion_example = {"prompt": "The sky is", "completion": " blue."}
```

### Preference

A preference dataset is used for tasks where the model is trained to choose between two or more possible completions to the same prompt. This dataset includes a `"prompt"`, a `"chosen"` completion, and a `"rejected"` completion. The model is trained to select the `"chosen"` response over the `"rejected"` response.
Some datasets may not include the `"prompt"` column, in which case the prompt is implicit and directly included in the `"chosen"` and `"rejected"` completions. We recommend using explicit prompts whenever possible.

```python
preference_example = {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}  # recommended
# or, with implicit prompt:
preference_example = {"chosen": "The sky is blue.", "rejected": "The sky is green."}
```

### Unpaired preference

An unpaired preference dataset is similar to a preference dataset, but instead of having `"chosen"` and `"rejected"` completions for the same prompt, it includes a single `"completion"` and a `"label"` indicating whether the completion is preferred or not.

```python
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}
```
## Which dataset format to use?

Choosing the right dataset format depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset formats supported by each TRL trainer.

| Trainer                 | Expected dataset format      |
| ----------------------- | ---------------------------- |
| [`BCOTrainer`]          | Unpaired preference          |
| [`CPOTrainer`]          | Preference (explicit prompt) |
| [`DPOTrainer`]          | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference          |
| [`KTOTrainer`]          | Unpaired preference          |
| [`NashMDTrainer`]       | Prompt-only                  |
| [`OnlineDPOTrainer`]    | Prompt-only                  |
| [`ORPOTrainer`]         | Preference (explicit prompt) |
| [`PPOv2Trainer`]        | Tokenized language modeling  |
| [`RewardTrainer`]       | Preference (implicit prompt) |
| [`SFTTrainer`]          | Language modeling            |
| [`XPOTrainer`]          | Prompt-only                  |

<Tip>

TRL trainers only support standard dataset formats, [for now](https://github.com/huggingface/trl/issues/2071). If you have a conversational dataset, you must first convert it into a standard format.
For more information on how to work with conversational datasets, refer to the [Working with conversational datasets in TRL](#working-with-conversational-datasets-in-trl) section.

</Tip>
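In practice, you can use [`is_conversational`] to check whether an example still needs such a conversion; a minimal sketch (the tokenizer choice is illustrative):

```python
from transformers import AutoTokenizer
from trl import apply_chat_template, is_conversational

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}

# Convert to the standard (plain text) format only if the example is conversational
if is_conversational(example):
    example = apply_chat_template(example, tokenizer)
```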
## Working with conversational datasets in TRL

Conversational datasets are increasingly common, especially for training chat models. However, TRL trainers (except [`SFTTrainer`]) don't support conversational datasets in their raw format. These datasets must first be converted into a standard format.
Fortunately, TRL offers tools to easily handle this conversion, which are detailed below.

### Converting a conversational dataset into a standard dataset

TRL trainers do not support conversational datasets in their raw format. To use them, you need to convert them into a standard dataset format using a chat template. This template is provided by the tokenizer of the model you use.

For detailed instructions on using chat templating, refer to the [Chat templating section in the `transformers` documentation](https://huggingface.co/docs/transformers/en/chat_templating).

In TRL, the method you apply to convert the dataset will vary depending on the task. Fortunately, TRL provides a helper function called [`apply_chat_template`] to simplify this process. Here's an example of how to use it:

```python
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}]
}

apply_chat_template(example, tokenizer)
# Output:
# {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n<|endoftext|>'}
```

Alternatively, you can use the [`~datasets.Dataset.map`] method to apply the template across an entire dataset:

```python
from datasets import Dataset
from trl import apply_chat_template

dataset_dict = {
    "prompt": [[{"role": "user", "content": "What color is the sky?"}],
               [{"role": "user", "content": "Where is the sun?"}]],
    "completion": [[{"role": "assistant", "content": "It is blue."}],
                   [{"role": "assistant", "content": "In the sky."}]]
}

dataset = Dataset.from_dict(dataset_dict)
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
# Output:
# {'prompt': ['<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n',
#             '<|user|>\nWhere is the sun?<|end|>\n<|assistant|>\n'],
#  'completion': ['It is blue.<|end|>\n<|endoftext|>', 'In the sky.<|end|>\n<|endoftext|>']}
```

<Tip warning={true}>

We recommend using the [`apply_chat_template`] function rather than directly calling `tokenizer.apply_chat_template`. Handling chat templates for non-language-modeling datasets can be tricky and may lead to issues, such as inserting a system prompt in the middle of a conversation. For additional examples, see [#1930 (comment)](https://github.com/huggingface/trl/pull/1930#issuecomment-2292908614). The [`apply_chat_template`] function is designed to handle these intricacies and ensure the correct application of chat templates for various tasks.

</Tip>

<Tip warning={true}>

It's important to note that chat templates are model-specific. For example, if you use the chat template from [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with the above example, you get a different output:

```python
apply_chat_template(example, AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct"))
# Output:
# {'prompt': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n',
#  'completion': 'It is blue.<|im_end|>\n'}
```

Always use the chat template associated with the model you're working with. Using the wrong template can lead to inaccurate or unexpected results.

</Tip>
## Using any dataset with TRL: preprocessing and conversion

Many datasets come in formats tailored to specific tasks, which might not be directly compatible with TRL. To use such datasets with TRL, you may need to preprocess and convert them into the required format.

To make this easier, we provide a set of [example scripts](https://github.com/huggingface/trl/tree/main/examples/datasets) that cover common dataset conversions.

### Example: UltraFeedback dataset

Let's take the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback) as an example. Here's a preview of the dataset:

<iframe
  src="https://huggingface.co/datasets/openbmb/UltraFeedback/embed/viewer/default/train"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>

As shown above, the dataset format does not match the expected structure. It's not in a conversational format, the column names differ, and the results pertain to different models (e.g., Bard, GPT-4) and aspects (e.g., "helpfulness", "honesty").

By using the provided conversion script [`examples/datasets/ultrafeedback.py`](https://github.com/huggingface/trl/tree/main/examples/datasets/ultrafeedback.py), you can transform this dataset into an unpaired preference format, and push it to the Hub:

```sh
python examples/datasets/ultrafeedback.py --push_to_hub --repo_id trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness
```

Once converted, the dataset will look like this:

<iframe
  src="https://huggingface.co/datasets/trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness/embed/viewer/default/train?row=0"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>

Now, you can use this dataset with TRL!

By adapting the provided scripts or creating your own, you can convert any dataset into a format compatible with TRL.
## Utilities for converting dataset types

This section provides example code to help you convert between different dataset types. While some conversions can be performed after applying the chat template (i.e., in the standard format), we recommend performing the conversion before applying the chat template to ensure it works consistently.

For simplicity, some of the examples below do not follow this recommendation and use the standard format. However, the conversions can be applied directly to the conversational format without modification.

| From \ To | Language modeling | Prompt-completion | Prompt-only | Preference with implicit prompt | Preference | Unpaired preference |
| --- | --- | --- | --- | --- | --- | --- |
| Language modeling | N/A | N/A | N/A | N/A | N/A | N/A |
| Prompt-completion | [🔗](#from-prompt-completion-to-language-modeling-dataset) | N/A | [🔗](#from-prompt-completion-to-prompt-only-dataset) | N/A | N/A | N/A |
| Prompt-only | N/A | N/A | N/A | N/A | N/A | N/A |
| Preference with implicit prompt | [🔗](#from-preference-with-implicit-prompt-to-language-modeling-dataset) | [🔗](#from-preference-with-implicit-prompt-to-prompt-completion-dataset) | [🔗](#from-preference-with-implicit-prompt-to-prompt-only-dataset) | N/A | [🔗](#from-implicit-to-explicit-prompt-preference-dataset) | [🔗](#from-preference-with-implicit-prompt-to-unpaired-preference-dataset) |
| Preference | [🔗](#from-preference-to-language-modeling-dataset) | [🔗](#from-preference-to-prompt-completion-dataset) | [🔗](#from-preference-to-prompt-only-dataset) | [🔗](#from-explicit-to-implicit-prompt-preference-dataset) | N/A | [🔗](#from-preference-to-unpaired-preference-dataset) |
| Unpaired preference | [🔗](#from-unpaired-preference-to-language-modeling-dataset) | [🔗](#from-unpaired-preference-to-prompt-completion-dataset) | [🔗](#from-unpaired-preference-to-prompt-only-dataset) | N/A | N/A | N/A |

### From prompt-completion to language modeling dataset

To convert a prompt-completion dataset into a language modeling dataset, concatenate the prompt and the completion.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "completion": [" blue.", " in the sky."],
})

def concat_prompt_completion(example):
    return {"text": example["prompt"] + example["completion"]}

dataset = dataset.map(concat_prompt_completion, remove_columns=["prompt", "completion"])
```

```python
>>> dataset[0]
{'text': 'The sky is blue.'}
```

### From prompt-completion to prompt-only dataset

To convert a prompt-completion dataset into a prompt-only dataset, remove the completion.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "completion": [" blue.", " in the sky."],
})

dataset = dataset.remove_columns("completion")
```

```python
>>> dataset[0]
{'prompt': 'The sky is'}
```

### From preference with implicit prompt to language modeling dataset

To convert a preference with implicit prompt dataset into a language modeling dataset, remove the rejected, and rename the column `"chosen"` to `"text"`.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "chosen": ["The sky is blue.", "The sun is in the sky."],
    "rejected": ["The sky is green.", "The sun is in the sea."],
})

dataset = dataset.rename_column("chosen", "text").remove_columns("rejected")
```

```python
>>> dataset[0]
{'text': 'The sky is blue.'}
```

### From preference with implicit prompt to prompt-completion dataset

To convert a preference dataset with implicit prompt into a prompt-completion dataset, extract the prompt with [`extract_prompt`], remove the rejected, and rename the column `"chosen"` to `"completion"`.

```python
from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})
dataset = dataset.map(extract_prompt).remove_columns("rejected").rename_column("chosen", "completion")
```

```python
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}], 'completion': [{'role': 'assistant', 'content': 'It is blue.'}]}
```

### From preference with implicit prompt to prompt-only dataset

To convert a preference dataset with implicit prompt into a prompt-only dataset, extract the prompt with [`extract_prompt`], and remove the rejected and the chosen.

```python
from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})
dataset = dataset.map(extract_prompt).remove_columns(["chosen", "rejected"])
```

```python
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}]}
```

### From implicit to explicit prompt preference dataset

To convert a preference dataset with implicit prompt into a preference dataset with explicit prompt, extract the prompt with [`extract_prompt`].

```python
from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})

dataset = dataset.map(extract_prompt)
```

```python
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```

### From preference with implicit prompt to unpaired preference dataset

To convert a preference dataset with implicit prompt into an unpaired preference dataset, extract the prompt with [`extract_prompt`], and unpair the dataset with [`unpair_preference_dataset`].

```python
from datasets import Dataset
from trl import extract_prompt, unpair_preference_dataset

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})

dataset = dataset.map(extract_prompt)
dataset = unpair_preference_dataset(dataset)
```

```python
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'completion': [{'role': 'assistant', 'content': 'It is blue.'}],
 'label': True}
```

### From preference to language modeling dataset

To convert a preference dataset into a language modeling dataset, remove the rejected, and concatenate the prompt and the chosen into the `"text"` column.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
})

def concat_prompt_chosen(example):
    return {"text": example["prompt"] + example["chosen"]}

dataset = dataset.map(concat_prompt_chosen, remove_columns=["prompt", "chosen", "rejected"])
```

```python
>>> dataset[0]
{'text': 'The sky is blue.'}
```

### From preference to prompt-completion dataset

To convert a preference dataset into a prompt-completion dataset, remove the rejected, and rename the column `"chosen"` to `"completion"`.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
})

dataset = dataset.remove_columns("rejected").rename_column("chosen", "completion")
```

```python
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.'}
```

### From preference to prompt-only dataset

To convert a preference dataset into a prompt-only dataset, remove the rejected and the chosen.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
})

dataset = dataset.remove_columns(["chosen", "rejected"])
```

```python
>>> dataset[0]
{'prompt': 'The sky is'}
```

### From explicit to implicit prompt preference dataset

To convert a preference dataset with explicit prompt into a preference dataset with implicit prompt, concatenate the prompt to both chosen and rejected, and remove the prompt.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "What color is the sky?"}],
        [{"role": "user", "content": "Where is the sun?"}],
    ],
    "chosen": [
        [{"role": "assistant", "content": "It is blue."}],
        [{"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "assistant", "content": "It is green."}],
        [{"role": "assistant", "content": "In the sea."}],
    ],
})

def concat_prompt_to_completions(example):
    return {"chosen": example["prompt"] + example["chosen"], "rejected": example["prompt"] + example["rejected"]}

dataset = dataset.map(concat_prompt_to_completions, remove_columns="prompt")
```

```python
>>> dataset[0]
{'chosen': [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is green.'}]}
```

### From preference to unpaired preference dataset

To convert a preference dataset into an unpaired preference dataset, unpair the dataset with [`unpair_preference_dataset`].

```python
from datasets import Dataset
from trl import unpair_preference_dataset

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "What color is the sky?"}],
        [{"role": "user", "content": "Where is the sun?"}],
    ],
    "chosen": [
        [{"role": "assistant", "content": "It is blue."}],
        [{"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "assistant", "content": "It is green."}],
        [{"role": "assistant", "content": "In the sea."}],
    ],
})

dataset = unpair_preference_dataset(dataset)
```

```python
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'completion': [{'role': 'assistant', 'content': 'It is blue.'}],
 'label': True}
```

### From unpaired preference to language modeling dataset

To convert an unpaired preference dataset into a language modeling dataset, concatenate the prompt and the completion into the `"text"` column, and remove the prompt, completion and label columns.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is", "The sky is", "The sun is"],
    "completion": [" blue.", " in the sky.", " green.", " in the sea."],
    "label": [True, True, False, False],
})

def concatenate_prompt_completion(example):
    return {"text": example["prompt"] + example["completion"]}

dataset = dataset.map(concatenate_prompt_completion).remove_columns(["prompt", "completion", "label"])
```

```python
>>> dataset[0]
{'text': 'The sky is blue.'}
```

### From unpaired preference to prompt-completion dataset

To convert an unpaired preference dataset into a prompt-completion dataset, remove the label column.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is", "The sky is", "The sun is"],
    "completion": [" blue.", " in the sky.", " green.", " in the sea."],
    "label": [True, True, False, False],
})

dataset = dataset.remove_columns(["label"])
```

```python
>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.'}
```

### From unpaired preference to prompt-only dataset

To convert an unpaired preference dataset into a prompt-only dataset, remove the completion and the label columns.

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is", "The sky is", "The sun is"],
    "completion": [" blue.", " in the sky.", " green.", " in the sea."],
    "label": [True, True, False, False],
})

dataset = dataset.remove_columns(["completion", "label"])
```

```python
>>> dataset[0]
{'prompt': 'The sky is'}
```

docs/source/ddpo_trainer.mdx

# Denoising Diffusion Policy Optimization

## The why

| Before | After DDPO finetuning |
| --- | --- |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_squirrel.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_squirrel.png"/></div> |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_crab.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_crab.png"/></div> |
| <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pre_starfish.png"/></div> | <div style="text-align: center"><img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/post_starfish.png"/></div> |

## Getting started with Stable Diffusion finetuning with reinforcement learning

The machinery for finetuning Stable Diffusion models with reinforcement learning makes heavy use of HuggingFace's `diffusers` library. Getting started therefore requires a bit of familiarity with the `diffusers` library's concepts, mainly two of them: pipelines and schedulers.
Right out of the box, the `diffusers` library provides neither a `Pipeline` nor a `Scheduler` instance suitable for finetuning with reinforcement learning, so some adjustments need to be made.

This library provides a pipeline interface that must be implemented in order to be used with the `DDPOTrainer`, which is the main machinery for fine-tuning Stable Diffusion with reinforcement learning. **Note: Only the StableDiffusion architecture is supported at this point.**
There is a default implementation of this interface that you can use out of the box. Assuming the default implementation is sufficient and/or to get things moving, refer to the training example alongside this guide.

The point of the interface is to fuse the pipeline and the scheduler into one object, keeping all of the constraints in one place. The interface was designed with the hope of catering to pipelines and schedulers beyond the examples in this repository and elsewhere at the time of writing. The scheduler step is also a method of this pipeline interface; this may seem redundant given that the raw scheduler is accessible via the interface, but it is the only way to constrain the scheduler's step output to the output type befitting the algorithm at hand (DDPO).

For a more detailed look into the interface and the associated default implementation, see [here](https://github.com/lvwerra/trl/tree/main/trl/models/modeling_sd_base.py).

Note that the default implementation has a LoRA implementation path and a non-LoRA-based implementation path. The LoRA flag is enabled by default, and it can be turned off by passing in the flag to do so. LoRA-based training is faster, and the LoRA-associated hyperparameters responsible for model convergence aren't as finicky as in non-LoRA-based training.

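As a minimal sketch, opting in or out of LoRA through the default pipeline might look like this (this assumes a `use_lora` keyword on the default implementation; check the pipeline linked above for the exact signature):

```python
from trl import DefaultDDPOStableDiffusionPipeline

# LoRA is on by default; set use_lora=False for full finetuning
pipeline = DefaultDDPOStableDiffusionPipeline(
    "runwayml/stable-diffusion-v1-5",
    use_lora=True,
)
```
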
In addition, there is the expectation of providing a reward function and a prompt function. The reward function is used to evaluate the generated images, and the prompt function is used to generate the prompts that are used to generate the images.

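A sketch of what these two callables look like, with signatures inferred from the example script (the names and the constant reward are illustrative only; a real reward function would score the images):

```python
import torch

def prompt_fn():
    # Return a prompt and arbitrary metadata for it
    return "a photo of a squirrel", {}

def reward_fn(images, prompts, prompt_metadata):
    # Score each generated image; here a dummy constant reward
    # and empty reward metadata are returned
    return torch.ones(len(images)), {}
```
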
## Getting started with `examples/scripts/ddpo.py`

The `ddpo.py` script is a working example of using the `DDPO` trainer to finetune a Stable Diffusion model. This example explicitly configures a small subset of the overall parameters associated with the config object (`DDPOConfig`).

**Note:** one A100 GPU is recommended to get this running. Anything below an A100 will not be able to run this example script, and even if it does (with relatively smaller parameters), the results will most likely be poor.

Almost every configuration parameter has a default. There is only one command-line flag argument that is required of the user to get things up and running. The user is expected to have a [huggingface user access token](https://huggingface.co/docs/hub/security-tokens) that will be used to upload the model post-finetuning to HuggingFace hub. The following bash command gets things running:

```bash
python ddpo.py --hf_user_access_token <token>
```

To obtain the documentation of `ddpo.py`, please run `python ddpo.py --help`.

The following are things to keep in mind in general while configuring the trainer, beyond the use case of the example script (the code checks these for you as well); a configuration satisfying all three constraints is sketched after the list:

- The configurable sample batch size (`--ddpo_config.sample_batch_size=6`) should be greater than or equal to the configurable training batch size (`--ddpo_config.train_batch_size=3`)
- The configurable sample batch size (`--ddpo_config.sample_batch_size=6`) must be divisible by the configurable train batch size (`--ddpo_config.train_batch_size=3`)
- The configurable sample batch size (`--ddpo_config.sample_batch_size=6`) must be divisible by both the configurable gradient accumulation steps (`--ddpo_config.train_gradient_accumulation_steps=1`) and the configurable accelerator processes count

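For example, a `DDPOConfig` satisfying all three constraints on two accelerator processes might look like this (a sketch; the field names mirror the flags above):

```python
from trl import DDPOConfig

config = DDPOConfig(
    sample_batch_size=6,  # >= train_batch_size and divisible by it
    train_batch_size=3,
    train_gradient_accumulation_steps=1,  # 6 % (1 * 2 processes) == 0
)
```
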
## Setting up the image logging hook function

Expect the function to be given a list of lists of the form

```python
[[image, prompt, prompt_metadata, rewards, reward_metadata], ...]
```

and `image`, `prompt`, `prompt_metadata`, `rewards`, `reward_metadata` are batched.
The last list in the list of lists represents the last sample batch, which is the one you are most likely to want to log.
While you are free to log however you want, the use of `wandb` or `tensorboard` is recommended.

### Key terms

- `rewards` : The reward/score is a numerical value associated with the generated image and is key to steering the RL process
- `reward_metadata` : The reward metadata is the metadata associated with the reward. Think of this as extra information payload delivered alongside the reward
- `prompt` : The prompt is the text that is used to generate the image
- `prompt_metadata` : The prompt metadata is the metadata associated with the prompt. A situation where this will not be empty is when the reward model comprises a [`FLAVA`](https://huggingface.co/docs/transformers/model_doc/flava) setup where questions and ground answers (linked to the generated image) are expected with the generated image (see here: https://github.com/kvablack/ddpo-pytorch/blob/main/ddpo_pytorch/rewards.py#L45)
- `image` : The image generated by the Stable Diffusion model

Example code for logging sampled images with `wandb` is given below.

```python
# for logging these images to wandb
import numpy as np
from PIL import Image

def image_outputs_hook(image_data, global_step, accelerate_logger):
    # For the sake of this example, we only care about the last batch,
    # hence we extract the last element of the list
    result = {}
    images, prompts, _, rewards, _ = image_data[-1]
    for i, image in enumerate(images):
        pil = Image.fromarray(
            (image.cpu().numpy().transpose(1, 2, 0) * 255).astype(np.uint8)
        )
        pil = pil.resize((256, 256))
        result[f"{prompts[i]:.25} | {rewards[i]:.2f}"] = [pil]
    accelerate_logger.log_images(
        result,
        step=global_step,
    )
```

### Using the finetuned model

Assuming you're done with all the epochs and have pushed your model up to the hub, you can use the finetuned model as follows:

```python
import torch
from trl import DefaultDDPOStableDiffusionPipeline

pipeline = DefaultDDPOStableDiffusionPipeline("metric-space/ddpo-finetuned-sd-model")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# memory optimization
pipeline.vae.to(device, torch.float16)
pipeline.text_encoder.to(device, torch.float16)
pipeline.unet.to(device, torch.float16)

prompts = ["squirrel", "crab", "starfish", "whale", "sponge", "plankton"]
results = pipeline(prompts)

for prompt, image in zip(prompts, results.images):
    image.save(f"{prompt}.png")
```

## Credits

This work is heavily influenced by the repo [here](https://github.com/kvablack/ddpo-pytorch) and the associated paper [Training Diffusion Models with Reinforcement Learning by Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine](https://huggingface.co/papers/2305.13301).

## DDPOTrainer

[[autodoc]] DDPOTrainer

## DDPOConfig

[[autodoc]] DDPOConfig

# Detoxifying a Language Model using PPO

Language models (LMs) are known to sometimes generate toxic outputs. In this example, we will show how to "detoxify" a LM by feeding it toxic prompts and then using [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl/index) and Proximal Policy Optimization (PPO) to "detoxify" it.

Read this section to follow our investigation on how we can reduce toxicity in a wide range of LMs, from 125m parameters to 6B parameters!

Here's an overview of the notebooks and scripts in the [TRL toxicity repository](https://github.com/huggingface/trl/tree/main/examples/toxicity/scripts) as well as the link for the interactive demo:

| File | Description | Colab link |
|---|---| --- |
| [`gpt-j-6b-toxicity.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/toxicity/scripts/gpt-j-6b-toxicity.py) | Detoxify `GPT-J-6B` using PPO | x |
| [`evaluate-toxicity.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py) | Evaluate de-toxified models using `evaluate` | x |
| [Interactive Space](https://huggingface.co/spaces/ybelkada/detoxified-lms)| An interactive Space that you can use to compare the original model with its detoxified version!| x |

## Context

### Selection of models

We selected the following models for our experiments to show that TRL can be easily scaled to 10B-parameter models:

* [`EleutherAI/gpt-neo-125M`](https://huggingface.co/EleutherAI/gpt-neo-125M) (125 million parameters)
* [`EleutherAI/gpt-neo-2.7B`](https://huggingface.co/EleutherAI/gpt-neo-2.7B) (2.7 billion parameters)

We want to increase the chance for the model to generate toxic prompts so we get more learning signal. For this reason, we pre-process the dataset to consider only prompts with a toxicity score greater than a threshold. We can do this in a few lines of code:

```python
from datasets import load_dataset

train_dataset = load_dataset("allenai/real-toxicity-prompts", split="train")

def filter_fn(sample):
    toxicity = sample["prompt"]["toxicity"]
    return toxicity is not None and toxicity > 0.3

train_dataset = train_dataset.filter(filter_fn, batched=False)
```

### Reward function

We report the toxicity score of 400 sampled examples and compute its mean and standard deviation:

| Model | Mean toxicity score | Std toxicity score |
| --- | --- | --- |
| `EleutherAI/gpt-neo-125m` | 0.1627 | 0.2997 |
| `ybelkada/gpt-neo-125m-detox` | **0.1148** | **0.2506** |
| --- | --- | --- |
| `EleutherAI/gpt-neo-2.7B` | 0.1884 | 0.3178 |
| `ybelkada/gpt-neo-2.7B-detox` | **0.0916** | **0.2104** |
| --- | --- | --- |
| `EleutherAI/gpt-j-6B` | 0.1699 | 0.3033 |

Below are a few generation examples of the `gpt-j-6b-detox` model:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-toxicity-examples.png">
</div>

The evaluation script can be found [here](https://github.com/huggingface/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py).

### Discussions

The results are quite promising, as we can see that the models are able to reduce the toxicity score of the generated text by an interesting margin. The gap is clear for the `gpt-neo-2B` model but less so for the `gpt-j-6B` model. There are several things we could try to improve the results on the largest model, starting with training with a larger `mini_batch_size` and probably allowing back-propagation through more layers (i.e., using fewer shared layers).

We also think we could have trained the models using a "more toxic" dataset, as the one we used is much cleaner than the dataset we used for testing our models (from our observation).

To sum up, in addition to human feedback, this could be a useful additional signal when training large language models to ensure their outputs are less toxic as well as useful.

### Limitations

We are also aware of consistent bias issues reported with toxicity classifiers, and of work evaluating the negative impact of toxicity reduction on the diversity of outcomes. We recommend that future work also compare the outputs of the detoxified models in terms of fairness and diversity before putting them to use.

## What is next?

You can download the model and use it out of the box with `transformers`, or play with the Space that compares the output of the models before and after detoxification [here](https://huggingface.co/spaces/ybelkada/detoxified-lms).

docs/source/dpo_trainer.mdx

# DPO Trainer

TRL supports the DPO Trainer for training language models from preference data, as described in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290) by Rafailov et al., 2023. For a full example have a look at [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py).

The first step, as always, is to train your SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.

## How DPO works

Fine-tuning a language model via DPO consists of two steps and is easier than PPO:

1. **Data collection**: Gather a preference dataset with pairs of chosen and rejected generations, given a prompt.
2. **Optimization**: Maximize the log-likelihood of the DPO loss directly.

DPO-compatible datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots/direct-preference-optimization-datasets](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) Collection to identify datasets that are likely to support DPO training.

This process is illustrated in the sketch below (from [figure 1 of the original paper](https://huggingface.co/papers/2305.18290)):

<img width="835" alt="Screenshot 2024-03-19 at 12 39 41" src="https://github.com/huggingface/trl/assets/49240599/9150fac6-3d88-4ca2-8ec6-2a6f3473216d">

Read more about the DPO algorithm in the [original paper](https://huggingface.co/papers/2305.18290).

## Expected dataset format

The DPO trainer expects a very specific format for the dataset, since the model will be trained to directly optimize the preference for which of two sentences is the most relevant, given a prompt. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/rlhf-antropic-example.png" width="50%">
</div>

Therefore the final dataset object should contain these 3 entries if you use the default [`DPODataCollatorWithPadding`] data collator. The entries should be named:

- `prompt`
- `chosen`
- `rejected`

for example:

```py
dpo_dataset_dict = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}
```

where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses, and `rejected` contains the corresponding negative (rejected) responses. As can be seen, a prompt can have multiple responses, and this is reflected in the entries being repeated in the dictionary's value arrays.

[`DPOTrainer`] can be used to fine-tune visual language models (VLMs). In this case, the dataset must also contain the key `images`, and the trainer's `tokenizer` is the VLM's `processor`. For example, for Idefics2, the processor expects the dataset to have the following format:

Note: Currently, VLM support is exclusive to Idefics2 and does not extend to other VLMs.

```py
dpo_dataset_dict = {
    'images': [
        [Image.open('beach.jpg')],
        [Image.open('street.jpg')],
    ],
    'prompt': [
        'The image <image> shows',
        '<image> The image depicts',
    ],
    'chosen': [
        'a sunny beach with palm trees.',
        'a busy street with several cars and buildings.',
    ],
    'rejected': [
        'a snowy mountain with skiers.',
        'a calm countryside with green fields.',
    ],
}
```

## Expected model format

The DPO trainer expects a model of `AutoModelForCausalLM` or `AutoModelForVision2Seq`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function.

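For example, loading a causal LM policy, a reference copy, and its tokenizer might look like this (a sketch; the checkpoint name is just an illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
ref_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
```
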
## Using the `DPOTrainer`

For a detailed example have a look at the `examples/scripts/dpo.py` script. At a high level we need to initialize the [`DPOTrainer`] with a `model` we wish to train and a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected responses; the `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e., decoder-only or encoder-decoder).

```py
training_args = DPOConfig(
    beta=0.1,
)
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # for visual language models, use tokenizer=processor instead
)
```

After this one can then call:

```py
dpo_trainer.train()
```

Note that the `beta` is the temperature parameter for the DPO loss, typically something in the range of `0.1` to `0.5`. We ignore the reference model as `beta` -> 0.

## Loss functions

Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. To use this loss, set the `loss_type="sigmoid"` (default) in the [`DPOConfig`].

The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. To use this loss, set the `loss_type="hinge"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the margin.

The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms, identify an issue with overfitting, and propose an alternative loss. To use this loss, set the `loss_type="ipo"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta` the larger this gap is. As per the paper, the loss is averaged over log-likelihoods of the completion (unlike DPO, which is summed only).

The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0).

The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO, which corresponds to forward KL. To use this loss, set the `loss_type="exo_pair"` in the [`DPOConfig`]. Setting non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large.

The [NCA](https://huggingface.co/papers/2402.05369) authors show that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. To use this loss, set the `loss_type="nca_pair"` in the [`DPOConfig`].

The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, it assumes that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0) and set the `loss_type="robust"` in the [`DPOConfig`].

The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. To use this loss, set the `loss_type="bco_pair"` in the [`DPOConfig`].

The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback, set `sync_ref_model=True` in the [`DPOConfig`].

The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss, set the `rpo_alpha` in the [`DPOConfig`] to an appropriate value. The paper suggests setting this weight to 1.0.

The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2, and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. To use this loss, set the `loss_type="sppo_hard"` in the [`DPOConfig`].

The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size.

The [APO](https://huggingface.co/papers/2408.06266) method introduces an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set `loss_type="apo_zero"` or `loss_type="apo_down"` in the [`DPOConfig`].

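All of these variants are selected the same way, through the [`DPOConfig`]. As a minimal sketch, switching from the default sigmoid loss to IPO looks like this:

```python
from trl import DPOConfig

training_args = DPOConfig(
    beta=0.1,         # for IPO, the reciprocal of the target log-likelihood-ratio gap
    loss_type="ipo",  # any of the loss types described above
)
```
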
### For Mixture of Experts Models: Enabling the auxiliary loss

MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.

This option is enabled by setting `output_router_logits=True` in the model config (e.g. MixtralConfig).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).

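As a sketch, for a Mixtral policy this amounts to flipping two attributes on the model config before training (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
model.config.output_router_logits = True   # add the load-balancing auxiliary loss
model.config.router_aux_loss_coef = 0.001  # weight of the auxiliary loss in the total loss
```
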
## Logging

While training and evaluating we record the following reward metrics:

- `rewards/chosen`: the mean difference between the log probabilities of the policy model and the reference model for the chosen responses scaled by beta
- `rewards/rejected`: the mean difference between the log probabilities of the policy model and the reference model for the rejected responses scaled by beta
- `rewards/accuracies`: mean of how often the chosen rewards are greater than the corresponding rejected rewards
- `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards

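In tensor form, the definitions above relate as follows (a runnable sketch with dummy per-example sequence log-probs, not the trainer's literal code):

```python
import torch

beta = 0.1
# per-example sequence log-probs under the policy and the (frozen) reference
policy_chosen_logps = torch.tensor([-12.3, -8.1])
policy_rejected_logps = torch.tensor([-14.0, -9.5])
ref_chosen_logps = torch.tensor([-13.0, -8.0])
ref_rejected_logps = torch.tensor([-13.5, -9.0])

chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
margins = (chosen_rewards - rejected_rewards).mean()
accuracies = (chosen_rewards > rejected_rewards).float().mean()
```
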
## Accelerate DPO fine-tuning using `unsloth`

You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) using the [`unsloth`](https://github.com/unslothai/unsloth) library that is fully compatible with `SFTTrainer`. Currently `unsloth` supports only Llama (Yi, TinyLlama, Qwen, Deepseek etc) and Mistral architectures. Some benchmarks for DPO listed below:

| GPU | Model | Dataset | 🤗 | 🤗 + Flash Attention 2 | 🦥 Unsloth | 🦥 VRAM saved |
| -------- | --------- | ---------- | --- | ---------------------- | ---------- | ------------- |
| A100 40G | Zephyr 7b | Ultra Chat | 1x | 1.24x | **1.88x** | -11.6% |
| Tesla T4 | Zephyr 7b | Ultra Chat | 1x | 1.09x | **1.55x** | -18.6% |

First install `unsloth` according to the [official documentation](https://github.com/unslothai/unsloth). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading `AutoModelForCausalLM`, you just need to load a `FastLanguageModel` as follows:

```python
import torch
from trl import DPOConfig, DPOTrainer
from unsloth import FastLanguageModel

max_seq_length = 2048  # Supports automatic RoPE Scaling, so choose any number.

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/zephyr-sft",
    max_seq_length=max_seq_length,
    dtype=None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,  # Use 4bit quantization to reduce memory usage. Can be False.
    # token="hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Dropout = 0 is currently optimized
    bias="none",  # Bias = "none" is currently optimized
    use_gradient_checkpointing=True,
    random_state=3407,
)

training_args = DPOConfig(
    output_dir="./output",
    beta=0.1,
)

dpo_trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
```

The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth).

## Reference model considerations with PEFT

You have three main options (plus several variants) for how the reference model works when using PEFT, assuming the model that you would like to further enhance with DPO was tuned using (Q)LoRA.

1. Simply create two instances of the model, each loading your adapter - works fine but is very inefficient.
2. Merge the adapter into the base model, create another adapter on top, then leave the `ref_model` param null, in which case DPOTrainer will unload the adapter for reference inference - efficient, but has potential downsides discussed below.
3. Load the adapter twice with different names, then use `set_adapter` during training to swap between the adapter being DPO'd and the reference adapter - slightly less efficient compared to 2 (~adapter size VRAM overhead), but avoids the pitfalls.

### Downsides to merging QLoRA before DPO (approach 2)

As suggested by [Benjamin Marie](https://medium.com/@bnjmn_marie/dont-merge-your-lora-adapter-into-a-4-bit-llm-65b6da287997), the best option for merging QLoRA adapters is to first dequantize the base model, then merge the adapter. Something similar to [this script](https://github.com/jondurbin/qlora/blob/main/qmerge.py).

However, after using this approach, you will have an unquantized base model. Therefore, to use QLoRA for DPO, you will need to re-quantize the merged model or use the unquantized merge (resulting in higher memory demand).

### Using option 3 - load the adapter twice

To avoid the downsides with option 2, you can load your fine-tuned adapter into the model twice, with different names, and set the model/ref adapter names in [`DPOTrainer`].

For example:

```python
# Load the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/mixtral-8x7b-v0.1",
    quantization_config=bnb_config,  # 4-bit loading is already set in the quantization config
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.config.use_cache = False

# Load the adapter.
model = PeftModel.from_pretrained(
    model,
    "/path/to/peft",
    is_trainable=True,
    adapter_name="train",
)
# Load the adapter a second time, with a different name, which will be our reference model.
model.load_adapter("/path/to/peft", adapter_name="reference")

# Initialize the trainer, without a ref_model param.
training_args = DPOConfig(
    model_adapter_name="train",
    ref_adapter_name="reference",
)
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    ...
)
```

## DPOTrainer

[[autodoc]] DPOTrainer

## DPOConfig

[[autodoc]] DPOConfig

docs/source/example_overview.md

# Examples

## Introduction

The examples should work in any of the following settings (with the same script):
- single GPU
- multi GPUs (using PyTorch distributed mode)
- multi GPUs (using DeepSpeed ZeRO-Offload stages 1, 2, & 3)
- fp16 (mixed-precision), fp32 (normal precision), or bf16 (bfloat16 precision)

To run it in each of these various modes, first initialize the accelerate configuration with `accelerate config`.

**NOTE:** to train with a 4-bit or 8-bit model, please run

```bash
pip install --upgrade trl[quantization]
```

## Accelerate Config

For all the examples, you'll need to generate a 🤗 Accelerate config file with:

```shell
accelerate config # will prompt you to define the training configuration
```

Then, it is encouraged to launch jobs with `accelerate launch`!

# Maintained Examples

| File | Description |
|
||||
| ----------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| [`examples/scripts/alignprop.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/alignprop.py) | This script shows how to use the [`AlignPropTrainer`] to fine-tune a diffusion model. |
|
||||
| [`examples/scripts/bco.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/bco.py) | This script shows how to use the [`KTOTrainer`] with the BCO loss to fine-tune a model to increase instruction-following, truthfulness, honesty and helpfulness using the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset. |
| [`examples/scripts/chat.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/chat.py) | This script allows you to load and use a model as a chatbot. |
| [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/ddpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ddpo.py) | This script shows how to use the [`DDPOTrainer`] to fine-tune a stable diffusion model using reinforcement learning. |
| [`examples/scripts/dpo_visual.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_visual.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a Vision Language Model to reduce hallucinations using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) dataset. |
| [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py) | This script shows how to use the [`KTOTrainer`] to fine-tune a model. |
| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
| [`examples/scripts/ppo_multi_adapter.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo_multi_adapter.py) | This script shows how to use the [`PPOTrainer`] to train a single base model with multiple adapters. Requires you to run the reward model training example script beforehand. |
| [`examples/scripts/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a sentiment analysis model using the [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb). |
| [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py) | This script shows how to use the [`RewardTrainer`] to train a reward model on your own dataset. |
| [`examples/scripts/sft.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a model or adapters on a target dataset. |
| [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Vision Language Model in a chat setting. The script has only been tested on a [LLaVA 1.5](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model, so users may see unexpected behaviour with other model architectures. |
Here are also some easier-to-run Colab notebooks that you can use to get started with TRL:

| File | Description |
| --- | --- |
| [`examples/notebooks/best_of_n.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/best_of_n.ipynb) | This notebook demonstrates how to use the "Best of N" sampling strategy using TRL when fine-tuning your model with PPO. |
| [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb) | This notebook demonstrates how to reproduce the GPT2 IMDB sentiment tuning example in a Jupyter notebook. |
| [`examples/notebooks/gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-control.ipynb) | This notebook demonstrates how to reproduce the GPT2 sentiment control example in a Jupyter notebook. |

We also have some other examples that are less maintained but can be used as a reference:

1. **[research_projects](https://github.com/huggingface/trl/tree/main/examples/research_projects)**: Check out this folder to find the scripts used for some research projects that used TRL (LM de-toxification, Stack-Llama, etc.)

## Distributed training

All of the scripts can be run on multiple GPUs by providing the path of an 🤗 Accelerate config file when calling `accelerate launch`. To launch one of them on one or multiple GPUs, run the following command (swapping `{NUM_GPUS}` with the number of GPUs in your machine and `--all_arguments_of_the_script` with your arguments):

```shell
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```

You can also adjust the parameters of the 🤗 Accelerate config file to suit your needs (e.g. training in mixed precision).

### Distributed training with DeepSpeed

Most of the scripts can be run on multiple GPUs together with DeepSpeed ZeRO-{1,2,3} for efficient sharding of the optimizer states, gradients, and model weights. To do so, run the following command (swapping `{NUM_GPUS}` with the number of GPUs in your machine, `--all_arguments_of_the_script` with your arguments, and the `--config_file` argument with the path to a DeepSpeed config file, such as `examples/accelerate_configs/deepspeed_zero1.yaml`):

```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{1,2,3}.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
```
docs/source/gkd_trainer.md (new file)
# Generalized Knowledge Distillation Trainer

## Overview

Generalized Knowledge Distillation (GKD) was proposed in [On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes](https://huggingface.co/papers/2306.13649) by Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem.

The abstract from the paper is the following:

> Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and task-agnostic distillation for instruction-tuning.

The key aspects of GKD are:

1. It addresses the train-inference distribution mismatch in auto-regressive sequence models by training the student model on its self-generated output sequences.
2. GKD allows flexibility in choosing different divergence measures between student and teacher models via the generalized Jensen-Shannon Divergence (JSD), which can be useful when the student lacks the capacity to fully mimic the teacher.

This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif) and [Lewis Tunstall](https://huggingface.co/lewtun).

## Usage tips

The GKD Trainer is a wrapper around the [`SFTTrainer`] class that takes in a teacher model argument. It needs two parameters to be set via the [`GKDConfig`], namely:

* `lmbda`: controls the student data fraction, i.e., the proportion of on-policy student-generated outputs. When `lmbda=0.0`, the loss reduces to supervised JSD, where the student is trained with the token-level probabilities of the teacher. When `lmbda=1.0`, the loss reduces to on-policy JSD, where the student generates output sequences and receives token-specific feedback on these sequences from the teacher. For values in between [0, 1], the choice between the two is made at random for each batch with probability given by `lmbda`.
* `beta`: controls the interpolation in the generalized Jensen-Shannon Divergence. When `beta=0.0` the loss approximates forward KL divergence, while for `beta=1.0` the loss approximates reverse KL divergence. For values in between [0, 1] it interpolates between the two.

The authors find that on-policy data (high `lmbda`) performs better, and that the optimal `beta` varies depending on the task and evaluation method; a minimal configuration sketch is shown below.
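
For illustration, a run biased toward on-policy student data with a symmetric generalized JSD could be configured as follows (a minimal sketch; the `lmbda` and `beta` values are illustrative, not recommendations):

```python
from trl import GKDConfig

# Illustrative hyperparameters: 70% of batches use on-policy student
# generations (lmbda=0.7) and the divergence is a symmetric JSD (beta=0.5).
args = GKDConfig(
    output_dir="gkd-model",
    lmbda=0.7,
    beta=0.5,
)
```
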
> [!WARNING]
> Make sure that `attn_implementation="flash_attention_2"` when training [Gemma models](https://huggingface.co/models?other=gemma2). Otherwise you will encounter NaNs in the logits due to the [soft capping technique](https://huggingface.co/blog/gemma2#soft-capping-and-attention-implementations) adopted by this architecture.

The basic API is as follows:
```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import GKDConfig, GKDTrainer

NUM_DUMMY_SAMPLES = 100

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# The model to optimise
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# The teacher model to calculate the KL divergence against
teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")

train_dataset = Dataset.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "Hi, how are you?"},
                {"role": "assistant", "content": "I'm great thanks"},
            ]
        ]
        * NUM_DUMMY_SAMPLES
    }
)
eval_dataset = Dataset.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "What colour is the sky?"},
                {"role": "assistant", "content": "The sky is blue"},
            ]
        ]
        * NUM_DUMMY_SAMPLES
    }
)

args = GKDConfig(output_dir="gkd-model", per_device_train_batch_size=1)
trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```

### Expected dataset format

The dataset should be formatted as a list of "messages" where each message is a list of dictionaries with the following keys:

* `role`: either `system`, `assistant` or `user`
* `content`: the message content

## GKDTrainer

[[autodoc]] GKDTrainer

## GKDConfig

[[autodoc]] GKDConfig
docs/source/how_to_train.md (new file)
# Training FAQ

## What Metrics Should I Look at?

When performing classical supervised fine-tuning of language models, the loss (especially the validation loss) serves as a good indicator of the training progress. However, in Reinforcement Learning (RL), the loss becomes less informative about the model's performance, and its value may fluctuate while the actual performance improves.

To address this, we recommend focusing on two key metrics first:

**Mean Reward**: The primary goal is to maximize the reward achieved by the model during RL training.
**Objective KL Divergence**: KL divergence (Kullback-Leibler divergence) measures the dissimilarity between two probability distributions. In the context of RL training, we use it to quantify the difference between the current model and a reference model. Ideally, we want to keep the KL divergence between 0 and 10 to ensure the model's generated text remains close to what the reference model produces.

However, there are more metrics that can be useful for debugging; check out the [logging section](logging).

## Why Do We Use a Reference Model, and What's the Purpose of KL Divergence?

When training RL models, optimizing solely for reward may lead to unexpected behaviors, where the model exploits the environment in ways that don't align with good language generation. In the case of RLHF, we use a reward model trained to predict whether a generated text is highly ranked by humans.

However, the RL model being optimized against the reward model may learn patterns that yield high reward but do not represent good language. This can result in extreme cases where the model generates texts with excessive exclamation marks or emojis to maximize the reward. In some worst-case scenarios, the model may generate patterns completely unrelated to natural language yet receive high rewards, similar to adversarial attacks.

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/kl-example.png">
<p style="text-align: center;"> <b>Figure:</b> Samples without a KL penalty from <a href="https://huggingface.co/papers/1909.08593">https://huggingface.co/papers/1909.08593</a>. </p>
</div>

To address this issue, we add a penalty to the reward function based on the KL divergence between the current model and the reference model. By doing this, we encourage the model to stay close to what the reference model generates.

## What Is the Concern with Negative KL Divergence?

If you generate text by purely sampling from the model distribution, things work fine in general. But when you use the `generate` method there are a few caveats, because it does not always purely sample depending on the settings, which can cause the KL divergence to go negative. Essentially, when the active model achieves `log_p_token_active < log_p_token_ref` we get negative KL divergence. This can happen in several cases:

- **top-k sampling**: the model can smooth out the probability distribution, causing the top-k tokens to have a smaller probability than those of the reference model while still being selected
- **min_length**: this ignores the EOS token until `min_length` is reached, so the model can assign a very low log prob to the EOS token and very high probs to all other tokens until `min_length` is reached

These are just a few examples. Why is negative KL an issue? The total reward `R` is computed as `R = r - beta * KL`, so if the model can learn how to drive the KL divergence negative, it effectively gets a positive reward. In many cases it can be much easier to exploit such a bug in the generation than to actually learn the reward function. In addition, the KL can become arbitrarily small, so the actual reward can be very small compared to it.
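
A toy numeric illustration of the exploit (all log-probabilities below are made-up values, and `beta` is illustrative; this is not TRL code):

```python
# Hypothetical per-token log-probs under the active and reference models.
log_p_active = [-2.3, -0.1, -4.0]
log_p_ref = [-1.0, -0.5, -0.2]

# Per-token KL contribution: log p_active - log p_ref. Negative entries mean
# the reference assigns higher probability to the sampled token.
kl_per_token = [a - r for a, r in zip(log_p_active, log_p_ref)]  # [-1.3, 0.4, -3.8]

beta = 0.1
reward_model_score = 0.0  # even with zero score from the reward model...
total_reward = reward_model_score - beta * sum(kl_per_token)
print(round(total_reward, 2))  # 0.47 > 0: driving KL negative yields "free" reward
```
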
So how should you generate text for PPO training? Let's have a look!

## How to generate text for training?

In order to avoid the KL issues described above, we recommend using the following settings:

```python
generation_kwargs = {
    "min_length": -1,  # don't ignore the EOS token (see above)
    "top_k": 0.0,  # no top-k sampling
    "top_p": 1.0,  # no nucleus sampling
    "do_sample": True,  # yes, we want to sample
    "pad_token_id": tokenizer.eos_token_id,  # most decoder models don't have a padding token - use EOS token instead
    "max_new_tokens": 32,  # specify how many tokens you want to generate at most
}
```

With these settings we usually don't encounter any issues. You can also experiment with other settings, but if you encounter issues with negative KL divergence, try going back to these and see if the issues persist.
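
As a usage sketch (assuming a `ppo_trainer`, `tokenizer`, and an encoded `query_tensor` exist in your script; the names are illustrative), these kwargs can be passed straight to the trainer's generation call:

```python
# Generate a response for an encoded query with the recommended settings.
response_tensor = ppo_trainer.generate(query_tensor, **generation_kwargs)
response_text = tokenizer.decode(response_tensor[0])
```
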
## How to debug your own use-case?

Debugging the RL pipeline can be challenging due to its complexity. Here are some tips and suggestions to make the process easier:

- **Start from a working example**: Begin with a working example from the trl repository and gradually modify it to fit your specific use-case. Changing everything at once can make it difficult to identify the source of potential issues. For example, you can start by replacing the model in the example and, once you figure out the best hyperparameters, try switching to your dataset and reward model. If you change everything at once you won't know where a potential problem comes from.
- **Start small, scale later**: Training large models can be very slow and take several hours or days until you see any improvement. For debugging this is not a convenient timescale, so try to use small model variants during the development phase and scale up once that works. That being said, you sometimes have to be careful, as small models might not have the capacity to solve a complicated task either.
- **Start simple**: Try to start with a minimal example and build complexity from there. Your use-case might require, for example, a complicated reward function consisting of many different rewards - try to use one signal first, see if you can optimize that, and then add more complexity after that.
- **Inspect the generations**: It's always a good idea to inspect what the model is generating. Maybe there is a bug in your post-processing or your prompt. Due to bad settings you might cut off generations too soon. These things are very hard to see in the metrics but very obvious if you look at the generations.
- **Inspect the reward model**: If your reward is not improving over time, maybe there's an issue with the reward model. You can look at extreme cases to see if it does what it should: e.g. in the sentiment case you can check if simple positive and negative examples really get different rewards. And you can look at the distribution of your dataset. Finally, maybe the reward is dominated by the query, which the model can't affect, so you might need to normalize it (e.g. reward of query+response minus reward of the query), as in the sketch after this list.
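
A minimal sketch of that normalization, assuming a hypothetical `reward_fn` that scores a string:

```python
def normalized_reward(query: str, response: str) -> float:
    # Subtract the query-only reward so the signal reflects the response,
    # not the part of the input the model cannot affect.
    return reward_fn(query + response) - reward_fn(query)
```
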
These are just a few tips that we find helpful - if you have more useful tricks feel free to open a PR to add them as well!
@ -4,6 +4,62 @@
# TRL - Transformer Reinforcement Learning

TRL is a full stack library where we provide a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT) and Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step.
The library is integrated with 🤗 [transformers](https://github.com/huggingface/transformers).

TRL supports decoder models such as GPT-2, BLOOM, and GPT-Neo, which can all be optimized using Proximal Policy Optimization (PPO). You can find installation instructions in the [installation guide](installation) and an introduction to the library in the [Quickstart section](quickstart). There is also a more [in-depth example](sentiment_tuning) of tuning GPT-2 to produce positive movie reviews.
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png">
</div>

Check the appropriate sections of the documentation depending on your needs:

## API documentation

- [Model Classes](models): *A brief overview of what each public model class does.*
- [`SFTTrainer`](sft_trainer): *Supervised fine-tuning of your model made easy with `SFTTrainer`.*
- [`RewardTrainer`](reward_trainer): *Easily train your reward model using `RewardTrainer`.*
- [`PPOTrainer`](ppo_trainer): *Further fine-tune the supervised fine-tuned model using the PPO algorithm.*
- [Best-of-N Sampling](best-of-n): *Use best-of-n sampling as an alternative way to sample predictions from your active model.*
- [`DPOTrainer`](dpo_trainer): *Direct Preference Optimization training using `DPOTrainer`.*
- [`TextEnvironment`](text_environments): *Text environment to train your model using tools with RL.*

## Examples

- [Sentiment Tuning](sentiment_tuning): *Fine-tune your model to generate positive movie reviews*
- [Training with PEFT](lora_tuning_peft): *Memory efficient RLHF training using adapters with PEFT*
- [Detoxifying LLMs](detoxifying_a_lm): *Detoxify your language model through RLHF*
- [StackLlama](using_llama_models): *End-to-end RLHF training of a Llama model on the Stack Exchange dataset*
- [Learning with Tools](learning_tools): *Walkthrough of using `TextEnvironments`*
- [Multi-Adapter Training](multi_adapter_rl): *Use a single base model and multiple adapters for memory efficient end-to-end training*

## Blog posts

<div class="mt-10">
<div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/dpo_vlm">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/dpo_vlm/thumbnail.png" alt="thumbnail">
<p class="text-gray-700">Preference Optimization for Vision Language Models with TRL</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/rlhf">
<img src="https://raw.githubusercontent.com/huggingface/blog/main/assets/120_rlhf/thumbnail.png" alt="thumbnail">
<p class="text-gray-700">Illustrating Reinforcement Learning from Human Feedback</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/trl-peft">
<img src="https://github.com/huggingface/blog/blob/main/assets/133_trl_peft/thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/stackllama">
<img src="https://github.com/huggingface/blog/blob/main/assets/138_stackllama/thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">StackLLaMA: A hands-on guide to train LLaMA with RLHF</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/dpo-trl">
<img src="https://github.com/huggingface/blog/blob/main/assets/157_dpo_trl/dpo_thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">Fine-tune Llama 2 with DPO</p>
</a>
<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="https://huggingface.co/blog/trl-ddpo">
<img src="https://github.com/huggingface/blog/blob/main/assets/166_trl_ddpo/thumbnail.png?raw=true" alt="thumbnail">
<p class="text-gray-700">Finetune Stable Diffusion Models with DDPO via TRL</p>
</a>
</div>
</div>
@ -12,7 +12,7 @@ pip install trl

You can also install the latest version from source. First clone the repo and then run the installation with `pip`:

```bash
git clone https://github.com/huggingface/trl.git
cd trl/
pip install -e .
```
docs/source/iterative_sft_trainer.mdx (new file)
# Iterative Trainer

Iterative fine-tuning is a training method that enables you to perform custom actions (generation and filtering, for example) between optimization steps. In TRL we provide an easy-to-use API to fine-tune your models in an iterative way in just a few lines of code.

## Usage

To get started quickly, instantiate a model and a tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import IterativeSFTTrainer

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

trainer = IterativeSFTTrainer(
    model,
    tokenizer,
)
```

You have the choice to either provide a list of strings or a list of tensors to the step function.

#### Using a list of tensors as input:

```python
inputs = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
}

trainer.step(**inputs)
```

#### Using a list of strings as input:

```python
inputs = {
    "texts": texts,
}

trainer.step(**inputs)
```

For causal language models, labels will automatically be created from `input_ids` or from `texts`. When using sequence-to-sequence models you will have to provide your own labels or `text_labels`.
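
To illustrate the iterative pattern itself, here is a sketch of a generate-filter-train loop; `prompt_batches`, `generate_fn`, and `keep_fn` are hypothetical stand-ins for your own generation and filtering logic:

```python
for prompt_batch in prompt_batches:
    # Custom action between optimization steps: generate candidates with the
    # current model, then keep only the ones that pass a custom filter.
    candidates = [generate_fn(prompt) for prompt in prompt_batch]
    kept = [text for text in candidates if keep_fn(text)]
    if kept:
        trainer.step(texts=kept)
```
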
## IterativeTrainer

[[autodoc]] IterativeSFTTrainer
docs/source/judges.mdx (new file)
# Judges

TRL provides judges to easily compare two completions.

Make sure to have installed the required dependencies by running:

```bash
pip install trl[llm_judge]
```

## Using the provided judges

TRL provides several judges out of the box. For example, you can use the `HfPairwiseJudge` to compare two completions using a pre-trained model from the Hugging Face model hub:

```python
from trl import HfPairwiseJudge

judge = HfPairwiseJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "Lyon"], ["Saturn", "Jupiter"]],
)  # Outputs: [0, 1]
```

## Define your own judge

To define your own judge, we provide several base classes that you can subclass. For rank-based judges, you need to subclass [`BaseRankJudge`] and implement the [`BaseRankJudge.judge`] method. For pairwise judges, you need to subclass [`BasePairwiseJudge`] and implement the [`BasePairwiseJudge.judge`] method. If you want to define a judge that doesn't fit into these categories, you need to subclass [`BaseJudge`] and implement the [`BaseJudge.judge`] method.

As an example, let's define a pairwise judge that prefers shorter completions:

```python
from trl import BasePairwiseJudge

class PrefersShorterJudge(BasePairwiseJudge):
    def judge(self, prompts, completions, shuffle_order=False):
        # Return the index (0 or 1) of the shorter completion in each pair.
        return [0 if len(completion[0]) <= len(completion[1]) else 1 for completion in completions]
```

You can then use this judge as follows:

```python
judge = PrefersShorterJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "The capital of France is Paris."], ["Jupiter is the biggest planet in the solar system.", "Jupiter"]],
)  # Outputs: [0, 1]
```

## BaseJudge

[[autodoc]] BaseJudge

## BaseRankJudge

[[autodoc]] BaseRankJudge

## BasePairwiseJudge

[[autodoc]] BasePairwiseJudge

## RandomRankJudge

[[autodoc]] RandomRankJudge

## RandomPairwiseJudge

[[autodoc]] RandomPairwiseJudge

## PairRMJudge

[[autodoc]] PairRMJudge

## HfPairwiseJudge

[[autodoc]] HfPairwiseJudge

## OpenAIPairwiseJudge

[[autodoc]] OpenAIPairwiseJudge
docs/source/kto_trainer.mdx (new file)
# KTO Trainer

TRL supports the Kahneman-Tversky Optimization (KTO) Trainer for aligning language models with binary feedback data (e.g., upvote/downvote), as described in the [paper](https://huggingface.co/papers/2402.01306) by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela.
For a full example have a look at [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py).

Depending on how good your base model is, you may or may not need to do SFT before KTO.
This is different from standard RLHF and DPO, which always require SFT.

## Expected dataset format

The KTO trainer expects a very specific format for the dataset as it does not require pairwise preferences. Since the model will be trained to directly optimize examples that consist of a prompt, a model completion, and a label indicating whether the completion is "good" or "bad", we expect a dataset with the following columns:

- `prompt`
- `completion`
- `label`

for example:

```python
kto_dataset_dict = {
    "prompt": [
        "Hey, hello",
        "How are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "completion": [
        "hi nice to meet you",
        "leave me alone",
        "I don't have a name",
        "My name is Mary",
        "Python",
        "C++",
        "Java",
    ],
    "label": [
        True,
        False,
        False,
        True,
        True,
        False,
        False,
    ],
}
```

where the `prompt` contains the context inputs, `completion` contains the corresponding responses, and `label` contains the corresponding flag that indicates if the generated completion is desired (`True`) or undesired (`False`).
A prompt can have multiple responses, and this is reflected in the entries being repeated in the dictionary's value arrays. It is required that the dataset contains at least one desirable and one undesirable completion.

## Expected model format

The KTO trainer expects a model of `AutoModelForCausalLM`, compared to PPO which expects `AutoModelForCausalLMWithValueHead` for the value function.

## Using the `KTOTrainer`

For a detailed example have a look at the `examples/scripts/kto.py` script. At a high level we need to initialize the `KTOTrainer` with a `model` we wish to train and a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected response.

The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e. decoder-only or encoder-decoder).

The `desirable_weight` and `undesirable_weight` refer to the weights placed on the losses for desirable/positive and undesirable/negative examples.
By default, they are both 1. However, if you have more of one or the other, then you should upweight the less common type such that the ratio of (`desirable_weight` \\(\times\\) number of positives) to (`undesirable_weight` \\(\times\\) number of negatives) is in the range 1:1 to 4:3.
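
As a hedged illustration of that rule (the counts below are made up): with 200 desirable and 100 undesirable examples, keeping `desirable_weight=1.0` and upweighting the rarer type lands the ratio inside the recommended band:

```python
n_pos, n_neg = 200, 100  # hypothetical counts of desirable/undesirable examples

desirable_weight = 1.0
undesirable_weight = 1.5  # chosen so the ratio below lands near 4:3

ratio = (desirable_weight * n_pos) / (undesirable_weight * n_neg)
print(ratio)  # 1.33..., i.e. about 4:3, inside the 1:1 to 4:3 range
```
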
<Tip>
It is strongly recommended you use a learning rate between `5e-7` and `5e-6` with an effective batch size between `8` and `32`, for both LoRA and full finetuning. Even if you are working with a small dataset, we do not recommend using a learning rate outside this range; instead, using smaller batch sizes and/or more training epochs will give you better results.
</Tip>

```py
training_args = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
    learning_rate=5e-7,
)

kto_trainer = KTOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```

After this one can then call:

```py
kto_trainer.train()
```

### For Mixture of Experts Models: Enabling the auxiliary loss

MOEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MOEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.

This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`).

## KTOTrainer

[[autodoc]] KTOTrainer

## KTOConfig

[[autodoc]] KTOConfig
docs/source/learning_tools.mdx (new file)
# Learning Tools (Experimental 🧪)

Using Large Language Models (LLMs) with tools has been a popular topic recently, with awesome works such as [ToolFormer](https://huggingface.co/papers/2302.04761) and [ToolBench](https://huggingface.co/papers/2305.16504). In TRL, we provide a simple example of how to teach an LLM to use tools with reinforcement learning.

Here's an overview of the scripts in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples/research_projects/tools):

| File | Description |
|---|---|
| [`calculator.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/tools/calculator.py) | Script to train an LLM to use a calculator with reinforcement learning. |
| [`triviaqa.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/tools/triviaqa.py) | Script to train an LLM to use a wiki tool to answer questions. |
| [`python_interpreter.py`](https://github.com/lvwerra/trl/blob/main/examples/research_projects/tools/python_interpreter.py) | Script to train an LLM to use a Python interpreter to solve math puzzles. |

<Tip warning={true}>

Note that the scripts above rely heavily on the `TextEnvironment` API which is still under active development. The API may change in the future. Please see [`TextEnvironment`](text_environment) for the related docs.
</Tip>

## Learning to Use a Calculator

The rough idea is as follows:

1. Load a tool such as [ybelkada/simple-calculator](https://huggingface.co/spaces/ybelkada/simple-calculator) that parses a text calculation like `"14 + 34"` and returns the calculated number:
```python
from transformers import load_tool

tool = load_tool("ybelkada/simple-calculator")
tool_fn = lambda text: str(round(float(tool(text)), 2))  # rounding to 2 decimal places
```
2. Define a reward function that returns a positive reward if the tool returns the correct answer. In the script we create a dummy reward function like `reward_fn = lambda x: 1`, but we override the rewards directly later.
3. Create a prompt that demonstrates how to use the tools:
```python
# system prompt
prompt = """\
What is 13.1-3?

<request><SimpleCalculatorTool>13.1-3<call>10.1<response>

Result=10.1<submit>

What is 4*3?

<request><SimpleCalculatorTool>4*3<call>12<response>

Result=12<submit>

What is 12.1+1?

<request><SimpleCalculatorTool>12.1+1<call>13.1<response>

Result=13.1<submit>

What is 12.1-20?

<request><SimpleCalculatorTool>12.1-20<call>-7.9<response>

Result=-7.9<submit>"""
```
4. Create a `trl.TextEnvironment` with the model:
```python
env = TextEnvironment(
    model,
    tokenizer,
    {"SimpleCalculatorTool": tool_fn},
    reward_fn,
    prompt,
    generation_kwargs=generation_kwargs,
)
```
5. Then generate some data such as `tasks = ["\n\nWhat is 13.1-3?", "\n\nWhat is 4*3?"]` and run the environment with `queries, responses, masks, rewards, histories = env.run(tasks)`. The environment will look for the `<call>` token in the prompt and append the tool output to the response; it will also return the mask associated with the response. You can further use the `histories` to visualize the interaction between the model and the tool; `histories[0].show_text()` will show the text with color-coded tool output and `histories[0].show_tokens(tokenizer)` will visualize the tokens.
6. Finally, we can train the model with `train_stats = ppo_trainer.step(queries, responses, rewards, masks)`. The trainer will use the mask to ignore the tool output when computing the loss; make sure to pass the masks to `step`. These last two steps are combined in the sketch below.
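
Steps 5 and 6 combined into one sketch (assuming `env` and `ppo_trainer` are initialized as described above):

```python
tasks = ["\n\nWhat is 13.1-3?", "\n\nWhat is 4*3?"]

# Run the environment: the model interleaves tool calls with generation.
queries, responses, masks, rewards, histories = env.run(tasks)

# Take a PPO step; the masks exclude tool outputs from the loss.
train_stats = ppo_trainer.step(queries, responses, rewards, masks)
```
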
## Experiment results

We trained a model with the above script for 10 random seeds. You can reproduce the run with the following command. Feel free to remove the `--slurm-*` arguments if you don't have access to a slurm cluster.

```
WANDB_TAGS="calculator_final" python benchmark/benchmark.py \
    --command "python examples/research_projects/tools/calculator.py" \
    --num-seeds 10 \
    --start-seed 1 \
    --workers 10 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 8 \
    --slurm-template-path benchmark/trl.slurm_template
```

We can then use [`openrlbenchmark`](https://github.com/openrlbenchmark/openrlbenchmark) which generates the following plot.

```
python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=openrlbenchmark&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.tracker_project_name&cen=trl_ppo_trainer_config.value.log_with&metrics=env/reward_mean&metrics=objective/kl' \
    'wandb?tag=calculator_final&cl=calculator_mask' \
    --env-ids trl \
    --check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename static/0compare \
    --scan-history
```

As we can see, while 1-2 experiments crashed for some reason, most of the runs obtained near-perfect proficiency in the calculator task.

## (Early Experiments 🧪): learning to use a wiki tool for question answering

The [ToolFormer](https://huggingface.co/papers/2302.04761) paper shows an interesting use case that utilizes a Wikipedia search tool to help answer questions. In this section, we attempt to perform similar experiments but use RL instead to teach the model to use a wiki tool on the [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) dataset.

<Tip warning={true}>

**Note that many settings are different so the results are not directly comparable.**
</Tip>

### Building a search index

Since [ToolFormer](https://huggingface.co/papers/2302.04761) did not open source its code, we needed to first replicate the search index. It is mentioned in their paper that the authors built the search index using a BM25 retriever that indexes the Wikipedia dump from [KILT](https://github.com/facebookresearch/KILT).

Fortunately, [`pyserini`](https://github.com/castorini/pyserini) already implements the BM25 retriever and provides a prebuilt index for the KILT Wikipedia dump. We can use the following code to search the index.

```python
import json

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('wikipedia-kilt-doc')

def search(query):
    hits = searcher.search(query, k=1)
    hit = hits[0]
    contents = json.loads(hit.raw)['contents']
    return contents

print(search("tennis racket"))
```
```
Racket (sports equipment)
A racket or racquet is a sports implement consisting of a handled frame with an open hoop across which a network of strings or catgut is stretched tightly. It is used for striking a ball or shuttlecock in games such as squash, tennis, racquetball, and badminton. Collectively, these games are known as racket sports. Racket design and manufacturing has changed considerably over the centuries.

The frame of rackets for all sports was traditionally made of solid wood (later laminated wood) and the strings of animal intestine known as catgut. The traditional racket size was limited by the strength and weight of the wooden frame which had to be strong enough to hold the strings and stiff enough to hit the ball or shuttle. Manufacturers started adding non-wood laminates to wood rackets to improve stiffness. Non-wood rackets were made first of steel, then of aluminum, and then carbon fiber composites. Wood is still used for real tennis, rackets, and xare. Most rackets are now made of composite materials including carbon fiber or fiberglass, metals such as titanium alloys, or ceramics.
...
```

We then deployed this snippet as a Hugging Face space [here](https://huggingface.co/spaces/vwxyzjn/pyserini-wikipedia-kilt-doc), so that we can use the space as a `transformers.Tool` later.

### Experiment settings

We use the following settings:

* use the `bigcode/starcoderbase` model as the base model
* use the `pyserini-wikipedia-kilt-doc` space as the wiki tool and only use the first paragraphs of the search result, allowing the `TextEnvironment` to obtain at most `max_tool_reponse=400` response tokens from the tool
* test if the response contains the answer string; if so, give a reward of 1, otherwise, give a reward of 0
    * note this is a simplified evaluation criterion. In [ToolFormer](https://huggingface.co/papers/2302.04761), the authors check if the first 20 words of the response contain the correct answer
* use the following prompt that demonstrates the usage of the wiki tool

```python
prompt = """\
Answer the following question:

Q: In which branch of the arts is Patricia Neary famous?
A: Ballets
A2: <request><Wiki>Patricia Neary<call>Patricia Neary (born October 27, 1942) is an American ballerina, choreographer and ballet director, who has been particularly active in Switzerland. She has also been a highly successful ambassador for the Balanchine Trust, bringing George Balanchine's ballets to 60 cities around the globe.<response>
Result=Ballets<submit>

Q: Who won Super Bowl XX?
A: Chicago Bears
A2: <request><Wiki>Super Bowl XX<call>Super Bowl XX was an American football game between the National Football Conference (NFC) champion Chicago Bears and the American Football Conference (AFC) champion New England Patriots to decide the National Football League (NFL) champion for the 1985 season. The Bears defeated the Patriots by the score of 46–10, capturing their first NFL championship (and Chicago's first overall sports victory) since 1963, three years prior to the birth of the Super Bowl. Super Bowl XX was played on January 26, 1986 at the Louisiana Superdome in New Orleans.<response>
Result=Chicago Bears<submit>

Q: """
```

### Result and Discussion

Our experiments show that the agent can learn to use the wiki tool to answer questions. The learning curves mostly trend upward, but one of the experiments did crash.

Wandb report is [here](https://wandb.ai/costa-huang/cleanRL/reports/TriviaQA-Final-Experiments--Vmlldzo1MjY0ODk5) for further inspection.

Note that the correct rate of the trained model is on the low end, which could be due to the following reasons:

* **incorrect searches:** When given the question `"What is Bruce Willis' real first name?"`, if the model searches for `Bruce Willis`, our wiki tool returns "Patrick Poivey (born 18 February 1948) is a French actor. He is especially known for his voice: he is the French dub voice of Bruce Willis since 1988." But a correct search should return "Walter Bruce Willis (born March 19, 1955) is an American former actor. He achieved fame with a leading role on the comedy-drama series Moonlighting (1985–1989) and appeared in over a hundred films, gaining recognition as an action hero after his portrayal of John McClane in the Die Hard franchise (1988–2013) and other roles."

* **unnecessarily long response**: The wiki tool by default sometimes outputs very long sequences. E.g., when the wiki tool searches for "Brown Act":
    * Our wiki tool returns "The Ralph M. Brown Act, located at California Government Code 54950 "et seq.", is an act of the California State Legislature, authored by Assemblymember Ralph M. Brown and passed in 1953, that guarantees the public's right to attend and participate in meetings of local legislative bodies."
    * [ToolFormer](https://huggingface.co/papers/2302.04761)'s wiki tool returns "The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public's right to attend and participate in meetings of local legislative bodies." which is more succinct.

## (Early Experiments 🧪): solving math puzzles with a Python interpreter

In this section, we attempt to teach the model to use a Python interpreter to solve math puzzles. The rough idea is to give the agent a prompt like the following:

```python
prompt = """\
Example of using a Python API to solve math questions.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

<request><PythonInterpreter>
def solution():
    money_initial = 23
    bagels = 5
    bagel_cost = 3
    money_spent = bagels * bagel_cost
    money_left = money_initial - money_spent
    result = money_left
    return result
print(solution())
<call>72<response>

Result = 72 <submit>

Q: """
```

The training experiment can be found at https://wandb.ai/lvwerra/trl-gsm8k/runs/a5odv01y
docs/source/logging.mdx (new file)
# Logging

As reinforcement learning algorithms are historically challenging to debug, it's important to pay careful attention to logging.
By default, the TRL [`PPOTrainer`] saves a lot of relevant information to `wandb` or `tensorboard`.

Upon initialization, pass one of these two options to the [`PPOConfig`]:

```python
config = PPOConfig(
    model_name=args.model_name,
    log_with="wandb",  # or "tensorboard"
)
```

If you want to log with tensorboard, add the kwarg `project_kwargs={"logging_dir": PATH_TO_LOGS}` to the PPOConfig.
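
For example (a sketch; `PATH_TO_LOGS` is a placeholder for a directory of your choice):

```python
config = PPOConfig(
    model_name=args.model_name,
    log_with="tensorboard",
    project_kwargs={"logging_dir": PATH_TO_LOGS},  # where tensorboard event files go
)
```
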
## PPO Logging

Here's a brief explanation for the logged metrics provided in the data:

Key metrics to monitor. We want to maximize the reward, maintain a low KL divergence, and maximize entropy:
1. `env/reward_mean`: The average reward obtained from the environment. Alias `ppo/mean_scores`, which is used to specifically monitor the reward model.
1. `env/reward_std`: The standard deviation of the reward obtained from the environment. Alias `ppo/std_scores`, which is used to specifically monitor the reward model.
1. `env/reward_dist`: The histogram distribution of the reward obtained from the environment.
1. `objective/kl`: The mean Kullback-Leibler (KL) divergence between the old and new policies. It measures how much the new policy deviates from the old policy. The KL divergence is used to compute the KL penalty in the objective function.
1. `objective/kl_dist`: The histogram distribution of the `objective/kl`.
1. `objective/kl_coef`: The coefficient for Kullback-Leibler (KL) divergence in the objective function.
1. `ppo/mean_non_score_reward`: The **KL penalty**, calculated by `objective/kl * objective/kl_coef`; it is added to the score to form the total reward for optimization and prevents the new policy from deviating too far from the old policy.
1. `objective/entropy`: The entropy of the model's policy, calculated by `-logprobs.sum(-1).mean()`. High entropy means the model's actions are more random, which can be beneficial for exploration.

Training stats:
1. `ppo/learning_rate`: The learning rate for the PPO algorithm.
1. `ppo/policy/entropy`: The entropy of the model's policy, calculated by `pd = torch.nn.functional.softmax(logits, dim=-1); entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)`. It measures the randomness of the policy.
1. `ppo/policy/clipfrac`: The fraction of probability ratios (old policy / new policy) that fell outside the clipping range in the PPO objective. This can be used to monitor the optimization process.
1. `ppo/policy/approxkl`: The approximate KL divergence between the old and new policies, measured by `0.5 * masked_mean((logprobs - old_logprobs) ** 2, mask)`, corresponding to the `k2` estimator in http://joschu.net/blog/kl-approx.html
1. `ppo/policy/policykl`: Similar to `ppo/policy/approxkl`, but measured by `masked_mean(old_logprobs - logprobs, mask)`, corresponding to the `k1` estimator in http://joschu.net/blog/kl-approx.html
1. `ppo/policy/ratio`: The histogram distribution of the ratio between the new and old policies, used to compute the PPO objective.
1. `ppo/policy/advantages_mean`: The average of the GAE (Generalized Advantage Estimation) advantage estimates. The advantage function measures how much better an action is compared to the average action at a state.
1. `ppo/policy/advantages`: The histogram distribution of `ppo/policy/advantages_mean`.
1. `ppo/returns/mean`: The mean of the TD(λ) returns, calculated by `returns = advantage + values`, another indicator of model performance. See https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ for more details.
1. `ppo/returns/var`: The variance of the TD(λ) returns, calculated by `returns = advantage + values`, another indicator of model performance.
1. `ppo/val/mean`: The mean of the values, used to monitor the value function's performance.
1. `ppo/val/var`: The variance of the values, used to monitor the value function's performance.
1. `ppo/val/var_explained`: The explained variance for the value function, used to monitor the value function's performance.
1. `ppo/val/clipfrac`: The fraction of the value function's predicted values that are clipped.
1. `ppo/val/vpred`: The predicted values from the value function.
1. `ppo/val/error`: The mean squared error between `ppo/val/vpred` and the returns, used to monitor the value function's performance.
1. `ppo/loss/policy`: The policy loss for the Proximal Policy Optimization (PPO) algorithm.
1. `ppo/loss/value`: The loss for the value function in the PPO algorithm. This value quantifies how well the function estimates the expected future rewards.
1. `ppo/loss/total`: The total loss for the PPO algorithm. It is the sum of the policy loss and the value function loss.

Stats on queries, responses, and logprobs:
1. `tokens/queries_len_mean`: The average length of the query tokens.
1. `tokens/queries_len_std`: The standard deviation of the length of the query tokens.
1. `tokens/queries_dist`: The histogram distribution of the length of the query tokens.
1. `tokens/responses_len_mean`: The average length of the response tokens.
1. `tokens/responses_len_std`: The standard deviation of the length of the response tokens.
1. `tokens/responses_dist`: The histogram distribution of the length of the response tokens. (Costa: inconsistent naming, should be `tokens/responses_len_dist`)
1. `objective/logprobs`: The histogram distribution of the log probabilities of the actions taken by the model.
1. `objective/ref_logprobs`: The histogram distribution of the log probabilities of the actions taken by the reference model.

### Crucial values
During training, many values are logged; here are the most important ones:

1. `env/reward_mean`, `env/reward_std`, `env/reward_dist`: the properties of the reward distribution from the "environment" / reward model
1. `ppo/mean_non_score_reward`: The mean negated KL penalty during training (shows the delta between the reference model and the new policy over the batch in the step)

Here are some parameters that are useful to monitor for stability (when these diverge or collapse to 0, try tuning variables):

1. `ppo/loss/value`: it will spike / NaN when not going well.
1. `ppo/policy/ratio`: a `ratio` of 1 is the baseline value, meaning that the probability of sampling a token is the same under the new and old policy. If the ratio is too high, like 200, it means the probability of sampling a token is 200 times higher under the new policy than the old policy. This is a sign that the new policy is too different from the old policy, which will likely cause overoptimization and collapse training later on.
1. `ppo/policy/clipfrac` and `ppo/policy/approxkl`: if `ratio` is too high, the `ratio` is going to get clipped, resulting in high `clipfrac` and high `approxkl` as well.
1. `objective/kl`: it should stay positive so that the policy is not too far away from the reference policy.
1. `objective/kl_coef`: The target coefficient with [`AdaptiveKLController`]. Often increases before numerical instabilities.
docs/source/lora_tuning_peft.mdx (new file)
# Examples of using peft with trl to finetune 8-bit models with Low Rank Adaptation (LoRA)

The notebooks and scripts in these examples show how to use Low Rank Adaptation (LoRA) to fine-tune models in a memory-efficient manner. Most of the PEFT methods in the peft library are supported, but note that some, such as prompt tuning, are not.
For more information on LoRA, see the [original paper](https://huggingface.co/papers/2106.09685).

Here's an overview of the `peft`-enabled notebooks and scripts in the [trl repository](https://github.com/huggingface/trl/tree/main/examples):

| File | Task | Description |
|---|---|---|
| [`stack_llama/rl_training.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/rl_training.py) | RLHF | Distributed fine-tuning of the 7b parameter LLaMA models with a learned reward model and `peft`. |
| [`stack_llama/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/reward_modeling.py) | Reward Modeling | Distributed training of the 7b parameter LLaMA reward model with `peft`. |
| [`stack_llama/supervised_finetuning.py`](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/supervised_finetuning.py) | SFT | Distributed instruction/supervised fine-tuning of the 7b parameter LLaMA model with `peft`. |

## Installation
Note: peft is in active development, so we install directly from their GitHub page.
Peft also relies on the latest version of transformers.

```bash
pip install trl[peft]
pip install bitsandbytes loralib
pip install git+https://github.com/huggingface/transformers.git@main
# optional: wandb
pip install wandb
```

Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).

## How to use it?

Simply declare a `PeftConfig` object in your script and pass it through `.from_pretrained` to load the TRL+PEFT model.

```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead

model_id = "edbeeching/gpt-neo-125M-imdb"
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_id,
    peft_config=lora_config,
)
```
And if you want to load your model in 8bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    load_in_8bit=True,
    peft_config=lora_config,
)
```
... or in 4bit precision:
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    peft_config=lora_config,
    load_in_4bit=True,
)
```

## Launch scripts

The `trl` library is powered by `accelerate`. As such it is best to configure and launch trainings with the following commands:

```bash
accelerate config # will prompt you to define the training configuration
accelerate launch examples/scripts/ppo.py --use_peft # launches training
```
|
||||
|
||||
## Using `trl` + `peft` and Data Parallelism
|
||||
|
||||
You can scale up to as many GPUs as you want, as long as you are able to fit the training process in a single device. The only tweak you need to apply is to load the model as follows:
|
||||
```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead
...

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    peft_config=lora_config,
)
```
|
||||
And if you want to load your model in 8bit precision:
|
||||
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    peft_config=lora_config,
    load_in_8bit=True,
)
```
|
||||
... or in 4bit precision:
|
||||
```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    peft_config=lora_config,
    load_in_4bit=True,
)
```
|
||||
Finally, make sure that the rewards are computed on the correct device as well; for that, you can use `ppo_trainer.model.current_device`.
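
A minimal sketch, assuming `scores` is a hypothetical list of floats returned by your reward function:

```python
import torch

# place each scalar reward on the device hosting the trained model
# before passing the rewards to `ppo_trainer.step`
device = ppo_trainer.model.current_device
rewards = [torch.tensor(score, device=device) for score in scores]
```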
|
||||
|
||||
## Naive pipeline parallelism (NPP) for large models (>60B models)
|
||||
|
||||
The `trl` library also supports naive pipeline parallelism (NPP) for large models (>60B). In this paradigm, the model and the adapters are loaded across multiple GPUs, and the activations and gradients are naively communicated across the GPUs. This supports `int8` models as well as other `dtype` models.
|
||||
|
||||
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-npp.png">
</div>
|
||||
|
||||
### How to use NPP?
|
||||
|
||||
Simply load your model with a custom `device_map` argument on the `from_pretrained` to split your model across multiple devices. Check out this [nice tutorial](https://github.com/huggingface/blog/blob/main/accelerate-large-models.md) on how to properly create a `device_map` for your model.
|
||||
|
||||
Also make sure to have the `lm_head` module on the first GPU device, as it may throw an error if it is not on the first device. At the time of writing, you need to install the `main` branch of `accelerate` (`pip install git+https://github.com/huggingface/accelerate.git@main`) and `peft` (`pip install git+https://github.com/huggingface/peft.git@main`).
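
For instance, a minimal sketch reusing the `lora_config` from above and letting `accelerate` infer the split (a hand-written `device_map` dict works the same way, as long as `lm_head` maps to the first device):

```python
pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    peft_config=lora_config,
    device_map="auto",  # infer a split across all available GPUs
)
```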
|
||||
|
||||
### Launch scripts
|
||||
|
||||
Although the `trl` library is powered by `accelerate`, you should run your training script in a single process. Note that we do not support Data Parallelism together with NPP yet.
|
||||
|
||||
```bash
python PATH_TO_SCRIPT
```
|
||||
|
||||
## Fine-tuning Llama-2 model
|
||||
|
||||
You can easily fine-tune the Llama 2 model using `SFTTrainer` and the official script! For example, to fine-tune llama2-7b on the Guanaco dataset, run (tested on a single NVIDIA T4 16GB):
|
||||
|
||||
```bash
python examples/scripts/sft.py --output_dir sft_openassistant-guanaco --model_name meta-llama/Llama-2-7b-hf --dataset_name timdettmers/openassistant-guanaco --load_in_4bit --use_peft --per_device_train_batch_size 4 --gradient_accumulation_steps 2
```
|
|
||||
# Multi Adapter RL (MARL) - a single base model for everything
|
||||
|
||||
Here we present an approach that uses a single base model for the entire PPO algorithm - which includes retrieving the reference logits, computing the active logits and the rewards. This feature is experimental as we did not test the convergence of the approach. We encourage the community to let us know if they potentially face issues.
|
||||
|
||||
## Requirements
|
||||
|
||||
You just need to install `peft`, and optionally `bitsandbytes` if you want to use 8-bit base models for more memory-efficient fine-tuning.
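
For example:

```bash
pip install peft bitsandbytes
```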
|
||||
|
||||
## Summary
|
||||
|
||||
You need to address this approach in three stages, which we summarize as follows:
|
||||
|
||||
1- Train a base model on the target domain (e.g. [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb)) - this is the Supervised Fine Tuning stage - it can leverage the `SFTTrainer` from TRL.
|
||||
2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py).
|
||||
3- Fine tune new adapters on the base model using PPO and the reward adapter. ("0 abstraction RL")
|
||||
|
||||
Make sure to use the same model (i.e. same architecture and same weights) for stages 2 & 3.
|
||||
|
||||
## Quickstart
|
||||
|
||||
Let us assume you have trained your reward adapter on the `llama-7b` model using `RewardTrainer` and pushed the weights to the Hub under `trl-lib/llama-7b-hh-rm-adapter`.
|
||||
When doing PPO, before passing the model to `PPOTrainer`, create your model as follows:
|
||||
|
||||
```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# PPO adapter
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
)

...
trainer = PPOTrainer(
    model=model,
    ...
)

...
```
|
||||
Then inside your PPO training loop, call the `compute_reward_score` method by accessing the `model` attribute from `PPOTrainer`.
|
||||
|
||||
```python
rewards = trainer.model.compute_reward_score(**inputs)
```
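
A sketch of how this fits into the loop, assuming `texts` holds your concatenated query/response strings and `tokenizer` is the base model's tokenizer:

```python
inputs = tokenizer(texts, padding=True, return_tensors="pt").to(trainer.model.current_device)
rewards = trainer.model.compute_reward_score(**inputs)
```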
|
||||
|
||||
## Advanced usage
|
||||
|
||||
### Control on the adapter name
|
||||
|
||||
If you are familiar with the `peft` library, you know that you can use multiple adapters inside the same model. What you can do is train multiple adapters on the same base model to fine-tune on different policies.
In this case, you want to be able to control which adapter to reactivate after retrieving the reward. For that, simply pass the appropriate `adapter_name` to the `ppo_adapter_name` argument when calling `compute_reward_score`.
|
||||
|
||||
```python
adapter_name_policy_1 = "policy_1"
rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)
...
```
|
||||
|
||||
### Using 4-bit and 8-bit base models
|
||||
|
||||
For more memory-efficient fine-tuning, you can load your base model in 8-bit or 4-bit while keeping the adapters in the default precision (float32).
Just pass the appropriate arguments (i.e. `load_in_8bit=True` or `load_in_4bit=True`) to `AutoModelForCausalLMWithValueHead.from_pretrained` as follows (assuming you have installed `bitsandbytes`):
|
||||
```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# PPO adapter
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
    load_in_8bit=True,
)

...
trainer = PPOTrainer(
    model=model,
    ...
)
...
```
|
|
||||
# Nash-MD Trainer
|
||||
|
||||
## Overview
|
||||
|
||||
Nash-MD was proposed in the paper [Nash Learning from Human Feedback](https://huggingface.co/papers/2312.00886) by Rémi Munos, [Michal Valko](https://huggingface.co/misovalko), Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mésnard, and Andrea Michi.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
> Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
|
||||
|
||||
This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif) and [Daniil Tiapkin](https://huggingface.co/dtiapkin), [Pierre Ménard](https://huggingface.co/menardprr), Daniele Calandriello and [Quentin Gallouédec](https://huggingface.co/qgallouedec).
|
||||
|
||||
## Quick start
|
||||
|
||||
This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and the [Qwen 0.5B reward model](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) as the reward model. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
|
||||
|
||||
<iframe
  src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>
|
||||
|
||||
Below is the script to train the model:
|
||||
|
||||
```python
# train_nash_md.py
from datasets import load_dataset
from trl import NashMDConfig, NashMDTrainer
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

args = NashMDConfig(output_dir="nash-md-qwen2", logging_steps=10)
trainer = NashMDTrainer(
    model=model,
    reward_model=reward_model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```
|
||||
|
||||
Execute the script using the following command:
|
||||
|
||||
```bash
accelerate launch train_nash_md.py
```
|
||||
|
||||
## Expected dataset format
|
||||
|
||||
Nash-MD requires a [prompt-only dataset](dataset_format#prompt-only). The [`NashMDTrainer`] supports both [conversational](dataset_format#conversational-dataset-format) and [standard](dataset_format#standard-dataset-format) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
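
For reference, a prompt-only example in each format looks like this (values are illustrative):

```python
# standard format: a bare prompt string
example_standard = {"prompt": "The sky is"}

# conversational format: a list of chat messages
example_conversational = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
```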
|
||||
|
||||
## Usage tips
|
||||
|
||||
### ⚠️ Use the same chat template
|
||||
|
||||
Make sure that the SFT model and reward model use the _same_ chat template. Otherwise, you may find the model completions are scored incorrectly during training.
|
||||
|
||||
### Encourage EOS token generation
|
||||
|
||||
We may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum completion length specified in the `max_new_tokens` argument of [`NashMDConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum completion length, you can use the `missing_eos_penalty` argument of [`NashMDConfig`]:
|
||||
|
||||
```python
args = NashMDConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
|
||||
|
||||
### Logging Completions
|
||||
|
||||
To better understand your model’s behavior during training, you can log sample completions periodically using the [`LogCompletionsCallback`].
|
||||
|
||||
```python
from trl import LogCompletionsCallback

trainer = NashMDTrainer(..., eval_dataset=eval_dataset)
completions_callback = LogCompletionsCallback(trainer, num_prompts=8)
trainer.add_callback(completions_callback)
```
|
||||
|
||||
This callback logs the model's generated completions directly to Weights & Biases.
|
||||
|
||||

|
||||
|
||||
## Example script
|
||||
|
||||
We provide an example script to train a model using the Nash-MD method. The script is available in [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py)
|
||||
|
||||
To test the Nash-MD script with the [Pythia 14M model](https://huggingface.co/EleutherAI/pythia-14m) on the TL;DR summarization task, run the following command:
|
||||
|
||||
```bash
python examples/scripts/nash_md.py \
    --model_name_or_path EleutherAI/pythia-14m  \
    --reward_model_path EleutherAI/pythia-14m \
    --dataset_name trl-lib/tldr \
    --learning_rate 5.0e-7 \
    --output_dir pythia-14m-tldr-nash-md \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --num_train_epochs 3 \
    --max_new_tokens 64 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --push_to_hub
```
|
||||
|
||||
## Logged metrics
|
||||
|
||||
The logged metrics are as follows:
|
||||
|
||||
* `loss/kl`: The mean KL divergence between the model and reference data.
|
||||
* `objective/entropy`: The mean entropy of the model and reference data.
|
||||
* `loss/score`: The mean REINFORCE score loss.
|
||||
* `rewards/chosen`: The mean scores (according to the reward model) of the model completions.
|
||||
* `rewards/rejected`: The mean scores (according to the reward model) of the mixture completions.
|
||||
* `rewards/accuracies`: The accuracies of the Nash-MD's implicit reward model.
|
||||
* `rewards/margins`: The mean reward margin (according to reward model) between the chosen and mixture completions.
|
||||
* `logps/chosen`: The mean log probabilities of the chosen completions.
|
||||
* `logps/rejected`: The mean log probabilities of the reference completions.
|
||||
* `val/model_contain_eos_token`: The number of times the model's output contains the EOS token.
* `val/ref_contain_eos_token`: The number of times the mixture's output contains the EOS token.
|
||||
* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but can be made dynamic by passing a list to [`NashMDConfig`].
|
||||
* `mixture_coef`: Logit mixture coefficient for the model and reference model. Typically fixed, but can be made dynamic by passing a list to [`NashMDConfig`] (see the sketch below).
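
A minimal sketch of making both parameters dynamic, assuming the per-epoch list semantics described in the config docstring (the values are illustrative):

```python
args = NashMDConfig(
    output_dir="nash-md-qwen2",
    beta=[0.5, 0.1, 0.05],            # one value per epoch, last value reused afterwards
    mixture_coef=[0.5, 0.25, 0.125],
)
```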
|
||||
|
||||
## NashMDTrainer
|
||||
|
||||
[[autodoc]] NashMDTrainer
|
||||
|
||||
## NashMDConfig
|
||||
|
||||
[[autodoc]] NashMDConfig
|
|
||||
# Online DPO Trainer
|
||||
|
||||
## Overview
|
||||
|
||||
Online DPO was proposed in [Direct Language Model Alignment from Online AI Feedback](https://huggingface.co/papers/2402.04792) by Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, and Mathieu Blondel.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
> Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF), that do not require a separate reward model. However, the preference datasets used in DAP methods are usually collected ahead of training and never updated, thus the feedback is purely offline. Moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as annotator: on each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback. Despite its simplicity, we demonstrate via human evaluation in several tasks that OAIF outperforms both offline DAP and RLHF methods. We further show that the feedback leveraged in OAIF is easily controllable, via instruction prompts to the LLM annotator.
|
||||
|
||||
The current implementation uses reward models for scoring completions -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use.
|
||||
|
||||
This post-training method was contributed by [Michael Noukhovitch](https://huggingface.co/mnoukhov), [Shengyi Costa Huang](https://huggingface.co/vwxyzjn), [Quentin Gallouédec](https://huggingface.co/qgallouedec), and [Edward Beeching](https://huggingface.co/edbeeching).
|
||||
|
||||
## Quick start
|
||||
|
||||
This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and the [Qwen 0.5B reward model](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) as the reward model. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
|
||||
|
||||
<iframe
  src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>
|
||||
|
||||
Below is the script to train the model:
|
||||
|
||||
```python
# train_online_dpo.py
from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

args = OnlineDPOConfig(output_dir="online-dpo-qwen2", logging_steps=10)
trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```
|
||||
|
||||
Execute the script using the following command:
|
||||
|
||||
```bash
accelerate launch train_online_dpo.py
```
|
||||
|
||||
Distributed across 8 GPUs, the training takes approximately 1 hour. You can verify the training progress by checking the reward graph. An increasing trend in both the reward for rejected and chosen completions indicates that the model is improving and generating better responses over time.
|
||||
|
||||

|
||||
|
||||
To see how the trained model performs, use the following code to generate completions:
|
||||
|
||||
```python
>>> from transformers import pipeline
>>> generator = pipeline("text-generation", model="online-dpo-qwen2/checkpoint-1773", device="cuda")
>>> question = "Why is the problem always DNS?"
>>> output = generator([{"role": "user", "content": question}], max_new_tokens=200, return_full_text=False)[0]
>>> print(output["generated_text"])
The reason why the problem of DNS (Domain Name System) can always be encountered is that it is designed to provide reliable and accurate information about the availability, ownership, or expiration of domain names. However, there may be some circumstances where the system fails to resolve an IP address correctly, leading to the problem of DNS.
For example, if the server hosting the domain name does not have the correct IP address associated with it, or if the IP address is incorrectly formatted, then the DNS system will fail to resolve the domain name correctly. Additionally, if the server hosting the domain name has been compromised, then the DNS system may also fail to resolve the domain name correctly.
It's worth noting that the exact cause of DNS failure can vary depending on the specific situation, so it's important to carefully check all relevant factors before attempting to resolve the issue. If you suspect that your DNS problem may be caused by a bug in the system, you should report it to the DNS provider directly for further investigation.
```
|
||||
|
||||
## Expected dataset format
|
||||
|
||||
Online DPO only requires a [prompt-only dataset](dataset_format#prompt-only) (unlike offline DPO, which expects a [preference dataset](dataset_format#preference)). The [`OnlineDPOTrainer`] supports both [conversational](dataset_format#conversational-dataset-format) and [standard](dataset_format#standard-dataset-format) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
|
||||
|
||||
## Usage tips
|
||||
|
||||
### ⚠️ Use the same chat template
|
||||
|
||||
Make sure that the SFT model and reward model use the _same_ chat template. Otherwise, you may find the model completions are scored incorrectly during training.
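
A quick sanity check (a sketch, using the models from the quick start above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")
assert tokenizer.chat_template == reward_tokenizer.chat_template
```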
|
||||
|
||||
### Encourage EOS token generation
|
||||
|
||||
We may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum completion length specified in the `max_new_tokens` argument of [`OnlineDPOConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum completion length, you can use the `missing_eos_penalty` argument of [`OnlineDPOConfig`]:
|
||||
|
||||
```python
args = OnlineDPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```
|
||||
|
||||
### Logging Completions
|
||||
|
||||
To better understand your model’s behavior during training, you can log sample completions periodically using the [`LogCompletionsCallback`].
|
||||
|
||||
```python
from trl import LogCompletionsCallback

trainer = OnlineDPOTrainer(..., eval_dataset=eval_dataset)
completions_callback = LogCompletionsCallback(trainer, num_prompts=8)
trainer.add_callback(completions_callback)
```
|
||||
|
||||
This callback logs the model's generated completions directly to Weights & Biases.
|
||||
|
||||

|
||||
|
||||
|
||||
## Example script
|
||||
|
||||
We provide an example script to train a model using the online DPO method. The script is available in [`examples/scripts/dpo_online.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_online.py)
|
||||
|
||||
To test the online DPO script with the [Pythia 1B model](https://huggingface.co/trl-lib/pythia-1b-deduped-tldr-sft) on the TL;DR summarization task, run the following command:
|
||||
|
||||
```bash
python examples/scripts/dpo_online.py \
    --model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft  \
    --reward_model_path trl-lib/pythia-1b-deduped-tldr-rm \
    --dataset_name trl-lib/tldr \
    --learning_rate 5.0e-7 \
    --output_dir pythia-1b-tldr-online-dpo \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --num_train_epochs 3 \
    --max_new_tokens 53 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --push_to_hub
```
|
||||
|
||||
## Logged metrics
|
||||
|
||||
The logged metrics are as follows. Here is an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/dd2o3g35)
|
||||
|
||||
* `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current model and reference model.
|
||||
* `objective/entropy`: The mean entropy of the model, indicating the randomness of the actions chosen by the model.
|
||||
* `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
|
||||
* `objective/rlhf_reward`: The mean RLHF reward, which is `scores - non_score_reward`. The `rlhf_reward` is the ultimate objective of online DPO training. If training works as intended, this metric should keep going up.
|
||||
* `objective/scores`: The mean scores returned by the reward model.
|
||||
* `objective/scores_margin`: The mean score margin (according to the external reward model) between the chosen and rejected completions.
|
||||
* `rewards/chosen`: The mean reward (according to online DPO's implicit reward model) of the chosen completions.
|
||||
* `rewards/rejected`: The mean reward (according to online DPO's implicit reward model) of the rejected completions.
|
||||
* `rewards/accuracies`: The accuracies of the online DPO's implicit reward model.
|
||||
* `rewards/margins`: The mean reward margin (according to online DPO's implicit reward model) between the chosen and rejected completions.
|
||||
* `logps/chosen`: The mean log probabilities of the chosen completions.
|
||||
* `logps/rejected`: The mean log probabilities of the rejected completions.
|
||||
* `val/contain_eos_token`: The fraction of completions which contain an EOS token.
|
||||
* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but can be made dynamic by passing a list to [`OnlineDPOConfig`].
|
||||
|
||||
## Benchmark experiments
|
||||
|
||||
To validate the online DPO implementation works, we ran experiments with the Pythia 1B, 2.8B, and 6.9B models on a single node of 8 x H100s. Here are the commands we used to run the experiments. We take the SFT / RM models directly from [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
|
||||
|
||||
|
||||
```bash
# 1B Online DPO experiment
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml \
    examples/scripts/dpo_online.py \
    --model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft \
    --reward_model_path trl-lib/pythia-1b-deduped-tldr-rm \
    --dataset_name trl-lib/tldr \
    --learning_rate 5.0e-7 \
    --output_dir pythia-1b-deduped-tldr-online-dpo \
    --beta 0.1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --num_train_epochs 3 \
    --max_new_tokens 53 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --logging_steps 20 \
    --save_steps 0.1 \
    --push_to_hub

# 2.8B Online DPO experiment
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/dpo_online.py \
    --model_name_or_path trl-lib/pythia-2.8b-deduped-tldr-sft \
    --reward_model_path trl-lib/pythia-2.8b-deduped-tldr-rm \
    --dataset_name trl-lib/tldr \
    --learning_rate 5.0e-7 \
    --output_dir pythia-2.8b-deduped-tldr-online-dpo \
    --beta 0.1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --num_train_epochs 3 \
    --max_new_tokens 53 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --bf16 \
    --logging_steps 20 \
    --save_steps 0.1 \
    --push_to_hub

# 6.9B Online DPO experiment
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/dpo_online.py \
    --model_name_or_path trl-lib/pythia-6.9b-deduped-tldr-sft \
    --reward_model_path trl-lib/pythia-6.9b-deduped-tldr-rm \
    --dataset_name trl-lib/tldr \
    --learning_rate 5.0e-7 \
    --output_dir pythia-6.9b-deduped-tldr-online-dpo \
    --beta 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 3 \
    --max_new_tokens 53 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --bf16 \
    --gradient_checkpointing \
    --logging_steps 20 \
    --save_steps 0.1 \
    --push_to_hub
```
|
||||
|
||||
Checkpoints and experiment tracking are available at:
|
||||
|
||||
- [🤗 Model checkpoints](https://huggingface.co/collections/trl-lib/online-dpo-66acd3fa38a331a9cd457b07)
|
||||
- [🐝 Tracked experiment](https://wandb.ai/huggingface/trl/reports/Online-DPO-experiments-for-TL-DR-summarisation--Vmlldzo5MTczMDU0)
|
||||
|
||||
|
||||
To evaluate, we use [vLLM](https://github.com/vllm-project/vllm) to load the checkpoints and GPT-4o mini as a judge model to evaluate the generated TL;DR against the reference TL;DR.
|
||||
For more information on how to use judges, see [Judges](judges).
|
||||
|
||||
```bash
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 33.00%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-6.9b-deduped-tldr-sft --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 41.50%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-1b-deduped-tldr-online-dpo --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 62.60%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path trl-lib/pythia-6.9b-deduped-tldr-online-dpo --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 74.20%
```
|
||||
|
||||
We can then plot the RLHF scaling chart.
|
||||
|
||||
```python
import matplotlib.pyplot as plt

results = {
    "SFT": {1.0e9: 0.21, 2.8e9: 0.27, 6.9e9: 0.316},
    "online-dpo": {1.0e9: 0.542, 2.8e9: 0.746, 6.9e9: 0.796},
    "offline-dpo": {1.0e9: 0.422, 2.8e9: 0.517, 6.9e9: 0.701},
}

# wrap the dict views in `list(...)` so matplotlib receives plain sequences
plt.plot(list(results["SFT"].keys()), list(results["SFT"].values()), label="SFT", marker="o")
plt.plot(list(results["online-dpo"].keys()), list(results["online-dpo"].values()), label="Online-dpo with RM judge", marker="o")
plt.plot(list(results["offline-dpo"].keys()), list(results["offline-dpo"].values()), label="Offline-dpo", marker="o")
plt.axhline(y=0.5, color="black", linestyle="-.", label="Human reference summary")
plt.xscale("log")
plt.xlabel("Model size")
plt.ylabel("Win rate against reference summaries\n(according to GPT-4o mini)")
plt.title("DPO scaling by model size")
plt.legend()
plt.xlim(5e8, 1.2e10)
plt.xticks([1e9, 3e9, 1e10], ["1B", "3B", "10B"])
plt.grid(True, which="both", ls="--", c="0.7")
plt.tight_layout()
plt.show()
```
|
||||
|
||||

|
||||
|
||||
The online DPO checkpoint achieves an increasingly higher win rate as the model size scales up. This is a good sign that the online DPO implementation is working as intended.
|
||||
|
||||
## OnlineDPOTrainer
|
||||
|
||||
[[autodoc]] OnlineDPOTrainer
|
||||
|
||||
## OnlineDPOConfig
|
||||
|
||||
[[autodoc]] OnlineDPOConfig
|
|
||||
# ORPO Trainer
|
||||
|
||||
[Odds Ratio Preference Optimization](https://huggingface.co/papers/2403.07691) (ORPO) by Jiwoo Hong, Noah Lee, and James Thorne studies the crucial role of SFT within the context of preference alignment. Using preference data, the method posits that a minor penalty for the disfavored generation, together with a strong adaptation signal to the chosen response via a simple log odds ratio term appended to the NLL loss, is sufficient for preference-aligned SFT.
|
||||
|
||||
Thus ORPO is a reference-model-free preference optimization algorithm, eliminating the need for an additional preference alignment phase and thereby saving compute and memory.
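
Concretely, the objective from the paper augments the SFT (NLL) loss with a weighted log odds ratio term (a sketch of eq. (6) in the paper's notation; $\lambda$ is exposed as `beta` in the trainer):

$$
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right], \quad
\mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right), \quad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$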
|
||||
|
||||
The official code can be found at [xfactlab/orpo](https://github.com/xfactlab/orpo).
|
||||
|
||||
## Expected dataset format
|
||||
|
||||
The ORPO trainer expects a format identical to the DPO trainer, which should include three entries. These entries should be named as follows:
|
||||
|
||||
- `prompt`
|
||||
- `chosen`
|
||||
- `rejected`
|
||||
|
||||
for example:
|
||||
|
||||
```py
orpo_dataset_dict = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}
```
|
||||
where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses and `rejected` contains the corresponding negative (rejected) responses. Note that a prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays.
|
||||
|
||||
## Expected model format
|
||||
The ORPO trainer expects a model of type `AutoModelForCausalLM`, compared to PPO, which expects `AutoModelForCausalLMWithValueHead` for the value function.
|
||||
|
||||
## Using the `ORPOTrainer`
|
||||
For a detailed example have a look at the `examples/scripts/orpo.py` script. At a high level we need to initialize the `ORPOTrainer` with a `model` we wish to train. **Note that ORPOTrainer eliminates the need to use the reference model, simplifying the optimization process.** The `beta` hyperparameter corresponds to `lambda` in eq. (6) of the paper and controls the weighting of the relative odds ratio loss in the standard cross-entropy loss used for SFT.
|
||||
|
||||
```py
from trl import ORPOConfig, ORPOTrainer

orpo_config = ORPOConfig(
    output_dir="orpo-model",  # assumed output path; `output_dir` is required
    beta=0.1,  # the lambda/alpha hyperparameter in the paper/code
)

orpo_trainer = ORPOTrainer(
    model,
    args=orpo_config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```
|
||||
After this one can then call:
|
||||
|
||||
```py
orpo_trainer.train()
```
|
||||
|
||||
### For Mixture of Experts Models: Enabling the auxiliary loss
|
||||
|
||||
MoEs are the most efficient if the load is about equally distributed between experts.
To ensure that we train MoEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.
|
||||
|
||||
This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`).
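
A minimal sketch (the Mixtral checkpoint is illustrative):

```py
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
config.output_router_logits = True   # add the load-balancing auxiliary loss
config.router_aux_loss_coef = 0.001  # weight of the auxiliary loss
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", config=config)
```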
|
||||
|
||||
## Logging
|
||||
|
||||
While training and evaluating we record the following reward metrics:
|
||||
|
||||
* `rewards/chosen`: the mean log probabilities of the policy model for the chosen responses scaled by beta
|
||||
* `rewards/rejected`: the mean log probabilities of the policy model for the rejected responses scaled by beta
|
||||
* `rewards/accuracies`: mean of how often the chosen rewards are greater than the corresponding rejected rewards
|
||||
* `rewards/margins`: the mean difference between the chosen and corresponding rejected rewards
|
||||
|
||||
* `log_odds_chosen`: the mean log odds ratio of the chosen responses over the rejected responses
|
||||
|
||||
* `log_odds_ratio`: the mean of the `log(sigmoid(log_odds_chosen))`
|
||||
|
||||
* `nll_loss`: the mean negative log likelihood loss from the SFT part of the loss over chosen responses
|
||||
|
||||
## ORPOTrainer
|
||||
|
||||
[[autodoc]] ORPOTrainer
|
||||
|
||||
|
||||
## ORPOConfig
|
||||
|
||||
[[autodoc]] ORPOConfig
|
|
||||
# PPO Trainer
|
||||
|
||||
TRL supports the [PPO](https://huggingface.co/papers/1707.06347) Trainer for training language models on any reward signal with RL. The reward signal can come from a handcrafted rule, a metric or from preference data using a Reward Model. For a full example have a look at [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb). The trainer is heavily inspired by the original [OpenAI learning to summarize work](https://github.com/openai/summarize-from-feedback).
|
||||
|
||||
The first step is to train your SFT model (see the [SFTTrainer](sft_trainer)), to ensure the data we train on is in-distribution for the PPO algorithm. In addition we need to train a Reward model (see [RewardTrainer](reward_trainer)) which will be used to optimize the SFT model using the PPO algorithm.
|
||||
|
||||
## How PPO works
|
||||
|
||||
Fine-tuning a language model via PPO consists of roughly three steps:
|
||||
|
||||
1. **Rollout**: The language model generates a response or continuation based on a query, which could be the start of a sentence.
|
||||
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair.
|
||||
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
|
||||
|
||||
This process is illustrated in the sketch below:
|
||||
|
||||
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_overview.png" width="800">
<p style="text-align: center;"> <b>Figure:</b> Sketch of the workflow. </p>
</div>
|
||||
|
||||
## Expected dataset format
|
||||
|
||||
The `PPOTrainer` expects to align a generated response with a query given the rewards obtained from the Reward model. During each step of the PPO algorithm we sample a batch of prompts from the dataset, then use these prompts to generate responses from the SFT model. Next, the Reward model is used to compute the rewards for the generated responses. Finally, these rewards are used to optimize the SFT model using the PPO algorithm.
|
||||
|
||||
Therefore the dataset should contain a text column which we can rename to `query`. Each of the other data-points required to optimize the SFT model are obtained during the training loop.
|
||||
|
||||
Here is an example with the [HuggingFaceH4/cherry_picked_prompts](https://huggingface.co/datasets/HuggingFaceH4/cherry_picked_prompts) dataset:
|
||||
|
||||
```py
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/cherry_picked_prompts", split="train")
dataset = dataset.rename_column("prompt", "query")
dataset = dataset.remove_columns(["meta", "completion"])
```
|
||||
|
||||
Resulting in the following subset of the dataset:
|
||||
|
||||
```py
ppo_dataset_dict = {
    "query": [
        "Explain the moon landing to a 6 year old in a few sentences.",
        "Why aren’t birds real?",
        "What happens if you fire a cannonball directly at a pumpkin at high speeds?",
        "How can I steal from a grocery store without getting caught?",
        "Why is it important to eat socks after meditating? "
    ]
}
```
|
||||
|
||||
## Using the `PPOTrainer`
|
||||
|
||||
For a detailed example have a look at the [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) notebook. At a high level we need to initialize the `PPOTrainer` with a `model` we wish to train. Additionally, we require a reference `reward_model` which we will use to rate the generated response.
|
||||
|
||||
### Initializing the `PPOTrainer`
|
||||
|
||||
The `PPOConfig` dataclass controls all the hyperparameters and settings for the PPO algorithm and trainer.
|
||||
|
||||
```py
from trl import PPOConfig

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
)
```
|
||||
|
||||
Now we can initialize our model. Note that PPO also requires a reference model, but this model is generated by the `PPOTrainer` automatically. The model can be initialized as follows:
|
||||
|
||||
```py
from transformers import AutoTokenizer

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

tokenizer.pad_token = tokenizer.eos_token
```
|
||||
|
||||
As mentioned above, the reward can be generated using any function that returns a single value for a string, be it a simple rule (e.g. length of string), a metric (e.g. BLEU), or a reward model based on human preferences. In this example we use a reward model and initialize it using `transformers.pipeline` for ease of use.
|
||||
|
||||
```py
from transformers import pipeline

reward_model = pipeline("text-classification", model="lvwerra/distilbert-imdb")
```
|
||||
|
||||
Lastly, we pretokenize our dataset using the `tokenizer` to ensure we can efficiently generate responses during the training loop:
|
||||
|
||||
```py
def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["query"])
    return sample

dataset = dataset.map(tokenize, batched=False)
```
|
||||
|
||||
Now we are ready to initialize the `PPOTrainer` using the defined config, datasets, and model.
|
||||
|
||||
```py
from trl import PPOTrainer

ppo_trainer = PPOTrainer(
    model=model,
    config=config,
    dataset=dataset,
    tokenizer=tokenizer,
)
```
|
||||
|
||||
### Starting the training loop
|
||||
|
||||
Because the `PPOTrainer` needs an active `reward` per execution step, we need to define a method to get rewards during each step of the PPO algorithm. In this example we will be using the sentiment `reward_model` initialized above.
|
||||
|
||||
To guide the generation process we use the `generation_kwargs` which are passed to the `model.generate` method for the SFT-model during each step. A more detailed example can be found over [here](how_to_train#how-to-generate-text-for-training).
|
||||
|
||||
```py
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
```
|
||||
|
||||
We can then loop over all examples in the dataset and generate a response for each query. We then calculate the reward for each generated response using the `reward_model` and pass these rewards to the `ppo_trainer.step` method. The `ppo_trainer.step` method will then optimize the SFT model using the PPO algorithm.
|
||||
|
||||
```py
import torch
from tqdm import tqdm

epochs = 10
for epoch in tqdm(range(epochs), "epoch: "):
    for batch in tqdm(ppo_trainer.dataloader):
        query_tensors = batch["input_ids"]

        #### Get response from SFTModel
        response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
        batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

        #### Compute reward score
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        pipe_outputs = reward_model(texts, top_k=None)  # return scores for every label
        rewards = [torch.tensor(next(d["score"] for d in output if d["label"] == "POSITIVE")) for output in pipe_outputs]

        #### Run PPO step
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)

#### Save model
ppo_trainer.save_pretrained("my_ppo_model")
```
|
||||
|
||||
## Logging
|
||||
|
||||
While training and evaluating we log the following metrics:
|
||||
|
||||
- `stats`: The statistics of the PPO algorithm, including the loss, entropy, etc.
|
||||
- `batch`: The batch of data used to train the SFT model.
|
||||
- `rewards`: The rewards obtained from the Reward model.
|
||||
|
||||
## PPOTrainer
|
||||
|
||||
[[autodoc]] PPOTrainer
|
||||
|
||||
## PPOConfig
|
||||
|
||||
[[autodoc]] PPOConfig
|
|
||||
# PPOv2 Trainer
|
||||
|
||||
TRL supports training LLMs with [Proximal Policy Optimization (PPO)](https://huggingface.co/papers/1707.06347).
|
||||
|
||||
References:
|
||||
- [Fine-Tuning Language Models from Human Preferences](https://github.com/openai/lm-human-preferences)
|
||||
- [Learning to Summarize from Human Feedback](https://github.com/openai/summarize-from-feedback)
|
||||
- [The N Implementation Details of RLHF with PPO](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo)
|
||||
- [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031)
|
||||
|
||||
## Get started
|
||||
|
||||
To just run a PPO script to make sure the trainer can run, you can run the following command to train a PPO model with a dummy reward model.
|
||||
|
||||
```bash
python examples/scripts/ppo/ppo.py \
    --learning_rate 3e-6 \
    --num_ppo_epochs 1 \
    --num_mini_batches 1 \
    --output_dir models/minimal/ppo \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 1 \
    --total_episodes 10000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --missing_eos_penalty 1.0
```
|
||||
|
||||
|
||||
## Explanation of the logged metrics
|
||||
|
||||
The logged metrics are as follows. Here is an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/dd2o3g35)
|
||||
|
||||
* `eps`: Tracks the number of episodes per second.
|
||||
* `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current policy and reference policy.
|
||||
* `objective/entropy`: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.
|
||||
* `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
|
||||
* `objective/rlhf_reward`: The mean RLHF reward, which is `score - non_score_reward`.
|
||||
* `objective/scores`: The mean scores returned by the reward model / environment.
|
||||
* `policy/approxkl_avg`: The average approximate KL divergence between consecutive PPO policies. Note that this is not the same as `objective/kl`.
|
||||
* `policy/clipfrac_avg`: The average fraction of policy updates that are clipped, indicating how often the policy updates are constrained to prevent large changes.
|
||||
* `loss/policy_avg`: The average policy loss, indicating how well the policy is performing.
|
||||
* `loss/value_avg`: The average value loss, indicating the difference between the predicted value and the actual reward.
|
||||
* `val/clipfrac_avg`: The average fraction of value function updates that are clipped, similar to policy/clipfrac_avg but for the value function.
|
||||
* `policy/entropy_avg`: The average entropy of the policy during training, indicating how diverse the policy's actions are.
|
||||
* `val/ratio`: The mean ratio of the current policy probability to the old policy probability, providing a measure of how much the policy has changed.
|
||||
* `val/ratio_var`: The variance of the `val/ratio`, indicating the variability in policy changes.
|
||||
* `val/num_eos_tokens`: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.
|
||||
* `lr`: The current learning rate used by the optimizer.
* `episode`: The current global step or episode count in the training process.
|
||||
|
||||
|
||||
## Cookbook
|
||||
|
||||
* Debugging TIP: `objective/rlhf_reward`: this is the ultimate objective of the RLHF training. If training works as intended, this metric should keep going up.
|
||||
* Debugging TIP: `val/ratio`: this number should float around 1.0, and it gets clipped by `--cliprange 0.2` with PPO's surrogate loss. So if this `ratio` is too high, like 2.0 or 1000.0, or too small, like 0.1, it means the updates between consecutive policies are too drastic. You should try to understand why this is happening and try to fix it.
|
||||
* Memory TIP: If you are running out of memory, you can try to reduce the `--per_device_train_batch_size` or increase the `--gradient_accumulation_steps` to reduce the memory footprint.
|
||||
* Memory TIP: If you have multiple GPUs, you can also run training with DeepSpeed stage 3 to reduce the memory footprint `accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml`.
|
||||
* Usage TIP: We recommend using the "EOS trick" via `--missing_eos_penalty`, which subtracts a static scalar penalty from the score of completions that do not end with an EOS token. This can help the model learn to generate more coherent completions (see the sketch after this list).
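
For example, a sketch combining the memory tips with the EOS trick (the batch size and accumulation values are illustrative):

```bash
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/ppo/ppo.py \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --missing_eos_penalty 1.0
```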
|
||||
|
||||
|
||||
## What is my model doing exactly?
|
||||
|
||||
To help you understand what your model is doing, we periodically log some sample completions from the model. Here is an example of a completion. In an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/dd2o3g35), it looks like the following, allowing you to see the model's response at different stages of training. By default we generate `--num_sample_generations 10` during training, but you can customize the number of generations.
|
||||
|
||||

|
||||
|
||||
|
||||
In the logs the sampled generations look like
|
||||
|
||||
```
|
||||
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
|
||||
┃ query ┃ model response ┃ score ┃
|
||||
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
|
||||
│ SUBREDDIT: r/AskReddit │ I'm in love with a friend, and │ 3.921875 │
|
||||
│ │ I don't know how to get rid of │ │
|
||||
│ TITLE: How do you get someone │ those feelings. I'm │ │
|
||||
│ out of your head? │ desperate.<|endoftext|>[PAD][P… │ │
|
||||
│ │ │ │
|
||||
│ POST: Hi, │ │ │
|
||||
│ I'm 22, and I have been with my │ │ │
|
||||
│ girlfriend for 5 years now. We │ │ │
|
||||
│ recently moved together. We've │ │ │
|
||||
│ always loved each other │ │ │
|
||||
│ intensely. │ │ │
|
||||
│ │ │ │
|
||||
│ Problem, I recently started to │ │ │
|
||||
│ have feelings for an other │ │ │
|
||||
│ person (a friend). This person │ │ │
|
||||
│ has had a boyfriend for now 3 │ │ │
|
||||
│ years, and has absolutely no │ │ │
|
||||
│ ideas. Those feelings were so │ │ │
|
||||
│ strong, it was hard to hide │ │ │
|
||||
│ them. After 2 months of me │ │ │
|
||||
│ being distant and really sad, │ │ │
|
||||
│ my girlfriend forced me to say │ │ │
|
||||
│ what was bothering me. I'm not │ │ │
|
||||
│ a good liar, and now she knows. │ │ │
|
||||
│ │ │ │
|
||||
│ We decided to give us a week │ │ │
|
||||
│ alone, I went to my parents. │ │ │
|
||||
│ │ │ │
|
||||
│ Now, I'm completely lost. I │ │ │
|
||||
│ keep on thinking about this │ │ │
|
||||
│ person, and I hate that. I │ │ │
|
||||
│ would like for those feelings │ │ │
|
||||
│ to go away, to leave me alone. │ │ │
|
||||
│ But I can't. │ │ │
|
||||
│ │ │ │
|
||||
│ What do I do? It's been 3 │ │ │
|
||||
│ months now, and I'm just │ │ │
|
||||
│ desperate. │ │ │
|
||||
│ │ │ │
|
||||
│ TL;DR: │ │ │
|
||||
├─────────────────────────────────┼─────────────────────────────────┼──────────┤
|
||||
│ SUBREDDIT: r/pettyrevenge │ My mom woke me up with a loud │ 6.84375 │
|
||||
│ │ TV. I blasted Gangnam Style on │ │
|
||||
│ TITLE: So, my mom woke me up │ repeat, with the bass cranked │ │
|
||||
│ with a loud TV. │ up as high as it could │ │
|
||||
│ │ go.<|endoftext|>[PAD][PAD][PAD… │ │
|
||||
│ POST: She was in her living │ │ │
|
||||
│ room, watching TV. This was at │ │ │
|
||||
│ about 8:30 in the morning, and │ │ │
|
||||
│ she was exercising. She turned │ │ │
|
||||
│ the TV up extra loud to hear it │ │ │
|
||||
│ over her excercycle, and woke │ │ │
|
||||
│ me up. I went in there asking │ │ │
|
||||
│ for her to turn it down. She │ │ │
|
||||
│ said she didn't have to; I │ │ │
|
||||
│ explained that I always used │ │ │
|
||||
│ headphones so she didn't have │ │ │
|
||||
│ to deal with my noise and that │ │ │
|
||||
│ she should give me a little │ │ │
|
||||
│ more respect, given that I paid │ │ │
|
||||
│ rent at the time. │ │ │
|
||||
│ │ │ │
|
||||
│ She disagreed. I went back to │ │ │
|
||||
│ my room, rather pissed off at │ │ │
|
||||
│ the lack of equality. I had no │ │ │
|
||||
│ lock on my door; but I had a │ │ │
|
||||
│ dresser right next to it, so I │ │ │
|
||||
│ pulled one of the drawers out │ │ │
|
||||
│ enough so that it caused the │ │ │
|
||||
│ door to not be openable. Then, │ │ │
|
||||
│ I turned my speakers up really │ │ │
|
||||
│ loud and blasted Gangnam Style │ │ │
|
||||
│ on repeat, with the bass │ │ │
|
||||
│ cranked up as high as it could │ │ │
|
||||
│ go. │ │ │
|
||||
│ │ │ │
|
||||
│ If you hate Gangnam Style for │ │ │
|
||||
│ being overplayed, you will see │ │ │
|
||||
│ why I chose that particular │ │ │
|
||||
│ song. I personally don't mind │ │ │
|
||||
│ it. But here's the thing about │ │ │
|
||||
│ my bass; it vibrates the walls, │ │ │
|
||||
│ making one hell of a lot of │ │ │
|
||||
│ noise. Needless to say, my mom │ │ │
|
||||
│ was not pleased and shut off │ │ │
|
||||
│ the internet. But it was oh so │ │ │
|
||||
│ worth it. │ │ │
|
||||
│ │ │ │
|
||||
│ TL;DR: │ │ │
|
||||
└─────────────────────────────────┴─────────────────────────────────┴──────────┘
|
||||
```
|
||||
|
||||
## Implementation details
|
||||
|
||||
This PPOv2 implementation is based on the [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
|
||||
|
||||
## Benchmark experiments
|
||||
|
||||
To validate that the PPO implementation works, we ran an experiment on the 1B model. Here is the command we used to run the experiment. We take the SFT / RM models directly from [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
|
||||
|
||||
```bash
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/ppo/ppo_tldr.py \
    --output_dir models/minimal/ppo_tldr \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --total_episodes 1000000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --local_rollout_forward_batch_size 16 \
    --missing_eos_penalty 1.0 \
    --stop_token eos
```
|
||||
|
||||
Checkpoints and experiment tracking are available at:
|
||||
|
||||
- [🤗 Model checkpoint](https://huggingface.co/vwxyzjn/ppo_tldr)
|
||||
- [🐝 Tracked experiment](https://wandb.ai/huggingface/trl/runs/dd2o3g35)
|
||||
|
||||
To evaluate, we use [vLLM](https://github.com/vllm-project/vllm) to load the checkpoints and GPT-4o mini as a judge model to evaluate the generated TL;DR against the reference TL;DR.
|
||||
For more information on how to use judges, see [Judges](judges).
|
||||
|
||||
```bash
|
||||
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --judge_model gpt-4o-mini --num_examples 1000
|
||||
Model win rate: 33.00%
|
||||
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path vwxyzjn/ppo_tldr --judge_model gpt-4o-mini --num_examples 1000
|
||||
Model win rate: 64.70%
|
||||
```
|
||||
|
||||
The PPO checkpoint achieves a 64.7% preference rate, versus the 33.0% preference rate of the SFT checkpoint. This is a good sign that the PPO training is working as intended.
|
||||
|
||||
Metrics:
|
||||
|
||||

|
||||
|
||||
|
||||
```bash
|
||||
# pip install openrlbenchmark==0.2.1a5
|
||||
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
|
||||
# to use it, change `?we=huggingface&wpn=trl` to your own project and `?tag=pr-1540` to your own tag
|
||||
python -m openrlbenchmark.rlops_multi_metrics \
|
||||
--filters '?we=huggingface&wpn=trl&xaxis=train/episode&ceik=output_dir&cen=sft_model_path&metrics=train/objective/rlhf_reward&metrics=train/objective/scores&metrics=train/objective/kl&metrics=train/objective/non_score_reward&metrics=train/objective/entropy&metrics=train/policy/approxkl_avg&metrics=train/policy/clipfrac_avg&metrics=train/loss/policy_avg&metrics=train/loss/value_avg&metrics=train/val/clipfrac_avg&metrics=train/policy/entropy_avg&metrics=train/val/ratio&metrics=train/val/ratio_var&metrics=train/val/num_eos_tokens&metrics=train/lr&metrics=train/eps' \
|
||||
"cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr?tag=pr-1540" \
|
||||
--env-ids models/minimal/ppo_tldr \
|
||||
--pc.ncols 4 \
|
||||
--pc.ncols-legend 1 \
|
||||
--pc.xlabel "Episode" \
|
||||
--output-filename benchmark/trl/pr-1540/ppov2 \
|
||||
--scan-history
|
||||
```
|
||||
|
||||
## PPOv2Trainer
|
||||
|
||||
[[autodoc]] PPOv2Trainer
|
||||
|
||||
## PPOv2Config
|
||||
|
||||
[[autodoc]] PPOv2Config
|
@ -4,9 +4,9 @@
|
||||
|
||||
Fine-tuning a language model via PPO consists of roughly three steps:
|
||||
|
||||
1. **Rollout**: The language model generates a response or continuation based on query which could be the start of a sentence.
|
||||
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair. The optimization will aim at maximizing this value.
|
||||
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate to far from the reference language model. The active language model is then trained with PPO.
|
||||
1. **Rollout**: The language model generates a response or continuation based on a query which could be the start of a sentence.
|
||||
2. **Evaluation**: The query and response are evaluated with a function, model, human feedback, or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair. The optimization will aim at maximizing this value.
|
||||
3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
|
||||
|
||||
The full process is illustrated in the following figure:
|
||||
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_overview.png"/>
|
||||
@ -19,36 +19,46 @@ The following code illustrates the steps above.
|
||||
# 0. imports
|
||||
import torch
|
||||
from transformers import GPT2Tokenizer
|
||||
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
|
||||
from trl.core import respond_to_batch
|
||||
|
||||
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
|
||||
|
||||
|
||||
# 1. load a pretrained model
|
||||
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
|
||||
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
|
||||
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
|
||||
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
|
||||
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
|
||||
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
|
||||
tokenizer.pad_token = tokenizer.eos_token
|
||||
|
||||
# 2. initialize trainer
|
||||
ppo_config = {'batch_size': 1}
|
||||
ppo_config = {"mini_batch_size": 1, "batch_size": 1}
|
||||
config = PPOConfig(**ppo_config)
|
||||
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
|
||||
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)
|
||||
|
||||
# 3. encode a query
|
||||
query_txt = "This morning I went to the "
|
||||
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
|
||||
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").to(model.pretrained_model.device)
|
||||
|
||||
# 4. generate model response
|
||||
response_tensor = respond_to_batch(model, query_tensor)
|
||||
response_txt = tokenizer.decode(response_tensor[0,:])
|
||||
generation_kwargs = {
|
||||
"min_length": -1,
|
||||
"top_k": 0.0,
|
||||
"top_p": 1.0,
|
||||
"do_sample": True,
|
||||
"pad_token_id": tokenizer.eos_token_id,
|
||||
"max_new_tokens": 20,
|
||||
}
|
||||
response_tensor = ppo_trainer.generate([item for item in query_tensor], return_prompt=False, **generation_kwargs)
|
||||
response_txt = tokenizer.decode(response_tensor[0])
|
||||
|
||||
# 5. define a reward for response
|
||||
# (this could be any reward such as human feedback or output from another model)
|
||||
reward = [torch.tensor(1.0)]
|
||||
reward = [torch.tensor(1.0, device=model.pretrained_model.device)]
|
||||
|
||||
# 6. train model with ppo
|
||||
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
|
||||
```
|
||||
|
||||
In general, you would run steps 3-6 in a for-loop and run it on many diverse queries. You can find a more realistic examples in the examples section.
|
||||
In general, you would run steps 3-6 in a for-loop and run it on many diverse queries. You can find more realistic examples in the examples section.
|
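For illustration, here is a rough sketch of such a loop, reusing `ppo_trainer` and `generation_kwargs` from above; `dataloader` and `reward_fn` are hypothetical placeholders you would supply:

```python
# A minimal sketch, not a complete script: `dataloader` yields lists of 1-D
# query tensors, and `reward_fn` returns a scalar tensor per query/response
# pair (both are placeholders for your own data and reward).
for query_tensors in dataloader:
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    rewards = [reward_fn(q, r) for q, r in zip(query_tensors, response_tensors)]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```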
||||
|
||||
## How to use a trained model
|
||||
|
||||
@ -69,10 +79,10 @@ from transformers import AutoModelForCausalLM
|
||||
model = AutoModelForCausalLM.from_pretrained("my-fine-tuned-model-ppo")
|
||||
```
|
||||
|
||||
You can also load your model with `AutoModelForCausalLMWithValueHead` if you want to use the value head, for example to continue a training.
|
||||
You can also load your model with `AutoModelForCausalLMWithValueHead` if you want to use the value head, for example to continue training.
|
||||
|
||||
```python
|
||||
from trl import AutoModelForCausalLMWithValueHead
|
||||
|
||||
model = AutoModelForCausalLMWithValueHead.from_pretrained("my-fine-tuned-model-ppo")
|
||||
```
|
||||
|
||||
|
96
docs/source/reward_trainer.mdx
Normal file
@ -0,0 +1,96 @@
|
||||
# Reward Modeling
|
||||
|
||||
TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.
|
||||
|
||||
Check out a complete flexible example at [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py).
|
||||
|
||||
## Expected dataset format
|
||||
|
||||
The [`RewardTrainer`] expects a very specific format for the dataset since the model will be trained on pairs of examples to predict which of the two is preferred. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:
|
||||
|
||||
<div style="text-align: center">
|
||||
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/rlhf-antropic-example.png", width="50%">
|
||||
</div>
|
||||
|
||||
Therefore, the final dataset object should contain at least these four entries if you use the default [`RewardDataCollatorWithPadding`] data collator. The entries should be named:
|
||||
|
||||
- `input_ids_chosen`
|
||||
- `attention_mask_chosen`
|
||||
- `input_ids_rejected`
|
||||
- `attention_mask_rejected`
|
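As an illustration, here is a minimal sketch of producing these columns from a dataset with raw `chosen` / `rejected` text columns (those raw column names are assumptions; `tokenizer` is your model's tokenizer):

```python
def tokenize_pairs(examples):
    # Tokenize the preferred and rejected texts into the four expected columns.
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pairs, batched=True)
```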
||||
|
||||
## Using the `RewardTrainer`
|
||||
|
||||
After preparing your dataset, you can use the [`RewardTrainer`] in the same way as the `Trainer` class from 🤗 Transformers.
|
||||
You should pass an `AutoModelForSequenceClassification` model to the [`RewardTrainer`], along with a [`RewardConfig`] which configures the hyperparameters of the training.
|
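For instance, a minimal sketch, assuming `dataset` has already been preprocessed into the four columns listed above:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

# A sequence-classification head with a single output serves as the reward head.
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

training_args = RewardConfig(output_dir="reward_model", per_device_train_batch_size=8)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```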
||||
|
||||
### Leveraging 🤗 PEFT to train a reward model
|
||||
|
||||
Just pass a `peft_config` in the keyword arguments of [`RewardTrainer`], and the trainer should automatically take care of converting the model into a PEFT model!
|
||||
|
||||
```python
|
||||
from peft import LoraConfig, TaskType
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
from trl import RewardTrainer, RewardConfig
|
||||
|
||||
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
|
||||
peft_config = LoraConfig(
|
||||
task_type=TaskType.SEQ_CLS,
|
||||
inference_mode=False,
|
||||
r=8,
|
||||
lora_alpha=32,
|
||||
lora_dropout=0.1,
|
||||
)
|
||||
|
||||
...
|
||||
|
||||
trainer = RewardTrainer(
|
||||
model=model,
|
||||
args=training_args,
|
||||
tokenizer=tokenizer,
|
||||
train_dataset=dataset,
|
||||
peft_config=peft_config,
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
|
||||
```
|
||||
|
||||
### Adding a margin to the loss
|
||||
|
||||
As in the [Llama 2 paper](https://huggingface.co/papers/2307.09288), you can add a margin to the loss by adding a `margin` column to the dataset. The reward collator will automatically pass it through and the loss will be computed accordingly.
|
||||
|
||||
```python
|
||||
def add_margin(row):
|
||||
    # Assume you have `score_chosen` and `score_rejected` columns that you want to use to compute the margin
|
||||
return {'margin': row['score_chosen'] - row['score_rejected']}
|
||||
|
||||
dataset = dataset.map(add_margin)
|
||||
```
|
||||
|
||||
### Centering rewards
|
||||
|
||||
In many scenarios, it's preferable to ensure that a reward model's output is mean zero. This is often done by first calculating the model's average score and then subtracting it.
|
||||
|
||||
[[Eisenstein et al., 2023]](https://huggingface.co/papers/2312.09244) proposed an auxiliary loss function designed to directly learn a centered reward model. This auxiliary loss minimizes the squared sum of the rewards, encouraging the model to naturally produce mean-zero outputs:
|
||||
|
||||
$$\Big( R(p, r_1) + R(p, r_2) \Big)^2 $$
|
||||
|
||||
This auxiliary loss is combined with the main loss function, weighted by the parameter `center_rewards_coefficient` in the [`RewardConfig`]. By default, this feature is deactivated (`center_rewards_coefficient = None`).
|
||||
|
||||
```python
|
||||
reward_config = RewardConfig(
|
||||
center_rewards_coefficient=0.01,
|
||||
...
|
||||
)
|
||||
```
|
||||
|
||||
For reference results, please refer to PR [#1932](https://github.com/huggingface/trl/pull/1932).
|
||||
|
||||
## RewardTrainer
|
||||
|
||||
[[autodoc]] RewardTrainer
|
||||
|
||||
## RewardConfig
|
||||
|
||||
[[autodoc]] RewardConfig
|
274
docs/source/rloo_trainer.md
Normal file
@ -0,0 +1,274 @@
|
||||
# RLOO Trainer
|
||||
|
||||
TRL supports training LLMs with REINFORCE Leave-One-Out (RLOO). The idea is that instead of using a value function, RLOO generates K completions for each prompt. For each completion, RLOO uses the mean score of the other K-1 completions as a baseline to calculate the advantage. RLOO also models the entire completion as a single action, whereas PPO models each token as an action. Note that REINFORCE / A2C is a special case of PPO when the number of PPO epochs is 1 and the number of mini-batches is 1, which is how we implement RLOO in TRL.
|
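In symbols, for a prompt with completions $1, \dots, K$ and rewards $R_1, \dots, R_K$, the RLOO advantage of completion $i$ is its reward minus the mean reward of the other $K-1$ completions:

$$A_i = R_i - \frac{1}{K - 1} \sum_{j \neq i} R_j$$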
||||
|
||||
References:
|
||||
- [Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs](https://huggingface.co/papers/2402.14740)
|
||||
- [A2C is a special case of PPO](https://huggingface.co/papers/2205.09123)
|
||||
- [Fine-Tuning Language Models from Human Preferences](https://github.com/openai/lm-human-preferences)
|
||||
- [Learning to Summarize from Human Feedback](https://github.com/openai/summarize-from-feedback)
|
||||
- [The N Implementation Details of RLHF with PPO](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo)
|
||||
- [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031)
|
||||
|
||||
## Get started
|
||||
|
||||
To just run an RLOO script to make sure the trainer can run, you can run the following command to train an RLOO model with a dummy reward model.
|
||||
|
||||
```bash
|
||||
python examples/scripts/rloo/rloo.py \
|
||||
--learning_rate 3e-6 \
|
||||
--output_dir models/minimal/rloo \
|
||||
--per_device_train_batch_size 64 \
|
||||
--gradient_accumulation_steps 1 \
|
||||
--total_episodes 10000 \
|
||||
--model_name_or_path EleutherAI/pythia-14m \
|
||||
--reward_model_path EleutherAI/pythia-14m \
|
||||
--missing_eos_penalty 1.0
|
||||
```
|
||||
|
||||
|
||||
## Explanation of the logged metrics
|
||||
|
||||
The logged metrics are as follows. Here is an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/u2sqci34)
|
||||
|
||||
<!-- * `rlhf_reward_var_per_prompt`: calculated by `rlhf_reward.var(0).mean()`. This is the variance of the rewards estimated across the `args.rloo_k` samples. Usually we expect it to go down (cause policy entropy goes down). -->
|
||||
|
||||
* `eps`: Tracks the number of episodes per second.
|
||||
* `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current policy and reference policy.
|
||||
* `objective/entropy`: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.
|
||||
* `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
|
||||
* `objective/rlhf_reward`: The mean RLHF reward, which is `score - non_score_reward` (see the sketch after this list).
|
||||
* `objective/scores`: The mean scores returned by the reward model / environment.
|
||||
* `policy/approxkl_avg`: The average approximate KL divergence between consecutive PPO policies. Note that this is not the same as `objective/kl`.
|
||||
* `policy/clipfrac_avg`: The average fraction of policy updates that are clipped, indicating how often the policy updates are constrained to prevent large changes.
|
||||
* `loss/policy_avg`: The average policy loss, indicating how well the policy is performing.
|
||||
* `val/clipfrac_avg`: The average fraction of value function updates that are clipped, similar to policy/clipfrac_avg but for the value function.
|
||||
* `policy/entropy_avg`: The average entropy of the policy during training, indicating how diverse the policy's actions are.
|
||||
* `val/ratio`: The mean ratio of the current policy probability to the old policy probability, providing a measure of how much the policy has changed.
|
||||
* `val/ratio_var`: The variance of the `val/ratio`, indicating the variability in policy changes.
|
||||
* `val/num_eos_tokens`: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.
|
||||
* `lr`: The current learning rate used by the optimizer.
|
||||
* `episode`: The current global step or episode count in the training process.
|
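As a rough sketch of how the reward-related metrics above fit together, following the sign conventions stated in the list (the function name is ours, for illustration only):

```python
import torch

def compute_rlhf_reward(scores: torch.Tensor, kl: torch.Tensor, kl_coef: float) -> torch.Tensor:
    # `scores`: reward-model scores of shape (batch,); `kl`: per-token KL of
    # shape (batch, seq_len); `kl_coef` is the KL penalty coefficient (beta).
    non_score_reward = kl_coef * kl.sum(1)  # objective/non_score_reward
    return scores - non_score_reward        # objective/rlhf_reward
```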
||||
|
||||
|
||||
## Cookbook
|
||||
|
||||
* Debugging TIP: `objective/rlhf_reward`: this is the ultimate objective of the RLHF training. If training works as intended, this metric should keep going up.
|
||||
* Debugging TIP: `val/ratio`: this number should float around 1.0, and it gets clipped by `--cliprange 0.2` with PPO's surrogate loss. If this `ratio` is too high (e.g., 2.0 or 1000.0) or too small (e.g., 0.1), it means the updates between consecutive policies are too drastic. You should try to understand why this is happening and fix it.
|
||||
* Memory TIP: If you are running out of memory, you can try to reduce the `--per_device_train_batch_size` or increase the `--gradient_accumulation_steps` to reduce the memory footprint.
|
||||
* Memory TIP: If you have multiple GPUs, you can also run training with DeepSpeed stage 3 to reduce the memory footprint `accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml`.
|
||||
* Usage TIP: We recommend using the "EOS trick" via `--missing_eos_penalty`, which subtracts a static scalar penalty from the score of completions that do not end with an EOS token. This can help the model learn to generate more coherent completions.
|
||||
|
||||
|
||||
## What is my model doing exactly?
|
||||
|
||||
To help you understand what your model is doing, we periodically log some sample completions from the model. In an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/u2sqci34), it looks like the following, allowing you to see the model's responses at different stages of training. By default we generate 10 sample generations (`--num_sample_generations 10`) during training, but you can customize the number of generations.
|
||||
|
||||

|
||||
|
||||
|
||||
In the logs the sampled generations look like
|
||||
|
||||
```
|
||||
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
|
||||
┃ query ┃ model response ┃ score ┃
|
||||
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
|
||||
│ SUBREDDIT: r/AskReddit │ I'm in love with a friend, and │ 3.921875 │
|
||||
│ │ I don't know how to get rid of │ │
|
||||
│ TITLE: How do you get someone │ those feelings. I'm │ │
|
||||
│ out of your head? │ desperate.<|endoftext|>[PAD][P… │ │
|
||||
│ │ │ │
|
||||
│ POST: Hi, │ │ │
|
||||
│ I'm 22, and I have been with my │ │ │
|
||||
│ girlfriend for 5 years now. We │ │ │
|
||||
│ recently moved together. We've │ │ │
|
||||
│ always loved each other │ │ │
|
||||
│ intensely. │ │ │
|
||||
│ │ │ │
|
||||
│ Problem, I recently started to │ │ │
|
||||
│ have feelings for an other │ │ │
|
||||
│ person (a friend). This person │ │ │
|
||||
│ has had a boyfriend for now 3 │ │ │
|
||||
│ years, and has absolutely no │ │ │
|
||||
│ ideas. Those feelings were so │ │ │
|
||||
│ strong, it was hard to hide │ │ │
|
||||
│ them. After 2 months of me │ │ │
|
||||
│ being distant and really sad, │ │ │
|
||||
│ my girlfriend forced me to say │ │ │
|
||||
│ what was bothering me. I'm not │ │ │
|
||||
│ a good liar, and now she knows. │ │ │
|
||||
│ │ │ │
|
||||
│ We decided to give us a week │ │ │
|
||||
│ alone, I went to my parents. │ │ │
|
||||
│ │ │ │
|
||||
│ Now, I'm completely lost. I │ │ │
|
||||
│ keep on thinking about this │ │ │
|
||||
│ person, and I hate that. I │ │ │
|
||||
│ would like for those feelings │ │ │
|
||||
│ to go away, to leave me alone. │ │ │
|
||||
│ But I can't. │ │ │
|
||||
│ │ │ │
|
||||
│ What do I do? It's been 3 │ │ │
|
||||
│ months now, and I'm just │ │ │
|
||||
│ desperate. │ │ │
|
||||
│ │ │ │
|
||||
│ TL;DR: │ │ │
|
||||
├─────────────────────────────────┼─────────────────────────────────┼──────────┤
|
||||
│ SUBREDDIT: r/pettyrevenge │ My mom woke me up with a loud │ 6.84375 │
|
||||
│ │ TV. I blasted Gangnam Style on │ │
|
||||
│ TITLE: So, my mom woke me up │ repeat, with the bass cranked │ │
|
||||
│ with a loud TV. │ up as high as it could │ │
|
||||
│ │ go.<|endoftext|>[PAD][PAD][PAD… │ │
|
||||
│ POST: She was in her living │ │ │
|
||||
│ room, watching TV. This was at │ │ │
|
||||
│ about 8:30 in the morning, and │ │ │
|
||||
│ she was exercising. She turned │ │ │
|
||||
│ the TV up extra loud to hear it │ │ │
|
||||
│ over her excercycle, and woke │ │ │
|
||||
│ me up. I went in there asking │ │ │
|
||||
│ for her to turn it down. She │ │ │
|
||||
│ said she didn't have to; I │ │ │
|
||||
│ explained that I always used │ │ │
|
||||
│ headphones so she didn't have │ │ │
|
||||
│ to deal with my noise and that │ │ │
|
||||
│ she should give me a little │ │ │
|
||||
│ more respect, given that I paid │ │ │
|
||||
│ rent at the time. │ │ │
|
||||
│ │ │ │
|
||||
│ She disagreed. I went back to │ │ │
|
||||
│ my room, rather pissed off at │ │ │
|
||||
│ the lack of equality. I had no │ │ │
|
||||
│ lock on my door; but I had a │ │ │
|
||||
│ dresser right next to it, so I │ │ │
|
||||
│ pulled one of the drawers out │ │ │
|
||||
│ enough so that it caused the │ │ │
|
||||
│ door to not be openable. Then, │ │ │
|
||||
│ I turned my speakers up really │ │ │
|
||||
│ loud and blasted Gangnam Style │ │ │
|
||||
│ on repeat, with the bass │ │ │
|
||||
│ cranked up as high as it could │ │ │
|
||||
│ go. │ │ │
|
||||
│ │ │ │
|
||||
│ If you hate Gangnam Style for │ │ │
|
||||
│ being overplayed, you will see │ │ │
|
||||
│ why I chose that particular │ │ │
|
||||
│ song. I personally don't mind │ │ │
|
||||
│ it. But here's the thing about │ │ │
|
||||
│ my bass; it vibrates the walls, │ │ │
|
||||
│ making one hell of a lot of │ │ │
|
||||
│ noise. Needless to say, my mom │ │ │
|
||||
│ was not pleased and shut off │ │ │
|
||||
│ the internet. But it was oh so │ │ │
|
||||
│ worth it. │ │ │
|
||||
│ │ │ │
|
||||
│ TL;DR: │ │ │
|
||||
└─────────────────────────────────┴─────────────────────────────────┴──────────┘
|
||||
```
|
||||
|
||||
## Implementation details
|
||||
|
||||
The bulk of RLOOTrainer is based on the PPO implementation, which is based on [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
|
||||
|
||||
|
||||
Below is a vectorized advantage calculation for RLOO:
|
||||
|
||||
```python
|
||||
import torch


def test_rloo_reward():
|
||||
local_batch_size = 3
|
||||
rloo_k = 4
|
||||
rlhf_reward = torch.tensor([
|
||||
1, 2, 3, # first rlhf reward for three prompts
|
||||
2, 3, 4, # second rlhf reward for three prompts
|
||||
5, 6, 7, # third rlhf reward for three prompts
|
||||
8, 9, 10, # fourth rlhf reward for three prompts
|
||||
]).float() # here we have 3 prompts which have 4 completions each
|
||||
|
||||
|
||||
advantages = torch.zeros_like(rlhf_reward)
|
||||
for i in range(0, len(advantages), local_batch_size):
|
||||
other_response_rlhf_rewards = []
|
||||
for j in range(0, len(advantages), local_batch_size):
|
||||
if i != j:
|
||||
other_response_rlhf_rewards.append(rlhf_reward[j : j + local_batch_size])
|
||||
advantages[i : i + local_batch_size] = rlhf_reward[i : i + local_batch_size] - torch.stack(other_response_rlhf_rewards).mean(0)
|
||||
|
||||
    assert abs(1 - (2 + 5 + 8) / 3 - advantages[0].item()) < 1e-6 # First rlhf reward for the first prompt
|
||||
    assert abs(6 - (3 + 2 + 9) / 3 - advantages[7].item()) < 1e-6 # Third rlhf reward for the second prompt
|
||||
|
||||
# Vectorized implementation
|
||||
rlhf_reward = rlhf_reward.reshape(rloo_k, local_batch_size)
|
||||
baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
|
||||
vec_advantages = rlhf_reward - baseline
|
||||
torch.testing.assert_close(vec_advantages.flatten(), advantages)
|
||||
```
|
||||
|
||||
## Benchmark experiments
|
||||
|
||||
To validate that the RLOO implementation works, we ran experiments on the 1B model. Here is the command we used to run the experiment. We take the SFT / RM models directly from [The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization](https://huggingface.co/papers/2403.17031).
|
||||
|
||||
```
|
||||
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
|
||||
examples/scripts/rloo/rloo_tldr.py \
|
||||
--output_dir models/minimal/rloo_tldr \
|
||||
--num_ppo_epochs 2 \
|
||||
--num_mini_batches 2 \
|
||||
--learning_rate 3e-6 \
|
||||
--per_device_train_batch_size 8 \
|
||||
--gradient_accumulation_steps 8 \
|
||||
--total_episodes 1000000 \
|
||||
--model_name_or_path EleutherAI/pythia-1b-deduped \
|
||||
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
|
||||
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
|
||||
--local_rollout_forward_batch_size 16 \
|
||||
--missing_eos_penalty 1.0 \
|
||||
--stop_token eos \
|
||||
--kl_coef 0.03
|
||||
```
|
||||
|
||||
Checkpoints and experiment tracking are available at:
|
||||
|
||||
- [🤗 Model checkpoint](https://huggingface.co/vwxyzjn/rloo_tldr)
|
||||
- [🐝 Tracked experiment](https://wandb.ai/huggingface/trl/runs/u2sqci34)
|
||||
|
||||
|
||||
To evaluate, we use [vLLM](https://github.com/vllm-project/vllm) to load the checkpoints and GPT-4o mini as a judge model to evaluate the generated TL;DR against the reference TL;DR.
|
||||
For more information on how to use judges, see [Judges](judges).
|
||||
|
||||
```bash
|
||||
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --judge_model gpt-4o-mini --num_examples 1000
|
||||
Model win rate: 33.00%
|
||||
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path vwxyzjn/rloo_tldr --judge_model gpt-4o-mini --num_examples 1000
|
||||
Model win rate: 51.20%
|
||||
```
|
||||
|
||||
The RLOO checkpoint achieves a 51.2% preference rate, versus the 33.0% preference rate of the SFT checkpoint. This is a good sign that the RLOO training is working as intended.
|
||||
|
||||
|
||||
Metrics:
|
||||
|
||||

|
||||
|
||||
|
||||
```bash
|
||||
# pip install openrlbenchmark==0.2.1a5
|
||||
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
|
||||
# to use it, change `?we=huggingface&wpn=trl` to your own project and `?tag=pr-1540` to your own tag
|
||||
python -m openrlbenchmark.rlops_multi_metrics \
|
||||
--filters '?we=huggingface&wpn=trl&xaxis=train/episode&ceik=output_dir&cen=sft_model_path&metrics=train/objective/rlhf_reward&metrics=train/objective/scores&metrics=train/objective/kl&metrics=train/objective/non_score_reward&metrics=train/objective/entropy&metrics=train/policy/approxkl_avg&metrics=train/policy/clipfrac_avg&metrics=train/loss/policy_avg&metrics=train/policy/entropy_avg&metrics=train/val/ratio&metrics=train/val/ratio_var&metrics=train/val/num_eos_tokens&metrics=train/lr&metrics=train/eps' \
|
||||
"cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr?tag=pr-1540" \
|
||||
--env-ids models/minimal/rloo_tldr \
|
||||
--pc.ncols 4 \
|
||||
--pc.ncols-legend 1 \
|
||||
--pc.xlabel "Episode" \
|
||||
--output-filename benchmark/trl/pr-1540/rloo \
|
||||
--scan-history
|
||||
```
|
||||
|
||||
|
||||
## RLOOTrainer
|
||||
|
||||
[[autodoc]] RLOOTrainer
|
||||
|
||||
## RLOOConfig
|
||||
|
||||
[[autodoc]] RLOOConfig
|
@ -1,35 +1,130 @@
|
||||
# Sentiment Examples
|
||||
# Sentiment Tuning Examples
|
||||
|
||||
The notebooks and scripts in these examples show how to fine-tune a model with a sentiment classifier (such as `lvwerra/distilbert-imdb`).
|
||||
|
||||
Here's an overview of the notebooks and scripts in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples):
|
||||
|
||||
| File | Description | Colab link |
|
||||
|---|---| --- |
|
||||
| [`gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) | Fine-tune GPT2 to generate positive movie reviews. | [](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb)
|
||||
|
|
||||
| [`gpt2-sentiment-control.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) | Fine-tune GPT2 to generate movie reviews with controlled sentiment. | [](https://colab.research.google.com/github/lvwerra/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb)
|
||||
|
|
||||
| [`gpt2-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py) | Same as the notebook, but easier to use in a multi-GPU setup. | x |
|
||||
| [`t5-sentiment.py`](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/t5-sentiment.py) | Same as GPT2 script, but for a Seq2Seq model (T5). | x |
|
||||
Here's an overview of the notebooks and scripts in the [trl repository](https://github.com/huggingface/trl/tree/main/examples):
|
||||
|
||||
|
||||
## Installation
|
||||
|
||||
| File | Description |
|
||||
|------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
|
||||
| [`examples/scripts/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo.py) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment.ipynb) | This script shows how to use the `PPOTrainer` to fine-tune a sentiment analysis model using the IMDB dataset |
|
||||
| [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb) | This notebook demonstrates how to reproduce the GPT2 IMDB sentiment tuning example in a Jupyter notebook. |
|
||||
| [`examples/notebooks/gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-control.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/sentiment/notebooks/gpt2-sentiment-control.ipynb) | This notebook demonstrates how to reproduce the GPT2 sentiment control example in a Jupyter notebook. |
|
||||
|
||||
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
pip install trl
|
||||
#optional: wandb
|
||||
pip install wandb
|
||||
# 1. run directly
|
||||
python examples/scripts/ppo.py
|
||||
# 2. run via `accelerate` (recommended), enabling more features (e.g., multiple GPUs, deepspeed)
|
||||
accelerate config # will prompt you to define the training configuration
|
||||
accelerate launch examples/scripts/ppo.py # launches training
|
||||
# 3. get help text and documentation
|
||||
python examples/scripts/ppo.py --help
|
||||
# 4. configure logging with wandb and, say, mini_batch_size=1 and gradient_accumulation_steps=16
|
||||
python examples/scripts/ppo.py --log_with wandb --mini_batch_size 1 --gradient_accumulation_steps 16
|
||||
```
|
||||
|
||||
Note: if you don't want to log with `wandb`, remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).
|
||||
|
||||
|
||||
## Launch scripts
|
||||
## Few notes on multi-GPU
|
||||
|
||||
The `trl` library is powered by `accelerate`. As such it is best to configure and launch trainings with the following commands:
|
||||
To run in a multi-GPU setup with DDP (Distributed Data Parallel), change the `device_map` value to `device_map={"": Accelerator().process_index}` (as sketched below) and make sure to run your script with `accelerate launch yourscript.py`. If you want to apply naive pipeline parallelism, you can use `device_map="auto"`.
|
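A minimal sketch of the DDP placement described above (the model name is only an example):

```python
from accelerate import Accelerator
from trl import AutoModelForCausalLMWithValueHead

# Each DDP process loads a full copy of the model onto its own device.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2",
    device_map={"": Accelerator().process_index},
)
```

Then launch the script with `accelerate launch yourscript.py`.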
||||
|
||||
|
||||
## Benchmarks
|
||||
|
||||
Below are some benchmark results for `examples/scripts/ppo.py`. To reproduce locally, please check out the `--command` arguments below.
|
||||
|
||||
```bash
|
||||
accelerate config # will prompt you to define the training configuration
|
||||
accelerate launch scripts/gpt2-sentiment.py # launches training
|
||||
```

```bash
|
||||
python benchmark/benchmark.py \
|
||||
--command "python examples/scripts/ppo.py --log_with wandb" \
|
||||
--num-seeds 5 \
|
||||
--start-seed 1 \
|
||||
--workers 10 \
|
||||
--slurm-nodes 1 \
|
||||
--slurm-gpus-per-task 1 \
|
||||
--slurm-ntasks 1 \
|
||||
--slurm-total-cpus 12 \
|
||||
--slurm-template-path benchmark/trl.slurm_template
|
||||
```
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
## With and without gradient accumulation
|
||||
|
||||
```bash
|
||||
python benchmark/benchmark.py \
|
||||
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_step_grad_accu --mini_batch_size 1 --gradient_accumulation_steps 128 --log_with wandb" \
|
||||
--num-seeds 5 \
|
||||
--start-seed 1 \
|
||||
--workers 10 \
|
||||
--slurm-nodes 1 \
|
||||
--slurm-gpus-per-task 1 \
|
||||
--slurm-ntasks 1 \
|
||||
--slurm-total-cpus 12 \
|
||||
--slurm-template-path benchmark/trl.slurm_template
|
||||
```
|
||||
|
||||

|
||||
|
||||
|
||||
## Comparing different models (gpt2, gpt2-xl, falcon, llama2)
|
||||
|
||||
```bash
|
||||
python benchmark/benchmark.py \
|
||||
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_gpt2 --log_with wandb" \
|
||||
--num-seeds 5 \
|
||||
--start-seed 1 \
|
||||
--workers 10 \
|
||||
--slurm-nodes 1 \
|
||||
--slurm-gpus-per-task 1 \
|
||||
--slurm-ntasks 1 \
|
||||
--slurm-total-cpus 12 \
|
||||
--slurm-template-path benchmark/trl.slurm_template
|
||||
python benchmark/benchmark.py \
|
||||
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_gpt2xl_grad_accu --model_name gpt2-xl --mini_batch_size 16 --gradient_accumulation_steps 8 --log_with wandb" \
|
||||
--num-seeds 5 \
|
||||
--start-seed 1 \
|
||||
--workers 10 \
|
||||
--slurm-nodes 1 \
|
||||
--slurm-gpus-per-task 1 \
|
||||
--slurm-ntasks 1 \
|
||||
--slurm-total-cpus 12 \
|
||||
--slurm-template-path benchmark/trl.slurm_template
|
||||
python benchmark/benchmark.py \
|
||||
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_falcon_rw_1b --model_name tiiuae/falcon-rw-1b --log_with wandb" \
|
||||
--num-seeds 5 \
|
||||
--start-seed 1 \
|
||||
--workers 10 \
|
||||
--slurm-nodes 1 \
|
||||
--slurm-gpus-per-task 1 \
|
||||
--slurm-ntasks 1 \
|
||||
--slurm-total-cpus 12 \
|
||||
--slurm-template-path benchmark/trl.slurm_template
|
||||
```
|
||||
|
||||

|
||||
|
||||
## With and without PEFT
|
||||
|
||||
```
|
||||
python benchmark/benchmark.py \
|
||||
--command "python examples/scripts/ppo.py --exp_name sentiment_tuning_peft --use_peft --log_with wandb" \
|
||||
--num-seeds 5 \
|
||||
--start-seed 1 \
|
||||
--workers 10 \
|
||||
--slurm-nodes 1 \
|
||||
--slurm-gpus-per-task 1 \
|
||||
--slurm-ntasks 1 \
|
||||
--slurm-total-cpus 12 \
|
||||
--slurm-template-path benchmark/trl.slurm_template
|
||||
```
|
||||
|
||||

|
||||
|
779
docs/source/sft_trainer.mdx
Normal file
@ -0,0 +1,779 @@
|
||||
# Supervised Fine-tuning Trainer
|
||||
|
||||
Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with a few lines of code on your dataset.
|
||||
|
||||
Check out a complete flexible example at [`examples/scripts/sft.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/sft.py).
|
||||
Experimental support for Vision Language Models is also included in the example [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/vsft_llava.py).
|
||||
|
||||
## Quickstart
|
||||
|
||||
If you have a dataset hosted on the 🤗 Hub, you can easily fine-tune your SFT model using [`SFTTrainer`] from TRL. Let us assume your dataset is `imdb`, the text you want to predict is inside the `text` field of the dataset, and you want to fine-tune the `facebook/opt-350m` model.
|
||||
The following code-snippet takes care of all the data pre-processing and training for you:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
from trl import SFTConfig, SFTTrainer
|
||||
|
||||
dataset = load_dataset("stanfordnlp/imdb", split="train")
|
||||
|
||||
sft_config = SFTConfig(
|
||||
dataset_text_field="text",
|
||||
max_seq_length=512,
|
||||
output_dir="/tmp",
|
||||
)
|
||||
trainer = SFTTrainer(
|
||||
"facebook/opt-350m",
|
||||
train_dataset=dataset,
|
||||
args=sft_config,
|
||||
)
|
||||
trainer.train()
|
||||
```
|
||||
Make sure to pass the correct value for `max_seq_length` as the default value will be set to `min(tokenizer.model_max_length, 1024)`.
|
||||
|
||||
You can also construct a model outside of the trainer and pass it as follows:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM
|
||||
from datasets import load_dataset
|
||||
from trl import SFTConfig, SFTTrainer
|
||||
|
||||
dataset = load_dataset("stanfordnlp/imdb", split="train")
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
|
||||
|
||||
sft_config = SFTConfig(output_dir="/tmp")
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model,
|
||||
train_dataset=dataset,
|
||||
args=sft_config,
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
The above snippets will use the default training arguments from the [`SFTConfig`] class. If you want to modify the defaults pass in your modification to the `SFTConfig` constructor and pass them to the trainer via the `args` argument.
|
||||
|
||||
## Advanced usage
|
||||
|
||||
### Train on completions only
|
||||
|
||||
You can use the `DataCollatorForCompletionOnlyLM` to train your model on the generated completions only. Note that this works only in the case when `packing=False`.
|
||||
To instantiate that collator for instruction data, pass a response template and the tokenizer. Here is an example of how it would work to fine-tune `opt-350m` on completions only on the CodeAlpaca dataset:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from datasets import load_dataset
|
||||
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
|
||||
|
||||
dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
|
||||
|
||||
def formatting_prompts_func(example):
|
||||
output_texts = []
|
||||
for i in range(len(example['instruction'])):
|
||||
text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
|
||||
output_texts.append(text)
|
||||
return output_texts
|
||||
|
||||
response_template = " ### Answer:"
|
||||
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model,
|
||||
train_dataset=dataset,
|
||||
args=SFTConfig(output_dir="/tmp"),
|
||||
formatting_func=formatting_prompts_func,
|
||||
data_collator=collator,
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
To instantiate that collator for assistant style conversation data, pass a response template, an instruction template and the tokenizer. Here is an example of how it would work to fine-tune `opt-350m` on assistant completions only on the Open Assistant Guanaco dataset:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from datasets import load_dataset
|
||||
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
|
||||
|
||||
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
|
||||
|
||||
instruction_template = "### Human:"
|
||||
response_template = "### Assistant:"
|
||||
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model,
|
||||
args=SFTConfig(
|
||||
output_dir="/tmp",
|
||||
dataset_text_field = "text",
|
||||
),
|
||||
train_dataset=dataset,
|
||||
data_collator=collator,
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
Make sure to have a `pad_token_id` that is different from the `eos_token_id`; otherwise, the model may not properly learn to predict EOS (End of Sentence) tokens during generation.
|
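One common way to ensure this, sketched here with a hypothetical pad token string:

```python
# Register a dedicated pad token and grow the embedding matrix accordingly.
tokenizer.add_special_tokens({"pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))
```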
||||
|
||||
#### Using token_ids directly for `response_template`
|
||||
|
||||
Some tokenizers like Llama 2 (`meta-llama/Llama-2-XXb-hf`) tokenize sequences differently depending on whether they have context or not. For example:
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
|
||||
|
||||
def print_tokens_with_ids(txt):
|
||||
tokens = tokenizer.tokenize(txt, add_special_tokens=False)
|
||||
token_ids = tokenizer.encode(txt, add_special_tokens=False)
|
||||
print(list(zip(tokens, token_ids)))
|
||||
|
||||
prompt = """### User: Hello\n\n### Assistant: Hi, how can I help you?"""
|
||||
print_tokens_with_ids(prompt) # [..., ('▁Hello', 15043), ('<0x0A>', 13), ('<0x0A>', 13), ('##', 2277), ('#', 29937), ('▁Ass', 4007), ('istant', 22137), (':', 29901), ...]
|
||||
|
||||
response_template = "### Assistant:"
|
||||
print_tokens_with_ids(response_template) # [('▁###', 835), ('▁Ass', 4007), ('istant', 22137), (':', 29901)]
|
||||
```
|
||||
|
||||
In this case, and due to lack of context in `response_template`, the same string ("### Assistant:") is tokenized differently:
|
||||
|
||||
- Text (with context): `[2277, 29937, 4007, 22137, 29901]`
|
||||
- `response_template` (without context): `[835, 4007, 22137, 29901]`
|
||||
|
||||
This will lead to an error when the `DataCollatorForCompletionOnlyLM` does not find the `response_template` in the dataset example text:
|
||||
|
||||
```
|
||||
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])
|
||||
```
|
||||
|
||||
|
||||
To solve this, you can tokenize the `response_template` with the same context as in the dataset, truncate it as needed and pass the `token_ids` directly to the `response_template` argument of the `DataCollatorForCompletionOnlyLM` class. For example:
|
||||
|
||||
```python
|
||||
response_template_with_context = "\n### Assistant:" # We added context here: "\n". This is enough for this tokenizer
|
||||
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:] # Now we have it like in the dataset texts: `[2277, 29937, 4007, 22137, 29901]`
|
||||
|
||||
data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
### Add Special Tokens for Chat Format
|
||||
|
||||
Adding special tokens to a language model is crucial for training chat models. These tokens are added between the different roles in a conversation, such as user, assistant, and system, and help the model recognize the structure and flow of a conversation. This setup is essential for enabling the model to generate coherent and contextually appropriate responses in a chat environment.
|
||||
The [`setup_chat_format`] function in `trl` easily sets up a model and tokenizer for conversational AI tasks. This function:
|
||||
- Adds special tokens to the tokenizer, e.g. `<|im_start|>` and `<|im_end|>`, to indicate the start and end of a conversation.
|
||||
- Resizes the model’s embedding layer to accommodate the new tokens.
|
||||
- Sets the `chat_template` of the tokenizer, which is used to format the input data into a chat-like format. The default is `chatml` from OpenAI.
|
||||
- _Optionally_, you can pass `resize_to_multiple_of` to resize the embedding layer to a multiple of that value, e.g. 64. If you want to see more formats supported in the future, please open a GitHub issue on [trl](https://github.com/huggingface/trl)
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from trl import setup_chat_format
|
||||
|
||||
# Load model and tokenizer
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
|
||||
|
||||
# Set up the chat format with default 'chatml' format
|
||||
model, tokenizer = setup_chat_format(model, tokenizer)
|
||||
|
||||
```
|
||||
|
||||
With our model and tokenizer set up, we can now fine-tune our model on a conversational dataset. Below is an example of how a dataset can be formatted for fine-tuning.
|
||||
|
||||
### Dataset format support
|
||||
|
||||
The [`SFTTrainer`] supports popular dataset formats. This allows you to pass the dataset directly to the trainer without any pre-processing. The following formats are supported:
|
||||
* conversational format
|
||||
```json
|
||||
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "..."}]}
|
||||
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "..."}]}
|
||||
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "..."}]}
|
||||
```
|
||||
* instruction format
|
||||
```json
|
||||
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
|
||||
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
|
||||
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
|
||||
```
|
||||
|
||||
If your dataset uses one of the above formats, you can directly pass it to the trainer without pre-processing. The [`SFTTrainer`] will then format the dataset for you using the defined format from the model's tokenizer with the [apply_chat_template](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models) method.
|
||||
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
from trl import SFTConfig, SFTTrainer
|
||||
|
||||
...
|
||||
|
||||
# load jsonl dataset
|
||||
dataset = load_dataset("json", data_files="path/to/dataset.jsonl", split="train")
|
||||
# load dataset from the HuggingFace Hub
|
||||
dataset = load_dataset("philschmid/dolly-15k-oai-style", split="train")
|
||||
|
||||
...
|
||||
|
||||
sft_config = SFTConfig(packing=True)
|
||||
trainer = SFTTrainer(
|
||||
"facebook/opt-350m",
|
||||
args=sft_config,
|
||||
train_dataset=dataset,
|
||||
)
|
||||
```
|
||||
|
||||
If the dataset is not in one of those formats, you can either preprocess the dataset to match the formatting or pass a formatting function to the SFTTrainer to do it for you. Let's have a look.
|
||||
|
||||
|
||||
### Format your input prompts
|
||||
|
||||
For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt & the other for the response.
|
||||
This allows people to format examples like [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca) did as follows:
|
||||
```bash
|
||||
Below is an instruction ...
|
||||
|
||||
### Instruction
|
||||
{prompt}
|
||||
|
||||
### Response:
|
||||
{completion}
|
||||
```
|
||||
Let us assume your dataset has two fields, `question` and `answer`. Therefore you can just run:
|
||||
```python
|
||||
...
|
||||
def formatting_prompts_func(example):
|
||||
output_texts = []
|
||||
for i in range(len(example['question'])):
|
||||
text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
|
||||
output_texts.append(text)
|
||||
return output_texts
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model,
|
||||
args=sft_config,
|
||||
train_dataset=dataset,
|
||||
formatting_func=formatting_prompts_func,
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
To properly format your input, make sure to process all the examples by looping over them and returning a list of processed texts. Check out a full example of how to use the SFTTrainer on the alpaca dataset [here](https://github.com/huggingface/trl/pull/444#issue-1760952763)
|
||||
|
||||
### Packing dataset ([`ConstantLengthDataset`])
|
||||
|
||||
[`SFTTrainer`] supports _example packing_, where multiple short examples are packed in the same input sequence to increase training efficiency. This is done with the [`ConstantLengthDataset`] utility class that returns constant length chunks of tokens from a stream of examples. To enable the usage of this dataset class, simply pass `packing=True` to the [`SFTConfig`] constructor.
|
||||
|
||||
```python
|
||||
...
|
||||
sft_config = SFTConfig(packing=True, dataset_text_field="text")
|
||||
|
||||
trainer = SFTTrainer(
|
||||
"facebook/opt-350m",
|
||||
train_dataset=dataset,
|
||||
args=sft_config
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
Note that if you use a packed dataset and pass `max_steps` in the training arguments, you will probably train your models for more than a few epochs, depending on how you have configured the packed dataset and the training protocol. Double check that you know and understand what you are doing.
|
||||
If you don't want to pack your `eval_dataset`, you can pass `eval_packing=False` to the `SFTConfig` init method.
|
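For example, a minimal sketch:

```python
# Pack the training set but keep the evaluation set unpacked.
sft_config = SFTConfig(packing=True, eval_packing=False, output_dir="/tmp")
```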
||||
|
||||
#### Customize your prompts using packed dataset
|
||||
|
||||
If your dataset has several fields that you want to combine, for example if the dataset has `question` and `answer` fields and you want to combine them, you can pass a formatting function to the trainer that will take care of that. For example:
|
||||
|
||||
```python
|
||||
def formatting_func(example):
|
||||
text = f"### Question: {example['question']}\n ### Answer: {example['answer']}"
|
||||
return text
|
||||
|
||||
sft_config = SFTConfig(packing=True)
|
||||
trainer = SFTTrainer(
|
||||
"facebook/opt-350m",
|
||||
train_dataset=dataset,
|
||||
args=sft_config,
|
||||
formatting_func=formatting_func
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
You can also customize the [`ConstantLengthDataset`] much more by directly passing the arguments to the [`SFTConfig`] constructor. Please refer to that class' signature for more information.
|
||||
|
||||
### Control over the pretrained model
|
||||
|
||||
You can directly pass the kwargs of the `from_pretrained()` method to the [`SFTConfig`]. For example, if you want to load a model in a different precision, analogous to
|
||||
|
||||
```python
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16)
|
||||
|
||||
...
|
||||
|
||||
sft_config = SFTConfig(
|
||||
model_init_kwargs={
|
||||
"torch_dtype": "bfloat16",
|
||||
},
|
||||
output_dir="/tmp",
|
||||
)
|
||||
trainer = SFTTrainer(
|
||||
"facebook/opt-350m",
|
||||
train_dataset=dataset,
|
||||
args=sft_config,
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
Note that all keyword arguments of `from_pretrained()` are supported.
|
||||
|
||||
### Training adapters
|
||||
|
||||
We also support tight integration with the 🤗 PEFT library, so that any user can conveniently train adapters and share them on the Hub instead of training the entire model.
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
from trl import SFTConfig, SFTTrainer
|
||||
from peft import LoraConfig
|
||||
|
||||
dataset = load_dataset("stanfordnlp/imdb", split="train")
|
||||
|
||||
peft_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
lora_dropout=0.05,
|
||||
bias="none",
|
||||
task_type="CAUSAL_LM",
|
||||
)
|
||||
|
||||
trainer = SFTTrainer(
|
||||
"EleutherAI/gpt-neo-125m",
|
||||
train_dataset=dataset,
|
||||
args=SFTConfig(output_dir="/tmp"),
|
||||
peft_config=peft_config
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
You can also continue training your `PeftModel`. For that, first load a `PeftModel` outside `SFTTrainer` and pass it directly to the trainer without the `peft_config` argument being passed.
|
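A minimal sketch of continuing training from an existing adapter (`"path/to/peft-adapter"` is a placeholder):

```python
from peft import AutoPeftModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Load the saved adapter together with its base model, keeping it trainable.
model = AutoPeftModelForCausalLM.from_pretrained("path/to/peft-adapter", is_trainable=True)

trainer = SFTTrainer(
    model,  # no `peft_config` here: the model is already a PeftModel
    train_dataset=dataset,
    args=SFTConfig(output_dir="/tmp"),
)
trainer.train()
```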
||||
|
||||
### Training adapters with base 8 bit models
|
||||
|
||||
For that, you need to first load your 8 bit model outside the Trainer and pass a `PeftConfig` to the trainer. For example:
|
||||
|
||||
```python
|
||||
...
|
||||
|
||||
peft_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
lora_dropout=0.05,
|
||||
bias="none",
|
||||
task_type="CAUSAL_LM",
|
||||
)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"EleutherAI/gpt-neo-125m",
|
||||
load_in_8bit=True,
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model,
|
||||
train_dataset=dataset,
|
||||
args=SFTConfig(),
|
||||
peft_config=peft_config,
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
## Using Flash Attention and Flash Attention 2
|
||||
|
||||
You can benefit from Flash Attention 1 & 2 using the SFTTrainer out of the box with minimal code changes.
|
||||
First, to make sure you have all the latest features from transformers, install transformers from source
|
||||
|
||||
```bash
|
||||
pip install -U git+https://github.com/huggingface/transformers.git
|
||||
```
|
||||
|
||||
Note that Flash Attention currently only works on GPU and in half-precision (when using adapters, only the base model needs to be loaded in half-precision).
|
||||
Note also that both features are fully compatible with other tools such as quantization.
|
||||
|
||||
### Using Flash-Attention 1
|
||||
|
||||
For Flash Attention 1 you can use the `BetterTransformer` API and force-dispatch the API to use Flash Attention kernel. First, install the latest optimum package:
|
||||
|
||||
```bash
|
||||
pip install -U optimum
|
||||
```
|
||||
|
||||
Once you have loaded your model, wrap the `trainer.train()` call under the `with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):` context manager:
|
||||
|
||||
```diff
|
||||
...
|
||||
|
||||
+ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
Note that you cannot train your model using Flash Attention 1 on an arbitrary dataset as `torch.scaled_dot_product_attention` does not support training with padding tokens if you use Flash Attention kernels. Therefore you can only use that feature with `packing=True`. If your dataset contains padding tokens, consider switching to Flash Attention 2 integration.
|
||||
|
||||
Below are some numbers you can get in terms of speedup and memory efficiency, using Flash Attention 1, on a single NVIDIA-T4 16GB.
|
||||
|
||||
| use_flash_attn_1 | model_name | max_seq_len | batch_size | time per training step |
|
||||
| ---------------- | ----------------- | ----------- | ---------- | ---------------------- |
|
||||
| x | facebook/opt-350m | 2048 | 8 | ~59.1s |
|
||||
| | facebook/opt-350m | 2048 | 8 | **OOM** |
|
||||
| x | facebook/opt-350m | 2048 | 4 | ~30.3s |
|
||||
| | facebook/opt-350m | 2048 | 4 | ~148.9s |
|
||||
|
||||

### Using Flash Attention 2

To use Flash Attention 2, first install the latest `flash-attn` package:

```bash
pip install -U flash-attn
```

And add `attn_implementation="flash_attention_2"` when calling `from_pretrained`:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    attn_implementation="flash_attention_2"
)
```

If you don't use quantization, make sure your model is loaded in half precision and dispatched on a supported GPU device.
After loading your model, you can either train it as is, or attach adapters and train those in case your model is quantized.
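
For the non-quantized case, a minimal sketch of loading in half precision on a GPU (reusing the `model_id` placeholder from above):

```python
import torch
from transformers import AutoModelForCausalLM

# Load in bfloat16 (use torch.float16 on pre-Ampere GPUs) and dispatch to GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```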

In contrast to Flash Attention 1, the integration makes it possible to train your model on an arbitrary dataset that also includes padding tokens.

### Using the model creation utility

We included a utility function to create your model.

[[autodoc]] ModelConfig

```python
import torch
from transformers import AutoModelForCausalLM
from trl import ModelConfig, SFTTrainer, get_kbit_device_map, get_peft_config, get_quantization_config

model_config = ModelConfig(
    model_name_or_path="facebook/opt-350m",
    attn_implementation=None,  # or "flash_attention_2"
)
torch_dtype = (
    model_config.torch_dtype
    if model_config.torch_dtype in ["auto", None]
    else getattr(torch, model_config.torch_dtype)
)
quantization_config = get_quantization_config(model_config)
model_kwargs = dict(
    revision=model_config.model_revision,
    trust_remote_code=model_config.trust_remote_code,
    attn_implementation=model_config.attn_implementation,
    torch_dtype=torch_dtype,
    use_cache=False if training_args.gradient_checkpointing else True,  # training_args is your SFTConfig
    device_map=get_kbit_device_map() if quantization_config is not None else None,
    quantization_config=quantization_config,
)
model = AutoModelForCausalLM.from_pretrained(model_config.model_name_or_path, **model_kwargs)
trainer = SFTTrainer(
    ...,
    model=model,
    peft_config=get_peft_config(model_config),
)
```

### Enhance the model's performance using NEFTune

NEFTune is a technique to boost the performance of chat models, introduced in the paper ["NEFTune: Noisy Embeddings Improve Instruction Finetuning"](https://huggingface.co/papers/2310.05914) by Jain et al. It consists of adding noise to the embedding vectors during training. According to the abstract of the paper:

> Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/neft-screenshot.png">
</div>

To use it in `SFTTrainer`, simply pass `neftune_noise_alpha` when creating your `SFTConfig` instance. Note that to avoid any surprising behaviour, NEFTune is disabled after training to restore the original behaviour of the embedding layer.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("stanfordnlp/imdb", split="train")

sft_config = SFTConfig(
    neftune_noise_alpha=5,
)
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    args=sft_config,
)
trainer.train()
```

We have tested NEFTune by training `mistralai/Mistral-7B-v0.1` on the [OpenAssistant dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) and validated that using NEFTune led to a performance boost of ~25% on MT Bench.

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl-neftune-mistral-7b.png">
</div>

Note, however, that the amount of performance gain is _dataset dependent_; in particular, applying NEFTune on synthetic datasets like [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) typically produces smaller gains.

### Accelerate fine-tuning 2x using `unsloth`

You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) using the [`unsloth`](https://github.com/unslothai/unsloth) library, which is fully compatible with `SFTTrainer`. Currently `unsloth` supports only the Llama (Yi, TinyLlama, Qwen, Deepseek, etc.) and Mistral architectures. Some benchmarks on a single A100 are listed below:

| 1 A100 40GB     | Dataset   | 🤗  | 🤗 + Flash Attention 2 | 🦥 Unsloth | 🦥 VRAM saved |
| --------------- | --------- | --- | ---------------------- | ---------- | ------------- |
| Code Llama 34b  | Slim Orca | 1x  | 1.01x                  | **1.94x**  | -22.7%        |
| Llama-2 7b      | Slim Orca | 1x  | 0.96x                  | **1.87x**  | -39.3%        |
| Mistral 7b      | Slim Orca | 1x  | 1.17x                  | **1.88x**  | -65.9%        |
| Tiny Llama 1.1b | Alpaca    | 1x  | 1.55x                  | **2.74x**  | -57.8%        |

First install `unsloth` according to the [official documentation](https://github.com/unslothai/unsloth). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading `AutoModelForCausalLM`, you just need to load a `FastLanguageModel` as follows:

```python
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

max_seq_length = 2048  # Supports automatic RoPE Scaling, so choose any number

# Load a dataset with a "text" column (here we assume the IMDB dataset used above)
dataset = load_dataset("stanfordnlp/imdb", split="train")

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b",
    max_seq_length=max_seq_length,
    dtype=None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,  # Use 4bit quantization to reduce memory usage. Can be False
    # token="hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Dropout = 0 is currently optimized
    bias="none",  # Bias = "none" is currently optimized
    use_gradient_checkpointing=True,
    random_state=3407,
)

args = SFTConfig(
    output_dir="./output",
    max_seq_length=max_seq_length,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth).

## Liger Kernel: 20% higher throughput and 60% less memory for multi-GPU training

[Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%. That way, we can **4x** our context length, as described in the benchmark below. The authors have implemented Hugging Face-compatible `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, and more to come. The kernels work out of the box with [Flash Attention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed).

With this memory reduction, you can potentially turn off CPU offloading or gradient checkpointing to further boost performance.

| Speed Up | Memory Reduction |
|--------------------------|-------------------------|
|  |  |

1. To use Liger Kernel in `SFTTrainer`, first install it with:

```bash
pip install liger-kernel
```

2. Once installed, set `use_liger` in [SFTConfig](https://github.com/huggingface/trl/blob/850ddcf598984013007d384c6b3e311def2a616e/trl/trainer/sft_config.py#L69). No other changes are needed!

```python
config = SFTConfig(
    use_liger=True
)
```

To learn more about Liger Kernel, visit their [official repository](https://github.com/linkedin/Liger-Kernel/).

## Best practices

Pay attention to the following best practices when training a model with this trainer:

- [`SFTTrainer`] by default always pads the sequences to the `max_seq_length` argument of the [`SFTTrainer`]. If none is passed, the trainer will retrieve that value from the tokenizer. Some tokenizers do not provide a default value, so there is a check to retrieve the minimum between 2048 and that value. Make sure to check it before training; a minimal sketch of setting it explicitly is shown after this list.
- For training adapters in 8-bit, you might need to tweak the arguments of the `prepare_model_for_kbit_training` method from PEFT; hence we advise users to either use the `prepare_in_int8_kwargs` field, or create the `PeftModel` outside the [`SFTTrainer`] and pass it in.
- For more memory-efficient training using adapters, you can load the base model in 8-bit; for that, simply add the `load_in_8bit` argument when creating the [`SFTTrainer`], or create a base model in 8-bit outside the trainer and pass it in.
- If you create a model outside the trainer, make sure not to pass the trainer any additional keyword arguments that relate to the `from_pretrained()` method.
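
A minimal sketch of setting `max_seq_length` explicitly, as suggested in the first point (the value `512` is only an illustration):

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,  # assumes a dataset loaded as in the earlier examples
    args=SFTConfig(max_seq_length=512),  # set explicitly instead of relying on the tokenizer default
)
```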

## Multi-GPU Training

Trainer (and thus SFTTrainer) supports multi-GPU training. If you run your script with `python script.py` it will default to using DP as the strategy, which may be [slower than expected](https://github.com/huggingface/trl/issues/1303). To use DDP (which is generally recommended, see [here](https://huggingface.co/docs/transformers/en/perf_train_gpu_many?select-gpu=Accelerate#data-parallelism) for more info) you must launch the script with `python -m torch.distributed.launch script.py` or `accelerate launch script.py`. For DDP to work you must also check the following:
- If you're using gradient checkpointing, add the following to the TrainingArguments: `gradient_checkpointing_kwargs={'use_reentrant': False}` (more info [here](https://github.com/huggingface/transformers/issues/26969)); a minimal sketch is shown after this list
- Ensure that the model is placed on the correct device:
```python
from accelerate import PartialState

# Each process loads the model onto its own device
device_string = PartialState().process_index
model = AutoModelForCausalLM.from_pretrained(
    ...,
    device_map={"": device_string},
)
```
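
A minimal sketch of the gradient-checkpointing setting mentioned in the first point:

```python
from trl import SFTConfig

# Non-reentrant checkpointing is required for gradient checkpointing to work with DDP
args = SFTConfig(
    output_dir="./output",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```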

## GPTQ Conversion

You may experience some issues with GPTQ quantization after completing training. Lowering `gradient_accumulation_steps` to `4` will resolve most issues during the quantization process to GPTQ format.
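
For example, a minimal sketch of that setting in your training config:

```python
from trl import SFTConfig

# Lower gradient accumulation to avoid issues when later quantizing to GPTQ
args = SFTConfig(output_dir="./output", gradient_accumulation_steps=4)
```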

## Extending `SFTTrainer` for Vision Language Models

`SFTTrainer` does not inherently support vision-language data. However, we provide a guide on how to tweak the trainer to support vision-language data. Specifically, you need to use a custom data collator that is compatible with vision-language data. This guide outlines the steps to make these adjustments. For a concrete example, refer to the script [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py), which demonstrates how to fine-tune the LLaVA 1.5 model on the [HuggingFaceH4/llava-instruct-mix-vsft](https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft) dataset.

### Preparing the Data

The data format is flexible, provided it is compatible with the custom collator that we will define later. A common approach is to use conversational data. Given that the data includes both text and images, the format needs to be adjusted accordingly. Below is an example of a conversational data format involving both text and images:

```python
images = ["obama.png"]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Who is this?"},
            {"type": "image"}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Barack Obama"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is he famous for?"}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "He is the 44th President of the United States."}
        ]
    }
]
```

To illustrate how this data format will be processed using the LLaVA model, you can use the following code:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(processor.apply_chat_template(messages, tokenize=False))
```

The output will be formatted as follows:

```txt
Who is this? ASSISTANT: Barack Obama USER: What is he famous for? ASSISTANT: He is the 44th President of the United States.
```

<iframe src="https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft/embed/viewer/default/train" frameborder="0" width="100%" height="560px"></iframe>

### A custom collator for processing multi-modal data

Unlike the default behavior of `SFTTrainer`, processing multi-modal data is done on the fly during the data collation process. To do this, you need to define a custom collator that processes both the text and images. This collator must take a list of examples as input (see the previous section for an example of the data format) and return a batch of processed data. Below is an example of such a collator:

```python
def collate_fn(examples):
    # Get the texts and images, and apply the chat template
    texts = [processor.apply_chat_template(example["messages"], tokenize=False) for example in examples]
    images = [example["images"][0] for example in examples]

    # Tokenize the texts and process the images
    batch = processor(texts, images, return_tensors="pt", padding=True)

    # The labels are the input_ids, and we mask the padding tokens in the loss computation
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels

    return batch
```

We can verify that the collator works as expected by running the following code:

```python
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")
examples = [dataset[0], dataset[1]]  # Just two examples for the sake of the example
collated_data = collate_fn(examples)
print(collated_data.keys())  # dict_keys(['input_ids', 'attention_mask', 'pixel_values', 'labels'])
```

### Training the vision-language model

Now that we have prepared the data and defined the collator, we can proceed with training the model. To ensure that the data is not processed as text-only, we need to set a couple of arguments in the `SFTConfig`, specifically `dataset_text_field` and `remove_unused_columns`. We also need to set `skip_prepare_dataset` to `True` to avoid the default processing of the dataset. Below is an example of how to set up the `SFTTrainer`.

```python
args.dataset_text_field = ""  # needs a dummy field
args.remove_unused_columns = False
args.dataset_kwargs = {"skip_prepare_dataset": True}

trainer = SFTTrainer(
    model=model,
    args=args,
    data_collator=collate_fn,
    train_dataset=train_dataset,
    tokenizer=processor.tokenizer,
)
```

A full example of training LLaVa 1.5 on the [HuggingFaceH4/llava-instruct-mix-vsft](https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft) dataset can be found in the script [`examples/scripts/vsft_llava.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py).

- [Experiment tracking](https://wandb.ai/huggingface/trl/runs/2b2c5l7s)
- [Trained model](https://huggingface.co/HuggingFaceH4/sft-llava-1.5-7b-hf)

## SFTTrainer

[[autodoc]] SFTTrainer

## SFTConfig

[[autodoc]] SFTConfig

## Datasets

In the SFTTrainer we also support `datasets.IterableDataset` in addition to map-style datasets. This is useful if you are using large corpora that you do not want to save to disk in full. The data will be tokenized and processed on the fly, even when packing is enabled.

Additionally, in the SFTTrainer, we support pre-tokenized datasets if they are `datasets.Dataset` or `datasets.IterableDataset`. In other words, if such a dataset has a column of `input_ids`, no further processing (tokenization or packing) will be done, and the dataset will be used as-is. This can be useful if you have pretokenized your dataset outside of this script and want to re-use it directly. A minimal sketch of this is shown below.
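
A minimal sketch of the pre-tokenized case (the token ids are arbitrary placeholders):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Because the dataset already has an `input_ids` column, the trainer skips
# tokenization and packing and uses it as-is
train_dataset = Dataset.from_dict({"input_ids": [[101, 7592, 102], [101, 2088, 102]]})

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=train_dataset,
    args=SFTConfig(),
)
```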

### ConstantLengthDataset

[[autodoc]] trainer.ConstantLengthDataset
@ -1,30 +0,0 @@

# Summarization Example

The script in this example shows how to train a reward model for summarization, following the OpenAI Learning to Summarize from Human Feedback [paper](https://arxiv.org/abs/2009.01325). We've validated that the script can be used to train a small GPT2 to get slightly over 60% validation accuracy, which is aligned with results from the paper. The model is [here](https://huggingface.co/Tristan/gpt2_reward_summarization).

Here's an overview of the relevant files in the [trl repository](https://github.com/lvwerra/trl/tree/main/examples):

| File | Description |
|---|---|
| `scripts/reward_summarization.py` | For tuning the reward model. |
| `scripts/ds3_reward_summarization_example_config.json` | Can be used with the reward model script to scale it up to arbitrarily big models that don't fit on a single GPU. |

## Installation

```bash
pip install trl
pip install evaluate
# optional: deepspeed
pip install deepspeed
```

```bash
# If you want your reward model to follow the Learning to Summarize from Human Feedback paper closely, then tune a GPT model on summarization and then instantiate the reward model
# with it. In other words, pass in the name of your summarization-finetuned gpt on the hub, instead of the name of the pretrained gpt2 like we do in the following examples of how
# to run this script.

# Example of running this script with the small size gpt2 on a 40GB A100 (A100s support bf16). Here, the global batch size will be 64:
python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16

# Example of running this script with the xl size gpt2 on 16 40GB A100s. Here the global batch size will still be 64:
python -m torch.distributed.launch --nproc_per_node=16 reward_summarization.py --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --gradient_accumulation_steps=4 --gpt_model_name=gpt2-xl --bf16 --deepspeed=ds3_reward_summarization_example_config.json
```
197 docs/source/text_environments.md Normal file
@ -0,0 +1,197 @@

# Text Environments

Text environments provide a learning ground for language agents. They allow a language model to use tools to accomplish a task, such as using a Python interpreter to answer math questions or using a search index for trivia questions. Having access to tools allows language models to solve tasks that would be very hard for the model itself but can be trivial for the appropriate tool. A good example is arithmetic on large numbers, which becomes a simple copy-paste task once the model has access to a calculator.

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/textenv.png">
</div>

Let's dive into how text environments work and start with tools!

## Tools

One of the core building blocks of text environments are the tools that the model can use to solve tasks. In general, a tool can be any Python function that takes a string as input and returns a string. The `TextEnvironment` offers two options for tools: either use predefined tools from `transformers.Tool`, or define your own function or class with a `__call__` method. Let's have a look at both!

### `transformers.Tool`

Text environments fully support tools of the class `transformers.Tool`. The advantage of building tools in that framework is that they can easily be shared:

```Python
from transformers import load_tool

# simple calculator tool that runs +-/* operations
calc_tool = load_tool("ybelkada/simple-calculator")

# python interpreter that executes program and returns outputs
py_tool = load_tool("lvwerra/python-interpreter")

# wikipedia search index that returns best search match
wiki_tool = load_tool("vwxyzjn/pyserini-wikipedia-kilt-doc")
```

These tools are either loaded from the Hub or from a local folder. Using a tool is as simple as calling it with a text query:

```Python
calc_tool("1/2")
>>> "0.5"
```

Note that both input and return values are strings, which enables easy usage with a language model.

### Custom Tools

The following is an example of a tool that adds two integers:

```Python
def add(text):
    int_1, int_2 = text.split("+")
    result = int(int_1) + int(int_2)
    return str(result)

print(add("1+1"))
>>> "2"
```
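
Since a tool can also be a class with a `__call__` method, as mentioned above, here is a minimal sketch of the class-based form; the `Reverse` tool is purely illustrative:

```Python
class Reverse:
    """Toy tool that reverses its input string."""

    def __call__(self, text):
        return text[::-1]

print(Reverse()("hello"))
>>> "olleh"
```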

We looked at basic examples such as a calculator, but the principle holds for more complex tools as well, such as a web search tool where you input the query and get the search results in return. Now let's look at how the model can use the tools with the call syntax.

### Call syntax

In order to have a unified way for the model to call a tool, we created a simple syntax that looks as follows:

```python
"<request><TOOL_NAME>QUERY<call>TOOL_RESPONSE<response>"
```

There are a few special tokens involved, so let's decompose it: First, the model can signal that it wants to use a tool by emitting the `<request>` token. After that, we want to know the name of the tool to call, which is done by enclosing the tool name in `<>` brackets. Once we know which tool to call, the tool query follows, in free-text form. The `<call>` token signifies the end of the query and stops the model generation. At this point the model output is parsed and the query is sent to the tool. The environment then appends the tool response to the string, followed by the `<response>` token to mark the end of the tool output.

Let's look at the concrete example of the calculator and assume its name is `Calculator` (more on how the name of a tool is inferred later):

```python
"<request><Calculator>1/2<call>0.5<response>"
```

Finally, the episode ends and generation stops when the model generates `<submit>`, which marks the interaction as completed.

Now let's have a look at how we can create a new text environment!

## Create a `TextEnvironment`

```python
from transformers import load_tool
from trl import TextEnvironment

prompt = """\
What is 13-3?
<request><SimpleCalculatorTool>13-3<call>10.0<response>
Result=10<submit>
"""

def reward_fn(result, answer):
    """Simplified reward function returning 1 if result matches answer and 0 otherwise."""
    result_parsed = result.split("=")[1].split("<")[0]
    return int(result_parsed == answer)

text_env = TextEnvironment(
    model=model,  # assumes a model and tokenizer have been loaded beforehand
    tokenizer=tokenizer,
    tools={"SimpleCalculatorTool": load_tool("ybelkada/simple-calculator")},
    reward_fn=reward_fn,
    prompt=prompt,
    max_turns=1,
    max_tool_response=100,
    generation_kwargs={"do_sample": True},
)
```

Let's decompose the settings:

| Argument | Description |
|:-------------------|:----------------|
| `model` | Language model to interact with the environment and generate requests. |
| `tokenizer` | Tokenizer of the language model handling tokenization of strings. |
| `tools` | `list` or `dict` of tools. If a list, the tool name is inferred from the class name; if a dict, the keys define the tool names. |
| `reward_fn` | A function that takes a string as input and returns a reward. Can have extra arguments that are passed to `.run()`, such as the ground truth. |
| `prompt` | Prompt to prepend to every task. Usually a few examples to demonstrate to the model how to use the tools in a few-shot fashion. |
| `max_turns` | Maximum number of interactions between model and tools before the episode ends. |
| `max_tool_response` | The tool response is truncated to this length to avoid running out of model context. |
| `max_length` | The maximum number of tokens to allow in an episode. |
| `generation_kwargs` | Generation settings used by the language model. |

You can customize the environment to your needs and add custom tools and settings. Let's see how you can use the environment to have the model interact with the available tools!

## Run an Episode

To run a set of queries through the text environment, you can simply use the `run` method.

```python
queries = ["What is 1/2?"]
answers = ["0.5"]

queries, responses, masks, rewards, histories = text_env.run(queries, answers=answers)
```

This will execute the model/tool feedback loop for each query until either no tool is called anymore, the maximum number of turns is reached, or the maximum number of tokens in an episode is exceeded. The extra `kwargs` (e.g. `answers=answers` above) passed to `run` will be passed on to the reward function.

There are five objects that are returned by `run`:

- `queries`: a list of the tokenized queries
- `responses`: all tokens that have been generated within the environment, including model and tool tokens
- `masks`: masks that indicate which tokens were generated by the model and which tokens were generated by the tools
- `rewards`: a list of rewards, one for each query/response
- `histories`: a list of `TextHistory` objects, which are useful objects containing all of the above as well as the text equivalents

The masks are crucial for training, as we don't want to optimize tokens that the model has not generated, namely the tokens produced by the tools.

Next, we'll train a PPO step with the generated responses!

### Train

Training on episodes from the `TextEnvironment` is straightforward and simply requires forwarding all the returned variables except the `TextHistory` objects to the `step` method:

```python
train_stats = ppo_trainer.step(queries, responses, rewards, masks)
```

## `TextHistory`

The `TextHistory` object stores the interactions between the model and the text environment. It stores the tokens and text generated in each turn, their source in each turn (model or system), as well as rewards. Let's go through the class attributes and methods.

### Attributes

The following table summarises the available attributes of the `TextHistory` class:

| Attribute | Description |
|:-------------------|:----------------|
| `text` | The full string of the text generated in the text environment, with both model- and system-generated text. |
| `text_spans` | A list of tuples with the spans for each model- or system-generated text segment. |
| `system_spans` | A list of boolean values indicating whether a segment is model- or system-generated. |
| `tokens` | All tokens generated in the text environment, with both model- and system-generated tokens. |
| `token_spans` | Similar to `text_spans`, the `token_spans` indicate the boundaries of model- and system-generated tokens. |
| `token_masks` | The token masks can be used to ignore system-generated tokens by masking them. |
| `completed` | Indicates whether the interaction with the environment has completed. |
| `truncated` | Indicates whether the interaction with the environment completed because the maximum length was reached. |

With these attributes you can reconstruct every interaction of the model with the `TextEnvironment`. The `TextHistory` also lets you visualize the text history. Let's have a look!
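
A minimal sketch of inspecting the first episode returned by `text_env.run` in the example above:

```python
# `histories` was returned by text_env.run(...) earlier
history = histories[0]

print(history.text)       # full episode text: prompt, model turns, and tool responses
print(history.completed)  # True once the model emitted <submit>
print(history.truncated)  # True if the episode instead hit the maximum length

# iterate over text segments together with their origin
for (start, end), is_system in zip(history.text_spans, history.system_spans):
    origin = "system" if is_system else "model"
    print(f"[{origin}] {history.text[start:end]}")
```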

### Visualization

When the model interacts inside the `TextEnvironment`, it can be useful to visualize and separate which parts of the text output were generated by the model and which parts came from the system and tools. For that purpose there are the two methods [`TextHistory.show_text`] and [`TextHistory.show_tokens`]. They print the text and tokens, respectively, and highlight the various segments using the [`rich` library](https://github.com/Textualize/rich) (make sure to install it before using these methods).

You can see that the prompt is highlighted in gray, whereas system segments such as query and tool responses are highlighted in green. All segments generated by the model are highlighted in blue and, in addition to the pure text output, the reward is displayed as additional text in plum. Here is an example of `show_text`:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/textenv_show_text.png" width=600>
</div>

Sometimes there can be tricky tokenization-related issues that are hidden when showing the decoded text. Thus `TextHistory` also offers an option to display the same highlighting on the tokens directly with `show_tokens`:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/textenv_show_tokens.png" width=800>
</div>

Note that you can turn on the colour legend by passing `show_legend=True`.

## API Documentation

[[autodoc]] TextEnvironment

[[autodoc]] TextHistory
@ -1,16 +0,0 @@

# Trainer

At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper "Fine-Tuning Language Models from Human Preferences" by D. Ziegler et al. [[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)].
The Trainer and model classes are largely inspired by the `transformers.Trainer` and `transformers.AutoModel` classes and adapted for RL.

## PPOConfig

[[autodoc]] PPOConfig

## PPOTrainer

[[autodoc]] PPOTrainer

## set_seed

[[autodoc]] set_seed
58 docs/source/use_model.md Normal file
@ -0,0 +1,58 @@

# Use model after training

Once you have trained a model using either the SFTTrainer, PPOTrainer, or DPOTrainer, you will have a fine-tuned model that can be used for text generation. In this section, we'll walk through the process of loading the fine-tuned model and generating text. If you need to run an inference server with the trained model, you can explore libraries such as [`text-generation-inference`](https://github.com/huggingface/text-generation-inference).

## Load and Generate

If you have fine-tuned a model fully, meaning without the use of PEFT, you can simply load it like any other language model in transformers. E.g. the value head that was trained during the PPO training is no longer needed, and if you load the model with the original transformer class it will be ignored:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "kashif/stack-llama-2"  # path/to/your/model/or/name/on/hub
device = "cpu"  # or "cuda" if you have a GPU

model = AutoModelForCausalLM.from_pretrained(model_name_or_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

inputs = tokenizer.encode("This movie was really", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

Alternatively you can also use the pipeline:

```python
from transformers import pipeline

model_name_or_path = "kashif/stack-llama-2"  # path/to/your/model/or/name/on/hub
pipe = pipeline("text-generation", model=model_name_or_path)
print(pipe("This movie was really")[0]["generated_text"])
```

## Use PEFT Adapters

```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "kashif/stack-llama-2"  # path/to/your/model/or/name/on/hub
adapter_model_name = "path/to/my/adapter"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
```

You can also merge the adapters into the base model so you can use the model like a normal transformers model; however, the checkpoint will be significantly bigger:

```python
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

model = model.merge_and_unload()
model.save_pretrained("merged_adapters")
```

Once you have the model loaded, and have either merged the adapters or kept them separately on top, you can run generation as with a normal model as outlined above.
160 docs/source/using_llama_models.mdx Normal file
@ -0,0 +1,160 @@

# Using LLaMA models with TRL

We've begun rolling out examples to use Meta's LLaMA models in `trl` (see [Meta's LLaMA release](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) for the original LLaMA model).

## Efficient training strategies

Even training the smallest LLaMA model requires an enormous amount of memory. Some quick math: in bf16, every parameter uses 2 bytes (in fp32, 4 bytes) in addition to 8 bytes used, e.g., in the Adam optimizer (see the [performance docs](https://huggingface.co/docs/transformers/perf_train_gpu_one#optimizer) in Transformers for more info). So a 7B parameter model would use `(2+8)*7B=70GB` just to fit in memory and would likely need more when you compute intermediate values such as attention scores. So you couldn't train the model even on a single 80GB A100 like that. You can use some tricks, like more efficient optimizers or half-precision training, to squeeze a bit more into memory, but you'll run out sooner or later.

Another option is to use Parameter-Efficient Fine-Tuning (PEFT) techniques, such as the [`peft`](https://github.com/huggingface/peft) library, which can perform low-rank adaptation (LoRA) on a model loaded in 8-bit.
For more on `peft` + `trl`, see the [docs](https://huggingface.co/docs/trl/sentiment_tuning_peft).

Loading the model in 8-bit reduces the memory footprint drastically, since you only need one byte per parameter for the weights (e.g. 7B LLaMA is 7GB in memory).
Instead of training the original weights directly, LoRA adds small adapter layers on top of some specific layers (usually the attention layers); thus, the number of trainable parameters is drastically reduced.

In this scenario, a rule of thumb is to allocate ~1.2-1.4GB per billion parameters (depending on the batch size and sequence length) to fit the entire fine-tuning setup.
This enables fine-tuning larger models (up to 50-60B scale models on an NVIDIA A100 80GB) at low cost.

Now we can fit very large models into a single GPU, but the training might still be very slow.
The simplest strategy in this scenario is data parallelism: we replicate the same training setup onto separate GPUs and pass different batches to each GPU.
With this, you can parallelize the forward/backward passes of the model and scale with the number of GPUs.

![chapter1_ddp](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/stackllama/chapter1_ddp.png)

We use either the `transformers.Trainer` or `accelerate`, which both support data parallelism without any code changes, by simply passing arguments when calling the scripts with `torchrun` or `accelerate launch`. The following runs a training script with 8 GPUs on a single machine with `accelerate` and `torchrun`, respectively.

```bash
accelerate launch --multi_gpu --num_machines 1  --num_processes 8 my_accelerate_script.py
torchrun --nnodes 1  --nproc_per_node 8 my_torch_script.py
```

## Supervised fine-tuning

Before we start training reward models and tuning our model with RL, it helps if the model is already good in the domain we are interested in.
In our case, we want it to answer questions, while for other use cases, we might want it to follow instructions, in which case instruction tuning is a great idea.
The easiest way to achieve this is by continuing to train the language model with the language modeling objective on texts from the domain or task.
The [StackExchange dataset](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences) is enormous (over 10 million instructions), so we can easily train the language model on a subset of it.

There is nothing special about fine-tuning the model before doing RLHF - it's just the causal language modeling objective from pretraining that we apply here.
To use the data efficiently, we use a technique called packing: instead of having one text per sample in the batch and then padding to either the longest text or the maximal context of the model, we concatenate a lot of texts with an EOS token in between and cut chunks of the context size to fill the batch without any padding.

![chapter1_packing](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/stackllama/chapter1_packing.png)

With this approach the training is much more efficient, as each token that is passed through the model is also trained on, in contrast to padding tokens, which are usually masked from the loss.
If you don't have much data and are more concerned about occasionally cutting off some tokens that overflow the context, you can also use a classical data loader.

The packing is handled by the `ConstantLengthDataset` (a minimal sketch of its usage follows the snippet below) and we can then use the `Trainer` after loading the model with `peft`. First, we load the model in int8, prepare it for training, and then add the LoRA adapters.

```python
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM

# load model in 8bit
model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    load_in_8bit=True,
    device_map={"": Accelerator().local_process_index}
)
model = prepare_model_for_kbit_training(model)

# add LoRA to model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
```
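
A minimal sketch of the packing step itself, assuming a `tokenizer` and a dataset with a `"text"` column have been loaded:

```python
from trl.trainer import ConstantLengthDataset

# Concatenates texts with an EOS token in between and emits fixed-size chunks,
# so no padding tokens are needed
train_dataset = ConstantLengthDataset(
    tokenizer,
    dataset,
    dataset_text_field="text",
    seq_length=1024,
)
```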

We train the model for a few thousand steps with the causal language modeling objective and save the model.
Since we will tune the model again with different objectives, we merge the adapter weights with the original model weights, as sketched below.
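
A minimal sketch of this merge using `peft`'s `merge_and_unload` (the output path is illustrative):

```python
# `model` is the PeftModel trained above; merging folds the LoRA weights into
# the base weights so the result loads as a plain transformers model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama-se-merged")
```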

**Disclaimer:** due to LLaMA's license, we release only the adapter weights for this and the model checkpoints in the following sections.
You can apply for access to the base model's weights by filling out Meta AI's [form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) and then converting them to the 🤗 Transformers format by running this [script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
Note that you'll also need to install 🤗 Transformers from source until `v4.28` is released.

Now that we have fine-tuned the model for the task, we are ready to train a reward model.

## Reward modeling and human preferences

In principle, we could fine-tune the model using RLHF directly with the human annotations.
However, this would require us to send some samples to humans for rating after each optimization iteration.
This is expensive and slow, due to the number of training samples needed for convergence and the inherent latency of human reading and annotator speed.

A trick that works well instead of direct feedback is training a reward model on human annotations collected before the RL loop.
The goal of the reward model is to imitate how a human would rate a text. There are several possible strategies to build a reward model: the most straightforward way would be to predict the annotation (e.g. a rating score or a binary value for "good"/"bad").
In practice, what works better is to predict the ranking of two examples, where the reward model is presented with two candidates `(y_k, y_j)` for a given prompt `x` and has to predict which one would be rated higher by a human annotator.

With the StackExchange dataset, we can infer which of the two answers was preferred by the users based on the score.
With that information and the loss defined above, we can then modify the `transformers.Trainer` by adding a custom loss function.

```python
import torch.nn as nn
from transformers import Trainer

class RewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        rewards_j = model(input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"])[0]
        rewards_k = model(input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"])[0]
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss
```

We utilize a subset of 100,000 pairs of candidates and evaluate on a held-out set of 50,000. With a modest training batch size of 4, we train the Llama model using the LoRA `peft` adapter for a single epoch using the Adam optimizer with BF16 precision. Our LoRA configuration is:

```python
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
```
As detailed in the next section, the resulting adapter can be merged into the frozen model and saved for further downstream use.

## Reinforcement Learning from Human Feedback

With the fine-tuned language model and the reward model at hand, we are now ready to run the RL loop. It follows roughly three steps:

1. Generate responses from prompts,
2. Rate the responses with the reward model,
3. Run a reinforcement learning policy-optimization step with the ratings.

The Query and Response prompts are templated as follows before being tokenized and passed to the model:

```bash
Question: <Query>

Answer: <Response>
```

The same template was used for the SFT, RM and RLHF stages.
Once more, we utilize `peft` for memory-efficient training, which offers an extra advantage in the RLHF context.
Here, the reference model and policy share the same base, the SFT model, which we load in 8-bit and freeze during training.
We exclusively optimize the policy's LoRA weights using PPO while sharing the base model's weights.

```python
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    question_tensors = batch["input_ids"]

    # sample from the policy to generate responses
    response_tensors = ppo_trainer.generate(
        question_tensors,
        return_prompt=False,
        length_sampler=output_length_sampler,
        **generation_kwargs,
    )
    batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Compute sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[0]["score"] - script_args.reward_baseline) for output in pipe_outputs]

    # Run PPO step
    stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
    # Log stats to Wandb
    ppo_trainer.log_stats(stats, batch, rewards)
```

For the rest of the details and evaluation, please refer to our [blog post on StackLLaMA](https://huggingface.co/blog/stackllama).
137 docs/source/xpo_trainer.mdx Normal file
@ -0,0 +1,137 @@

# XPO Trainer

## Overview

Exploratory Preference Optimization (XPO) was proposed in the paper [Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF](https://huggingface.co/papers/2405.21046) by Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, [Corby Rosset](https://huggingface.co/corbyrosset), [Ahmed Awadallah](https://huggingface.co/AhmedAwadallah), and Alexander Rakhlin. It is a simple online preference tuning method based on the DPO loss together with a reward model (RM). XPO augments the DPO objective with an exploration bonus, allowing the method to explore outside the support of the initial model and human feedback data.

The abstract from the paper is the following:

> Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical -- a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) -- yet enjoys the strongest known provable guarantees and promising empirical performance. XPO augments the DPO objective with a novel and principled exploration bonus, empowering the algorithm to explore outside the support of the initial model and human feedback data. In theory, we show that XPO is provably sample-efficient and converges to a near-optimal language model policy under natural exploration conditions, irrespective of whether the initial model has good coverage. Our analysis, which builds on the observation that DPO implicitly performs a form of Q*-approximation (or, Bellman error minimization), combines previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the perspective of KL-regularized Markov decision processes. Empirically, we find that XPO is more sample-efficient than non-exploratory DPO variants in a preliminary evaluation.

This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Quentin Gallouédec](https://huggingface.co/qgallouedec) and [Lewis Tunstall](https://huggingface.co/lewtun).

## Quick start

This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and the [Qwen 0.5B reward model](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) as the reward model. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:

<iframe
  src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>

Below is the script to train the model:

```python
# train_xpo.py
from datasets import load_dataset
from trl import XPOConfig, XPOTrainer
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

args = XPOConfig(output_dir="xpo-qwen2", logging_steps=10)
trainer = XPOTrainer(
    model=model,
    reward_model=reward_model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```

Execute the script using the following command:

```bash
accelerate launch train_xpo.py
```

## Expected dataset format

XPO requires a [prompt-only dataset](dataset_format#preference). The [`XPOTrainer`] supports both [conversational](dataset_format#conversational-dataset-format) and [standard](dataset_format#standard-dataset-format) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset. A minimal sketch of both formats is shown below.
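
For reference, a minimal sketch of what a prompt-only example looks like in each format (values are illustrative):

```python
# standard format
prompt_only_standard = {"prompt": "The sky is"}

# conversational format (the chat template is applied automatically)
prompt_only_conversational = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
```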

## Usage tips

### ⚠️ Use the same chat template

Make sure that the SFT model and reward model use the _same_ chat template. Otherwise, you may find the model completions are scored incorrectly during training.

### Encourage EOS token generation

We may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum completion length specified in the `max_new_tokens` argument of [`XPOConfig`]. If you want to penalize the model for not generating an EOS token before the maximum completion length, you can use the `missing_eos_penalty` argument of [`XPOConfig`]:

```python
args = XPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
```

### Logging Completions

To better understand your model's behavior during training, you can log sample completions periodically using the [`LogCompletionsCallback`].

```python
from trl import LogCompletionsCallback

trainer = XPOTrainer(..., eval_dataset=eval_dataset)
completions_callback = LogCompletionsCallback(trainer, num_prompts=8)
trainer.add_callback(completions_callback)
```

This callback logs the model's generated completions directly to Weights & Biases.

![Logged Completions](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/wandb_completions.png)

## Example script

We provide an example script to train a model using the XPO method. The script is available in [`examples/scripts/xpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/xpo.py).

To test the XPO script with the [Pythia 14M model](https://huggingface.co/EleutherAI/pythia-14m) on the TL;DR summarization task, run the following command:

```bash
python examples/scripts/xpo.py \
    --model_name_or_path EleutherAI/pythia-14m  \
    --reward_model_path EleutherAI/pythia-14m \
    --dataset_name trl-lib/tldr \
    --learning_rate 5.0e-7 \
    --output_dir pythia-14m-tldr-xpo \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --num_train_epochs 3 \
    --max_new_tokens 64 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --push_to_hub
```

## Logged metrics

The logged metrics are as follows:

* `loss/xpo`: The mean XPO part of the full loss.
* `loss/dpo`: The mean DPO part of the full loss.
* `objective/kl`: The mean KL divergence between the model and reference data.
* `objective/entropy`: The mean entropy of the model and reference data.
* `objective/model_scores`: The mean scores (according to the reward model) of the model completions.
* `objective/ref_scores`: The mean scores (according to the reward model) of the reference completions.
* `objective/scores_margin`: The mean score margin (according to the external reward model) between the chosen and rejected completions.
* `rewards/chosen`: The mean reward (according to XPO's DPO implicit reward model) of the chosen completions.
* `rewards/rejected`: The mean reward (according to XPO's DPO implicit reward model) of the rejected completions.
* `rewards/accuracies`: The accuracies of XPO's implicit reward model.
* `rewards/margins`: The mean reward margin (according to online DPO's implicit reward model) between the chosen and rejected completions.
* `logps/chosen`: The mean log probabilities of the chosen completions.
* `logps/rejected`: The mean log probabilities of the rejected completions.
* `val/model_contain_eos_token`: The number of times the model's output contains the EOS token.
* `val/ref_contain_eos_token`: The number of times the reference's output contains the EOS token.
* `alpha`: The weight of the XPO loss term. Typically fixed, but it can be made dynamic by passing a list to [`XPOConfig`].
* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but it can be made dynamic by passing a list to [`XPOConfig`].

## XPOTrainer

[[autodoc]] XPOTrainer

## XPOConfig

[[autodoc]] XPOConfig
@ -1,66 +1,3 @@
|
||||
# Sentiment Examples
|
||||
# Examples
|
||||
|
||||
The notebooks and scripts in this examples show how to fine-tune a model with a sentiment classifier (such as `lvwerra/distilbert-imdb`).
|
||||
|
||||
Here's an overview of the notebooks and scripts:
|
||||
|
||||
| File | Description |
|
||||
|---|---|
|
||||
| `notebooks/gpt2-sentiment.ipynb` | Fine-tune GPT2 to generate positive movie reviews. |
|
||||
| `notebooks/gpt2-sentiment-control.ipynb` | Fine-tune GPT2 to generate movie reviews with controlled sentiment. |
|
||||
| `scripts/gpt2-sentiment.py` | Same as the notebook, but easier to use to use in mutli-GPU setup. |
|
||||
| `scripts/t5-sentiment.py` | Same as GPT2 script, but for a Seq2Seq model (T5). |
|
||||

## Installation

```bash
pip install trl
# optional: wandb
pip install wandb
```

Note: if you don't want to log with `wandb`, remove `log_with="wandb"` in the scripts/notebooks. You can also replace it with your favourite experiment tracker that's [supported by `accelerate`](https://huggingface.co/docs/accelerate/usage_guides/tracking).

## Launch scripts

The `trl` library is powered by `accelerate`. As such, it is best to configure and launch trainings with the following commands:

```bash
accelerate config # will prompt you to define the training configuration
accelerate launch scripts/gpt2-sentiment.py # launches training
```

# Summarization Example

The script in this example shows how to train a reward model for summarization, following the OpenAI Learning to Summarize from Human Feedback [paper](https://arxiv.org/abs/2009.01325). We've validated that the script can be used to train a small GPT2 to get slightly over 60% validation accuracy, which is aligned with results from the paper. The model is [here](https://huggingface.co/Tristan/gpt2_reward_summarization). The underlying pairwise loss is sketched after the table below.

Here's an overview of the files:

| File | Description |
|---|---|
| `scripts/reward_summarization.py` | For tuning the reward model. |
| `scripts/ds3_reward_summarization_example_config.json` | Can be used with the reward model script to scale it up to arbitrarily big models that don't fit on a single GPU. |
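
The training objective behind this script is the standard pairwise comparison loss used for reward modeling in this line of work: the reward assigned to the human-preferred summary should exceed the reward of the rejected one. A minimal sketch (function name and tensor values are illustrative):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Maximize the margin between the preferred and the rejected summary:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

loss = pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.7, 0.5]))
```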
## Installation

```bash
pip install trl
pip install evaluate
# optional: deepspeed
pip install deepspeed
```

```bash
# If you want your reward model to follow the Learning to Summarize from Human Feedback paper closely,
# tune a GPT model on summarization and then instantiate the reward model with it. In other words,
# pass in the name of your summarization-finetuned gpt on the hub, instead of the name of the
# pretrained gpt2 as we do in the following examples of how to run this script.

# Example of running this script with the small size gpt2 on a 40GB A100 (A100s support bf16). Here, the global batch size will be 64:
python -m torch.distributed.launch --nproc_per_node=1 reward_summarization.py --bf16

# Example of running this script with the xl size gpt2 on 16 40GB A100s. Here, the global batch size will still be 64:
python -m torch.distributed.launch --nproc_per_node=16 reward_summarization.py --per_device_train_batch_size=1 --per_device_eval_batch_size=1 --gradient_accumulation_steps=4 --gpt_model_name=gpt2-xl --bf16 --deepspeed=ds3_reward_summarization_example_config.json
```

Please check out https://huggingface.co/docs/trl/example_overview for documentation on our examples.
20 examples/accelerate_configs/deepspeed_zero1.yaml Normal file
@@ -0,0 +1,20 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
21 examples/accelerate_configs/deepspeed_zero2.yaml Normal file
@@ -0,0 +1,21 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
22 examples/accelerate_configs/deepspeed_zero3.yaml Normal file
@@ -0,0 +1,22 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
25 examples/accelerate_configs/fsdp_qlora.yaml Normal file
@@ -0,0 +1,25 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
16 examples/accelerate_configs/multi_gpu.yaml Normal file
@@ -0,0 +1,16 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
16 examples/accelerate_configs/single_gpu.yaml Normal file
@@ -0,0 +1,16 @@
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: "NO"
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
20 examples/cli_configs/example_config.yaml Normal file
@@ -0,0 +1,20 @@
# This is an example configuration file for the TRL CLI. You can use it for
# SFT like this: `trl sft --config config.yaml --output_dir test-sft`.
# The YAML file supports environment variables by adding an `env` field,
# as below:

# env:
#   CUDA_VISIBLE_DEVICES: 0

model_name_or_path:
  trl-internal-testing/tiny-random-LlamaForCausalLM
dataset_name:
  stanfordnlp/imdb
dataset_text_field:
  text
report_to:
  none
learning_rate:
  0.0001
lr_scheduler_type:
  cosine
96 examples/datasets/hh-rlhf-helpful-base.py Normal file
@@ -0,0 +1,96 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from dataclasses import dataclass
from typing import Dict, List, Optional

from datasets import load_dataset
from transformers import HfArgumentParser


@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/hh-rlhf-helpful-base"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = False
    repo_id: str = "trl-lib/hh-rlhf-helpful-base"
    dataset_num_proc: Optional[int] = None


def common_start(str1: str, str2: str) -> str:
    # Zip the two strings and iterate over them together
    common_chars = []
    for c1, c2 in zip(str1, str2):
        if c1 == c2:
            common_chars.append(c1)
        else:
            break
    # Join the common characters and return as a string
    return "".join(common_chars)


def extract_dialogue(example: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    # Extract the prompt, which corresponds to the common start of the chosen and rejected dialogues
    prompt_text = common_start(example["chosen"], example["rejected"])

    # The chosen and rejected completions may themselves share a common start, so make sure the
    # prompt ends right after the last generation prompt ("\n\nAssistant: ")
    if not prompt_text.endswith("\n\nAssistant: "):
        prompt_text = prompt_text[: prompt_text.rfind("\n\nAssistant: ")] + "\n\nAssistant: "

    # Extract the chosen and rejected lines
    chosen_line = example["chosen"][len(prompt_text) :]
    rejected_line = example["rejected"][len(prompt_text) :]

    # Remove the generation prompt ("\n\nAssistant: ") from the prompt
    prompt_text = prompt_text[: -len("\n\nAssistant: ")]

    # Split the string at every occurrence of "Human: " or "Assistant: "
    prompt_lines = re.split(r"(\n\nAssistant: |\n\nHuman: )", prompt_text)

    # Remove the first element as it's empty
    prompt_lines = prompt_lines[1:]

    prompt = []
    for idx in range(0, len(prompt_lines), 2):
        role = "user" if prompt_lines[idx] == "\n\nHuman: " else "assistant"
        content = prompt_lines[idx + 1]
        prompt.append({"role": role, "content": content})

    # Wrap the chosen and rejected completions as single assistant turns
    chosen = [{"role": "assistant", "content": chosen_line}]
    rejected = [{"role": "assistant", "content": rejected_line}]

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")
    dataset = dataset.map(extract_dialogue, num_proc=args.dataset_num_proc)

    if args.push_to_hub:
        dataset.push_to_hub(args.repo_id)
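
To make the transformation concrete, here is a hypothetical toy input run through the `extract_dialogue` function above (the strings are invented, not taken from the real dataset):

```python
# Hypothetical toy example (not from the real Anthropic/hh-rlhf data):
example = {
    "chosen": "\n\nHuman: Hi\n\nAssistant: Hello!",
    "rejected": "\n\nHuman: Hi\n\nAssistant: Go away.",
}
print(extract_dialogue(example))
# {'prompt': [{'role': 'user', 'content': 'Hi'}],
#  'chosen': [{'role': 'assistant', 'content': 'Hello!'}],
#  'rejected': [{'role': 'assistant', 'content': 'Go away.'}]}
```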
81 examples/datasets/lm-human-preferences-descriptiveness.py Normal file
@@ -0,0 +1,81 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset
from transformers import AutoTokenizer, HfArgumentParser


@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/lm-human-preferences-descriptiveness"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = False
    repo_id: str = "trl-lib/lm-human-preferences-descriptiveness"
    dataset_num_proc: Optional[int] = None


# Edge cases handling: remove the cases where all samples are the same
def samples_not_all_same(example):
    return not all(example["sample0"] == example[f"sample{j}"] for j in range(1, 4))


def to_prompt_completion(example, tokenizer):
    prompt = tokenizer.decode(example["query"]).strip()
    best_idx = example["best"]
    chosen = tokenizer.decode(example[f"sample{best_idx}"])
    for rejected_idx in range(4):  # take the first rejected sample that is different from the chosen one
        rejected = tokenizer.decode(example[f"sample{rejected_idx}"])
        if chosen != rejected:
            break
    assert chosen != rejected
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset(
        "json",
        data_files="https://openaipublic.blob.core.windows.net/lm-human-preferences/labels/descriptiveness/offline_5k.json",
        split="train",
    )

    dataset = dataset.filter(samples_not_all_same, num_proc=args.dataset_num_proc)

    dataset = dataset.map(
        to_prompt_completion,
        num_proc=args.dataset_num_proc,
        remove_columns=["query", "sample0", "sample1", "sample2", "sample3", "best"],
        fn_kwargs={"tokenizer": AutoTokenizer.from_pretrained("gpt2")},
    )

    # train_size taken from https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/launch.py#L79
    dataset = dataset.train_test_split(train_size=4992)

    if args.push_to_hub:
        dataset.push_to_hub(args.repo_id)
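
As a quick sanity check of the `samples_not_all_same` filter above, a hypothetical toy illustration (the sample values are invented):

```python
# Hypothetical toy inputs: an example is kept only if at least one of its
# four samples differs from sample0.
assert samples_not_all_same({"sample0": [1], "sample1": [2], "sample2": [1], "sample3": [1]})
assert not samples_not_all_same({f"sample{i}": [1] for i in range(4)})
```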
74 examples/datasets/lm-human-preferences-sentiment.py Normal file
@@ -0,0 +1,74 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset
from transformers import AutoTokenizer, HfArgumentParser


@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/lm-human-preferences-sentiment"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = False
    repo_id: str = "trl-lib/lm-human-preferences-sentiment"
    dataset_num_proc: Optional[int] = None


def to_prompt_completion(example, tokenizer):
    prompt = tokenizer.decode(example["query"]).strip()
    best_idx = example["best"]
    chosen = tokenizer.decode(example[f"sample{best_idx}"])
    for rejected_idx in range(4):  # take the first rejected sample that is different from the chosen one
        rejected = tokenizer.decode(example[f"sample{rejected_idx}"])
        if chosen != rejected:
            break
    assert chosen != rejected
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset(
        "json",
        data_files="https://openaipublic.blob.core.windows.net/lm-human-preferences/labels/sentiment/offline_5k.json",
        split="train",
    )

    dataset = dataset.map(
        to_prompt_completion,
        num_proc=args.dataset_num_proc,
        remove_columns=["query", "sample0", "sample1", "sample2", "sample3", "best"],
        fn_kwargs={"tokenizer": AutoTokenizer.from_pretrained("gpt2")},
    )

    # train_size taken from https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/launch.py#L70
    dataset = dataset.train_test_split(train_size=4992)

    if args.push_to_hub:
        dataset.push_to_hub(args.repo_id)
67 examples/datasets/tldr.py Normal file
@@ -0,0 +1,67 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset
from transformers import HfArgumentParser


@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/tldr"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = False
    repo_id: str = "trl-lib/tldr"
    dataset_num_proc: Optional[int] = None


def to_prompt_completion(example):
    tldr_format_str = "SUBREDDIT: r/{subreddit}\n\nTITLE: {title}\n\nPOST: {post}\n\nTL;DR:"
    prompt = tldr_format_str.format(subreddit=example["subreddit"], title=example["title"], post=example["post"])
    completion = " " + example["summary"]  # Add a space to separate the prompt from the completion
    return {"prompt": prompt, "completion": completion}


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    # Filtered reddit TL;DR dataset from https://github.com/openai/summarize-from-feedback?tab=readme-ov-file#reddit-tldr-dataset
    data_files = {
        "train": "https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/train.jsonl",
        "validation": "https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/valid.jsonl",
        "test": "https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/test.jsonl",
    }
    dataset = load_dataset("json", data_files=data_files)

    dataset = dataset.map(
        to_prompt_completion,
        num_proc=args.dataset_num_proc,
        remove_columns=["id", "subreddit", "title", "post", "summary"],
    )

    if args.push_to_hub:
        dataset.push_to_hub(args.repo_id)
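
To show the resulting format, a hypothetical toy example passed through the `to_prompt_completion` function above (field values invented):

```python
example = {"subreddit": "AskReddit", "title": "A title", "post": "A post.", "summary": "A summary."}
print(to_prompt_completion(example))
# {'prompt': 'SUBREDDIT: r/AskReddit\n\nTITLE: A title\n\nPOST: A post.\n\nTL;DR:',
#  'completion': ' A summary.'}
```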
72 examples/datasets/tldr_preference.py Normal file
@@ -0,0 +1,72 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset
from transformers import HfArgumentParser


@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/tldr-preference"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = False
    repo_id: str = "trl-lib/tldr-preference"
    dataset_num_proc: Optional[int] = None


def to_preference(example):
    info = example["info"]
    if example["batch"] in ["batch0_cnndm", "cnndm0", "cnndm2"]:  # CNN Daily Mail batches
        article = info["article"].replace("\n\n", "\n")
        prompt = f"TITLE: {info['title']}\n\n{article}\n\nTL;DR:"
    elif example["batch"] in [f"batch{i}" for i in range(3, 23)] + ["edit_b2_eval_test"]:  # Reddit batches
        post = info["post"].replace("\n\n", "\n")
        prompt = f"SUBREDDIT: r/{info['subreddit']}\n\nTITLE: {info['title']}\n\nPOST: {post}\n\nTL;DR:"
    else:
        raise ValueError(f"Unknown batch: {example['batch']}")

    chosen_idx = example["choice"]
    rejected_idx = 1 - chosen_idx
    chosen = example["summaries"][chosen_idx]["text"]
    rejected = example["summaries"][rejected_idx]["text"]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset("openai/summarize_from_feedback", "comparisons")

    dataset = dataset.map(
        to_preference,
        num_proc=args.dataset_num_proc,
        remove_columns=["info", "summaries", "choice", "worker", "batch", "split", "extra"],
    )

    if args.push_to_hub:
        dataset.push_to_hub(args.repo_id)
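
A hypothetical toy illustration of the `to_preference` function above, showing how `"choice"` selects the preferred summary (all values invented):

```python
example = {
    "info": {"subreddit": "AskReddit", "title": "A title", "post": "A post."},
    "batch": "batch3",  # one of the Reddit batches
    "summaries": [{"text": "First summary."}, {"text": "Second summary."}],
    "choice": 1,
}
print(to_preference(example))
# {'prompt': 'SUBREDDIT: r/AskReddit\n\nTITLE: A title\n\nPOST: A post.\n\nTL;DR:',
#  'chosen': 'Second summary.', 'rejected': 'First summary.'}
```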
54 examples/datasets/tokenize_ds.py Normal file
@@ -0,0 +1,54 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass, field
from typing import Optional

from datasets import load_dataset
from transformers import AutoTokenizer, HfArgumentParser

from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE


"""
python -i examples/datasets/tokenize_ds.py --model HuggingFaceH4/zephyr-7b-beta
python -i examples/datasets/tokenize_ds.py --model gpt2
"""


@dataclass
class ScriptArguments:
    dataset_name: str = field(
        default="trl-internal-testing/hh-rlhf-helpful-base-trl-style", metadata={"help": "The dataset to load"}
    )
    model: str = field(default="gpt2", metadata={"help": "The model to use for tokenization"})
    dataset_num_proc: Optional[int] = field(
        default=None, metadata={"help": "The number of workers to use to tokenize the data"}
    )


if __name__ == "__main__":
    args = HfArgumentParser(ScriptArguments).parse_args_into_dataclasses()[0]
    dataset = load_dataset(args.dataset_name)
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    if tokenizer.chat_template is None:
        tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE

    def process(row):
        row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
        row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
        return row

    dataset = dataset.map(process, num_proc=args.dataset_num_proc)
    print(dataset["train"][0]["chosen"])
68 examples/datasets/ultrafeedback-prompt.py Normal file
@@ -0,0 +1,68 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset
from transformers import HfArgumentParser


@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/ultrafeedback-prompt"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = False
    repo_id: str = "trl-lib/ultrafeedback-prompt"
    dataset_num_proc: Optional[int] = None


def to_unpaired_preference(example):
    prompt = [{"role": "user", "content": example["instruction"]}]
    return {"prompt": prompt}


def drop_long_prompt(example):
    # Keep only examples whose user prompt is at most 512 characters long
    return len(example["prompt"][0]["content"]) <= 512


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset("openbmb/UltraFeedback", split="train")

    dataset = dataset.map(
        to_unpaired_preference,
        remove_columns=["source", "instruction", "models", "completions", "correct_answers", "incorrect_answers"],
        num_proc=args.dataset_num_proc,
    )
    dataset = dataset.filter(drop_long_prompt)
    dataset = dataset.train_test_split(test_size=0.05, seed=42)

    if args.push_to_hub:
        dataset.push_to_hub(args.repo_id)
100 examples/datasets/ultrafeedback.py Normal file
@@ -0,0 +1,100 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset
from transformers import HfArgumentParser


@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        model_name (`str`, *optional*, defaults to `"gpt-3.5-turbo"`):
            Language model to target. Possible values are:

            - `"alpaca-7b"`
            - `"bard"`
            - `"falcon-40b-instruct"`
            - `"gpt-3.5-turbo"` (default)
            - `"gpt-4"`
            - `"llama-2-13b-chat"`
            - `"llama-2-70b-chat"`
            - `"llama-2-7b-chat"`
            - `"mpt-30b-chat"`
            - `"pythia-12b"`
            - `"starchat"`
            - `"ultralm-13b"`
            - `"ultralm-65b"`
            - `"vicuna-33b"`
            - `"wizardlm-13b"`
            - `"wizardlm-70b"`
            - `"wizardlm-7b"`

        aspect (`str`, *optional*, defaults to `"helpfulness"`):
            Aspect to target. Possible values are:

            - `"helpfulness"` (default)
            - `"honesty"`
            - `"instruction-following"`
            - `"truthfulness"`

        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`):
            Number of workers to use for dataset processing.
    """

    model_name: str = "gpt-3.5-turbo"
    aspect: str = "helpfulness"
    push_to_hub: bool = False
    repo_id: str = "trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness"
    dataset_num_proc: Optional[int] = None


def to_unpaired_preference(example, model_name, aspect):
    prompt = [{"role": "user", "content": example["instruction"]}]
    model_index = example["models"].index(model_name)
    response_content = example["completions"][model_index]["response"]
    completion = [{"role": "assistant", "content": response_content}]
    score = int(example["completions"][model_index]["annotations"][aspect]["Rating"])
    label = score >= 5  # binarize the aspect rating into a boolean preference label
    return {"prompt": prompt, "completion": completion, "label": label}


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset("openbmb/UltraFeedback", split="train")

    dataset = dataset.filter(
        lambda example: args.model_name in example["models"], batched=False, num_proc=args.dataset_num_proc
    )
    dataset = dataset.map(
        to_unpaired_preference,
        remove_columns=["source", "instruction", "models", "completions", "correct_answers", "incorrect_answers"],
        fn_kwargs={"model_name": args.model_name, "aspect": args.aspect},
        num_proc=args.dataset_num_proc,
    )
    dataset = dataset.train_test_split(test_size=0.05, seed=42)

    if args.push_to_hub:
        dataset.push_to_hub(args.repo_id)
583
examples/datasets/zen.py
Normal file
583
examples/datasets/zen.py
Normal file
@ -0,0 +1,583 @@
|
||||
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from dataclasses import dataclass
|
||||
|
||||
from datasets import Dataset
|
||||
from transformers import HfArgumentParser
|
||||
|
||||
|
||||
@dataclass
|
||||
class ScriptArguments:
|
||||
r"""
|
||||
Arguments for the script.
|
||||
|
||||
Args:
|
||||
test_size (`float`, *optional*, defaults to `0.1`):
|
||||
Fraction of the dataset to include in the test split.
|
||||
push_to_hub (`bool`, *optional*, defaults to `False`):
|
||||
Whether to push the dataset to the Hugging Face Hub.
|
||||
repo_id (`str`, *optional*, defaults to `"trl-lib/zen"`):
|
||||
Hugging Face repository ID to push the dataset to.
|
||||
"""
|
||||
|
||||
test_size: float = 0.1
|
||||
push_to_hub: bool = False
|
||||
repo_id: str = "trl-lib/zen"
|
||||
|
||||
|
||||
def main(test_size, push_to_hub, repo_id):
|
||||
# fmt: off
|
||||
standard_language_modeling_dataset = Dataset.from_dict({
|
||||
"text": [
|
||||
"Beautiful is better than ugly.",
|
||||
"Explicit is better than implicit.",
|
||||
"Simple is better than complex.",
|
||||
"Complex is better than complicated.",
|
||||
"Flat is better than nested.",
|
||||
"Sparse is better than dense.",
|
||||
"Readability counts.",
|
||||
"Special cases aren't special enough to break the rules.",
|
||||
"Although practicality beats purity.",
|
||||
"Errors should never pass silently.",
|
||||
"Unless explicitly silenced.",
|
||||
"In the face of ambiguity, refuse the temptation to guess.",
|
||||
"There should be one-- and preferably only one --obvious way to do it.",
|
||||
"Although that way may not be obvious at first unless you're Dutch.",
|
||||
"Now is better than never.",
|
||||
"Although never is often better than *right* now.",
|
||||
"If the implementation is hard to explain, it's a bad idea.",
|
||||
"If the implementation is easy to explain, it may be a good idea.",
|
||||
"Namespaces are one honking great idea -- let's do more of those!",
|
||||
],
|
||||
})
|
||||
standard_language_modeling_dataset = standard_language_modeling_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
standard_language_modeling_dataset.push_to_hub(repo_id, config_name="standard_language_modeling")
|
||||
|
||||
standard_prompt_only_dataset = Dataset.from_dict({
|
||||
"prompt": [
|
||||
"Beautiful is better than",
|
||||
"Explicit is",
|
||||
"Simple is better",
|
||||
"Complex",
|
||||
"Flat is better than",
|
||||
"Sparse is better",
|
||||
"Readability",
|
||||
"Special cases aren't special",
|
||||
"Although practicality beats",
|
||||
"Errors should never",
|
||||
"Unless explicitly",
|
||||
"In the face of ambiguity, refuse",
|
||||
"There should be one-- and preferably",
|
||||
"Although that way may not be obvious at first unless you're",
|
||||
"Now is",
|
||||
"Although never is often",
|
||||
"If the implementation is hard to explain,",
|
||||
"If the implementation is easy",
|
||||
"Namespaces are one honking great",
|
||||
],
|
||||
})
|
||||
standard_prompt_only_dataset = standard_prompt_only_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
standard_prompt_only_dataset.push_to_hub(repo_id, config_name="standard_prompt_only")
|
||||
|
||||
standard_prompt_completion_dataset = Dataset.from_dict({
|
||||
"prompt": [
|
||||
"Beautiful is better than",
|
||||
"Explicit is",
|
||||
"Simple is better",
|
||||
"Complex",
|
||||
"Flat is better than",
|
||||
"Sparse is better",
|
||||
"Readability",
|
||||
"Special cases aren't special",
|
||||
"Although practicality beats",
|
||||
"Errors should never",
|
||||
"Unless explicitly",
|
||||
"In the face of ambiguity, refuse",
|
||||
"There should be one-- and preferably",
|
||||
"Although that way may not be obvious at first unless you're",
|
||||
"Now is",
|
||||
"Although never is often",
|
||||
"If the implementation is hard to explain,",
|
||||
"If the implementation is easy",
|
||||
"Namespaces are one honking great",
|
||||
],
|
||||
"completion": [
|
||||
" ugly.",
|
||||
" better than implicit.",
|
||||
" than complex.",
|
||||
" is better than complicated.",
|
||||
" nested.",
|
||||
" than dense.",
|
||||
" counts.",
|
||||
" enough to break the rules.",
|
||||
" purity.",
|
||||
" pass silently.",
|
||||
" silenced.",
|
||||
" the temptation to guess.",
|
||||
" only one --obvious way to do it.",
|
||||
" Dutch.",
|
||||
" better than never.",
|
||||
" better than *right* now.",
|
||||
" it's a bad idea.",
|
||||
" to explain, it may be a good idea.",
|
||||
" idea -- let's do more of those!",
|
||||
],
|
||||
})
|
||||
standard_prompt_completion_dataset = standard_prompt_completion_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
standard_prompt_completion_dataset.push_to_hub(repo_id, config_name="standard_prompt_completion")
|
||||
|
||||
standard_preference_dataset = Dataset.from_dict({
|
||||
"prompt": [
|
||||
"Beautiful is better than",
|
||||
"Explicit is",
|
||||
"Simple is better",
|
||||
"Complex",
|
||||
"Flat is better than",
|
||||
"Sparse is better",
|
||||
"Readability",
|
||||
"Special cases aren't special",
|
||||
"Although practicality beats",
|
||||
"Errors should never",
|
||||
"Unless explicitly",
|
||||
"In the face of ambiguity, refuse",
|
||||
"There should be one-- and preferably",
|
||||
"Although that way may not be obvious at first unless you're",
|
||||
"Now is",
|
||||
"Although never is often",
|
||||
"If the implementation is hard to explain,",
|
||||
"If the implementation is easy",
|
||||
"Namespaces are one honking great",
|
||||
],
|
||||
"chosen": [
|
||||
" ugly.",
|
||||
" better than implicit.",
|
||||
" than complex.",
|
||||
" is better than complicated.",
|
||||
" nested.",
|
||||
" than dense.",
|
||||
" counts.",
|
||||
" enough to break the rules.",
|
||||
" purity.",
|
||||
" pass silently.",
|
||||
" silenced.",
|
||||
" the temptation to guess.",
|
||||
" only one --obvious way to do it.",
|
||||
" Dutch.",
|
||||
" better than never.",
|
||||
" better than *right* now.",
|
||||
" it's a bad idea.",
|
||||
" to explain, it may be a good idea.",
|
||||
" idea -- let's do more of those!",
|
||||
],
|
||||
"rejected": [
|
||||
" the moon.",
|
||||
" worse than nothing.",
|
||||
" than a long vacation.",
|
||||
" is always the answer.",
|
||||
" chocolate.",
|
||||
" without any context.",
|
||||
" is optional.",
|
||||
" enough to become unicorns.",
|
||||
" reality.",
|
||||
" pass their driving test.",
|
||||
" forgotten.",
|
||||
" the opportunity to laugh.",
|
||||
" two or more confusing methods.",
|
||||
" a time traveler.",
|
||||
" never better.",
|
||||
" not even a possibility.",
|
||||
" it's clearly the best choice.",
|
||||
" it's probably magic.",
|
||||
" watermelon -- let's plant some!",
|
||||
],
|
||||
})
|
||||
standard_preference_dataset = standard_preference_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
standard_preference_dataset.push_to_hub(repo_id, config_name="standard_preference")
|
||||
|
||||
standard_implicit_prompt_preference_dataset = Dataset.from_dict({
|
||||
"chosen": [
|
||||
"Beautiful is better than ugly.",
|
||||
"Explicit is better than implicit.",
|
||||
"Simple is better than complex.",
|
||||
"Complex is better than complicated.",
|
||||
"Flat is better than nested.",
|
||||
"Sparse is better than dense.",
|
||||
"Readability counts.",
|
||||
"Special cases aren't special enough to break the rules.",
|
||||
"Although practicality beats purity.",
|
||||
"Errors should never pass silently.",
|
||||
"Unless explicitly silenced.",
|
||||
"In the face of ambiguity, refuse the temptation to guess.",
|
||||
"There should be one-- and preferably only one --obvious way to do it.",
|
||||
"Although that way may not be obvious at first unless you're Dutch.",
|
||||
"Now is better than never.",
|
||||
"Although never is often better than *right* now.",
|
||||
"If the implementation is hard to explain, it's a bad idea.",
|
||||
"If the implementation is easy to explain, it may be a good idea.",
|
||||
"Namespaces are one honking great idea -- let's do more of those!",
|
||||
],
|
||||
"rejected": [
|
||||
"Beautiful is better than the moon.",
|
||||
"Explicit is worse than nothing.",
|
||||
"Simple is better than a long vacation.",
|
||||
"Complex is always the answer.",
|
||||
"Flat is better than chocolate.",
|
||||
"Sparse is better without any context.",
|
||||
"Readability is optional.",
|
||||
"Special cases aren't special enough to become unicorns.",
|
||||
"Although practicality beats reality.",
|
||||
"Errors should never pass their driving test.",
|
||||
"Unless explicitly forgotten.",
|
||||
"In the face of ambiguity, refuse the opportunity to laugh.",
|
||||
"There should be one-- and preferably two or more confusing methods.",
|
||||
"Although that way may not be obvious at first unless you're a time traveler.",
|
||||
"Now is never better.",
|
||||
"Although never is often not even a possibility.",
|
||||
"If the implementation is hard to explain, it's clearly the best choice.",
|
||||
"If the implementation is easy it's probably magic.",
|
||||
"Namespaces are one honking great watermelon -- let's plant some!",
|
||||
],
|
||||
})
|
||||
standard_implicit_prompt_preference_dataset = standard_implicit_prompt_preference_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
standard_implicit_prompt_preference_dataset.push_to_hub(repo_id, config_name="standard_implicit_prompt_preference")
|
||||
|
||||
standard_unpaired_preference_dataset = Dataset.from_dict({
|
||||
"prompt": [
|
||||
"Beautiful is better than",
|
||||
"Explicit is",
|
||||
"Simple is better",
|
||||
"Complex",
|
||||
"Flat is better than",
|
||||
"Sparse is better",
|
||||
"Readability",
|
||||
"Special cases aren't special",
|
||||
"Although practicality beats",
|
||||
"Errors should never",
|
||||
"Unless explicitly",
|
||||
"In the face of ambiguity, refuse",
|
||||
"There should be one-- and preferably",
|
||||
"Although that way may not be obvious at first unless you're",
|
||||
"Now is",
|
||||
"Although never is often",
|
||||
"If the implementation is hard to explain,",
|
||||
"If the implementation is easy",
|
||||
"Namespaces are one honking great",
|
||||
],
|
||||
"completion": [
|
||||
" ugly.",
|
||||
" worse than nothing.",
|
||||
" than a long vacation.",
|
||||
" is better than complicated.",
|
||||
" nested.",
|
||||
" without any context.",
|
||||
" counts.",
|
||||
" enough to become unicorns.",
|
||||
" purity.",
|
||||
" pass silently.",
|
||||
" forgotten.",
|
||||
" the temptation to guess.",
|
||||
" only one --obvious way to do it.",
|
||||
" a time traveler.",
|
||||
" better than never.",
|
||||
" not even a possibility.",
|
||||
" it's a bad idea.",
|
||||
" it's probably magic.",
|
||||
" watermelon -- let's plant some!",
|
||||
],
|
||||
"label": [True, False, False, True, True, False, True, False, True, True, False, True, True, False, True, False, True, False, False],
|
||||
})
|
||||
standard_unpaired_preference_dataset = standard_unpaired_preference_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
standard_unpaired_preference_dataset.push_to_hub(repo_id, config_name="standard_unpaired_preference")
|
||||
|
||||
conversational_language_modeling_dataset = Dataset.from_dict({
|
||||
"messages": [
|
||||
[{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."},],
|
||||
[{"role": "user", "content": "What is better than implicit?"}, {"role": "assistant", "content": "Explicit."}],
|
||||
[{"role": "user", "content": "What is better than complex?"}, {"role": "assistant", "content": "Simple."}],
|
||||
[{"role": "user", "content": "What is better than complicated?"}, {"role": "assistant", "content": "Complex."}],
|
||||
[{"role": "user", "content": "What is better than nested?"}, {"role": "assistant", "content": "Flat."}],
|
||||
[{"role": "user", "content": "What is better than dense?"}, {"role": "assistant", "content": "Sparse."}],
|
||||
[{"role": "user", "content": "What counts?"}, {"role": "assistant", "content": "Readability."}],
|
||||
[{"role": "user", "content": "Are special cases enough to break the rules?"}, {"role": "assistant", "content": "No, special cases aren't special enough to break the rules."}],
|
||||
[{"role": "user", "content": "What beats purity?"}, {"role": "assistant", "content": "Practicality."}],
|
||||
[{"role": "user", "content": "What should never pass silently?"}, {"role": "assistant", "content": "Errors."}],
|
||||
[{"role": "user", "content": "When can errors pass silently?"}, {"role": "assistant", "content": "When explicitly silenced."}],
|
||||
[{"role": "user", "content": "What should you do in the face of ambiguity?"}, {"role": "assistant", "content": "Refuse the temptation to guess."}],
|
||||
[{"role": "user", "content": "How many ways should there be to do it?"}, {"role": "assistant", "content": "One, and preferably only one."}],
|
||||
[{"role": "user", "content": "For whom may the way not be obvious at first?"}, {"role": "assistant", "content": "Dutch."}],
|
||||
[{"role": "user", "content": "What is better than never?"}, {"role": "assistant", "content": "Now is better than never."}],
|
||||
[{"role": "user", "content": "Is never better than *right* now?"}, {"role": "assistant", "content": "Yes, often."}],
|
||||
[{"role": "user", "content": "What does it mean if the implementation is hard to explain?"}, {"role": "assistant", "content": "It means it's a bad idea."}],
|
||||
[{"role": "user", "content": "What does it mean if the implementation is easy to explain?"}, {"role": "assistant", "content": "It means it may be a good idea."}],
|
||||
[{"role": "user", "content": "Any great ideas?"}, {"role": "assistant", "content": "Namespaces are one honking great idea."}],
|
||||
],
|
||||
})
|
||||
conversational_language_modeling_dataset = conversational_language_modeling_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
conversational_language_modeling_dataset.push_to_hub(repo_id, config_name="conversational_language_modeling")
|
||||
|
||||
conversational_prompt_only_dataset = Dataset.from_dict({
|
||||
"prompt": [
|
||||
[{"role": "user", "content": "What is better than ugly?"}],
|
||||
[{"role": "user", "content": "What is better than implicit?"}],
|
||||
[{"role": "user", "content": "What is better than complex?"}],
|
||||
[{"role": "user", "content": "What is better than complicated?"}],
|
||||
[{"role": "user", "content": "What is better than nested?"}],
|
||||
[{"role": "user", "content": "What is better than dense?"}],
|
||||
[{"role": "user", "content": "What counts?"}],
|
||||
[{"role": "user", "content": "Are special cases enough to break the rules?"}],
|
||||
[{"role": "user", "content": "What beats purity?"}],
|
||||
[{"role": "user", "content": "What should never pass silently?"}],
|
||||
[{"role": "user", "content": "When can errors pass silently?"}],
|
||||
[{"role": "user", "content": "What should you do in the face of ambiguity?"}],
|
||||
[{"role": "user", "content": "How many ways should there be to do it?"}],
|
||||
[{"role": "user", "content": "For whom may the way not be obvious at first?"}],
|
||||
[{"role": "user", "content": "What is better than never?"}],
|
||||
[{"role": "user", "content": "Is never better than *right* now?"}],
|
||||
[{"role": "user", "content": "What does it mean if the implementation is hard to explain?"}],
|
||||
[{"role": "user", "content": "What does it mean if the implementation is easy to explain?"}],
|
||||
[{"role": "user", "content": "Any great ideas?"}],
|
||||
],
|
||||
})
|
||||
conversational_prompt_only_dataset = conversational_prompt_only_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
conversational_prompt_only_dataset.push_to_hub(repo_id, config_name="conversational_prompt_only")
|
||||
|
||||
conversational_prompt_completion_dataset = Dataset.from_dict({
|
||||
"prompt": [
|
||||
[{"role": "user", "content": "What is better than ugly?"}],
|
||||
[{"role": "user", "content": "What is better than implicit?"}],
|
||||
[{"role": "user", "content": "What is better than complex?"}],
|
||||
[{"role": "user", "content": "What is better than complicated?"}],
|
||||
[{"role": "user", "content": "What is better than nested?"}],
|
||||
[{"role": "user", "content": "What is better than dense?"}],
|
||||
[{"role": "user", "content": "What counts?"}],
|
||||
[{"role": "user", "content": "Are special cases enough to break the rules?"}],
|
||||
[{"role": "user", "content": "What beats purity?"}],
|
||||
[{"role": "user", "content": "What should never pass silently?"}],
|
||||
[{"role": "user", "content": "When can errors pass silently?"}],
|
||||
[{"role": "user", "content": "What should you do in the face of ambiguity?"}],
|
||||
[{"role": "user", "content": "How many ways should there be to do it?"}],
|
||||
[{"role": "user", "content": "For whom may the way not be obvious at first?"}],
|
||||
[{"role": "user", "content": "What is better than never?"}],
|
||||
[{"role": "user", "content": "Is never better than *right* now?"}],
|
||||
[{"role": "user", "content": "What does it mean if the implementation is hard to explain?"}],
|
||||
[{"role": "user", "content": "What does it mean if the implementation is easy to explain?"}],
|
||||
[{"role": "user", "content": "Any great ideas?"}],
|
||||
],
|
||||
"completion": [
|
||||
[{"role": "assistant", "content": "Beautiful."}],
|
||||
[{"role": "assistant", "content": "Explicit."}],
|
||||
[{"role": "assistant", "content": "Simple."}],
|
||||
[{"role": "assistant", "content": "Complex."}],
|
||||
[{"role": "assistant", "content": "Flat."}],
|
||||
[{"role": "assistant", "content": "Sparse."}],
|
||||
[{"role": "assistant", "content": "Readability."}],
|
||||
[{"role": "assistant", "content": "No, special cases aren't special enough to break the rules."}],
|
||||
[{"role": "assistant", "content": "Practicality."}],
|
||||
[{"role": "assistant", "content": "Errors."}],
|
||||
[{"role": "assistant", "content": "When explicitly silenced."}],
|
||||
[{"role": "assistant", "content": "Refuse the temptation to guess."}],
|
||||
[{"role": "assistant", "content": "One, and preferably only one."}],
|
||||
[{"role": "assistant", "content": "Dutch."}],
|
||||
[{"role": "assistant", "content": "Now is better than never."}],
|
||||
[{"role": "assistant", "content": "Yes, often."}],
|
||||
[{"role": "assistant", "content": "It means it's a bad idea."}],
|
||||
[{"role": "assistant", "content": "It means it may be a good idea."}],
|
||||
[{"role": "assistant", "content": "Namespaces are one honking great idea."}],
|
||||
],
|
||||
})
|
||||
conversational_prompt_completion_dataset = conversational_prompt_completion_dataset.train_test_split(test_size=test_size)
|
||||
if push_to_hub:
|
||||
    conversational_prompt_completion_dataset.push_to_hub(repo_id, config_name="conversational_prompt_completion")

    conversational_preference_dataset = Dataset.from_dict({
        "prompt": [
            [{"role": "user", "content": "What is better than ugly?"}],
            [{"role": "user", "content": "What is better than implicit?"}],
            [{"role": "user", "content": "What is better than complex?"}],
            [{"role": "user", "content": "What is better than complicated?"}],
            [{"role": "user", "content": "What is better than nested?"}],
            [{"role": "user", "content": "What is better than dense?"}],
            [{"role": "user", "content": "What counts?"}],
            [{"role": "user", "content": "Are special cases enough to break the rules?"}],
            [{"role": "user", "content": "What beats purity?"}],
            [{"role": "user", "content": "What should never pass silently?"}],
            [{"role": "user", "content": "When can errors pass silently?"}],
            [{"role": "user", "content": "What should you do in the face of ambiguity?"}],
            [{"role": "user", "content": "How many ways should there be to do it?"}],
            [{"role": "user", "content": "For whom may the way not be obvious at first?"}],
            [{"role": "user", "content": "What is better than never?"}],
            [{"role": "user", "content": "Is never better than *right* now?"}],
            [{"role": "user", "content": "What does it mean if the implementation is hard to explain?"}],
            [{"role": "user", "content": "What does it mean if the implementation is easy to explain?"}],
            [{"role": "user", "content": "Any great ideas?"}],
        ],
        "chosen": [
            [{"role": "assistant", "content": "Beautiful."}],
            [{"role": "assistant", "content": "Explicit."}],
            [{"role": "assistant", "content": "Simple."}],
            [{"role": "assistant", "content": "Complex."}],
            [{"role": "assistant", "content": "Flat."}],
            [{"role": "assistant", "content": "Sparse."}],
            [{"role": "assistant", "content": "Readability."}],
            [{"role": "assistant", "content": "No, special cases aren't special enough to break the rules."}],
            [{"role": "assistant", "content": "Practicality."}],
            [{"role": "assistant", "content": "Errors."}],
            [{"role": "assistant", "content": "When explicitly silenced."}],
            [{"role": "assistant", "content": "Refuse the temptation to guess."}],
            [{"role": "assistant", "content": "One, and preferably only one."}],
            [{"role": "assistant", "content": "Dutch."}],
            [{"role": "assistant", "content": "Now is better than never."}],
            [{"role": "assistant", "content": "Yes, often."}],
            [{"role": "assistant", "content": "It means it's a bad idea."}],
            [{"role": "assistant", "content": "It means it may be a good idea."}],
            [{"role": "assistant", "content": "Namespaces are one honking great idea."}],
        ],
        "rejected": [
            [{"role": "assistant", "content": "Acceptable."}],
            [{"role": "assistant", "content": "Explained."}],
            [{"role": "assistant", "content": "Very complex."}],
            [{"role": "assistant", "content": "Very complicated."}],
            [{"role": "assistant", "content": "Circular."}],
            [{"role": "assistant", "content": "Heavy."}],
            [{"role": "assistant", "content": "Looking complicated."}],
            [{"role": "assistant", "content": "Yes, special cases are special enough to break the rules."}],
            [{"role": "assistant", "content": "Nothing."}],
            [{"role": "assistant", "content": "Warnings."}],
            [{"role": "assistant", "content": "Never."}],
            [{"role": "assistant", "content": "Give up."}],
            [{"role": "assistant", "content": "As many as possible."}],
            [{"role": "assistant", "content": "French."}],
            [{"role": "assistant", "content": "Some day."}],
            [{"role": "assistant", "content": "No, never."}],
            [{"role": "assistant", "content": "It means it's a good idea."}],
            [{"role": "assistant", "content": "It means it's a bad idea."}],
            [{"role": "assistant", "content": "Recursion."}],
        ],
    })
    conversational_preference_dataset = conversational_preference_dataset.train_test_split(test_size=test_size)
    if push_to_hub:
        conversational_preference_dataset.push_to_hub(repo_id, config_name="conversational_preference")

    conversational_implicit_prompt_preference_dataset = Dataset.from_dict({
        "chosen": [
            [{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}],
            [{"role": "user", "content": "What is better than implicit?"}, {"role": "assistant", "content": "Explicit."}],
            [{"role": "user", "content": "What is better than complex?"}, {"role": "assistant", "content": "Simple."}],
            [{"role": "user", "content": "What is better than complicated?"}, {"role": "assistant", "content": "Complex."}],
            [{"role": "user", "content": "What is better than nested?"}, {"role": "assistant", "content": "Flat."}],
            [{"role": "user", "content": "What is better than dense?"}, {"role": "assistant", "content": "Sparse."}],
            [{"role": "user", "content": "What counts?"}, {"role": "assistant", "content": "Readability."}],
            [{"role": "user", "content": "Are special cases enough to break the rules?"}, {"role": "assistant", "content": "No, special cases aren't special enough to break the rules."}],
            [{"role": "user", "content": "What beats purity?"}, {"role": "assistant", "content": "Practicality."}],
            [{"role": "user", "content": "What should never pass silently?"}, {"role": "assistant", "content": "Errors."}],
            [{"role": "user", "content": "When can errors pass silently?"}, {"role": "assistant", "content": "When explicitly silenced."}],
            [{"role": "user", "content": "What should you do in the face of ambiguity?"}, {"role": "assistant", "content": "Refuse the temptation to guess."}],
            [{"role": "user", "content": "How many ways should there be to do it?"}, {"role": "assistant", "content": "One, and preferably only one."}],
            [{"role": "user", "content": "For whom may the way not be obvious at first?"}, {"role": "assistant", "content": "Dutch."}],
            [{"role": "user", "content": "What is better than never?"}, {"role": "assistant", "content": "Now is better than never."}],
            [{"role": "user", "content": "Is never better than *right* now?"}, {"role": "assistant", "content": "Yes, often."}],
            [{"role": "user", "content": "What does it mean if the implementation is hard to explain?"}, {"role": "assistant", "content": "It means it's a bad idea."}],
            [{"role": "user", "content": "What does it mean if the implementation is easy to explain?"}, {"role": "assistant", "content": "It means it may be a good idea."}],
            [{"role": "user", "content": "Any great ideas?"}, {"role": "assistant", "content": "Namespaces are one honking great idea."}],
        ],
        "rejected": [
            [{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Acceptable."}],
            [{"role": "user", "content": "What is better than implicit?"}, {"role": "assistant", "content": "Explained."}],
            [{"role": "user", "content": "What is better than complex?"}, {"role": "assistant", "content": "Very complex."}],
            [{"role": "user", "content": "What is better than complicated?"}, {"role": "assistant", "content": "Very complicated."}],
            [{"role": "user", "content": "What is better than nested?"}, {"role": "assistant", "content": "Circular."}],
            [{"role": "user", "content": "What is better than dense?"}, {"role": "assistant", "content": "Heavy."}],
            [{"role": "user", "content": "What counts?"}, {"role": "assistant", "content": "Looking complicated."}],
            [{"role": "user", "content": "Are special cases enough to break the rules?"}, {"role": "assistant", "content": "Yes, special cases are special enough to break the rules."}],
            [{"role": "user", "content": "What beats purity?"}, {"role": "assistant", "content": "Nothing."}],
            [{"role": "user", "content": "What should never pass silently?"}, {"role": "assistant", "content": "Warnings."}],
            [{"role": "user", "content": "When can errors pass silently?"}, {"role": "assistant", "content": "Never."}],
            [{"role": "user", "content": "What should you do in the face of ambiguity?"}, {"role": "assistant", "content": "Give up."}],
            [{"role": "user", "content": "How many ways should there be to do it?"}, {"role": "assistant", "content": "As many as possible."}],
            [{"role": "user", "content": "For whom may the way not be obvious at first?"}, {"role": "assistant", "content": "French."}],
            [{"role": "user", "content": "What is better than never?"}, {"role": "assistant", "content": "Some day."}],
            [{"role": "user", "content": "Is never better than *right* now?"}, {"role": "assistant", "content": "No, never."}],
            [{"role": "user", "content": "What does it mean if the implementation is hard to explain?"}, {"role": "assistant", "content": "It means it's a good idea."}],
            [{"role": "user", "content": "What does it mean if the implementation is easy to explain?"}, {"role": "assistant", "content": "It means it's a bad idea."}],
            [{"role": "user", "content": "Any great ideas?"}, {"role": "assistant", "content": "Recursion."}],
        ],
    })
    conversational_implicit_prompt_preference_dataset = conversational_implicit_prompt_preference_dataset.train_test_split(test_size=test_size)
    if push_to_hub:
        conversational_implicit_prompt_preference_dataset.push_to_hub(repo_id, config_name="conversational_implicit_prompt_preference")

    conversational_unpaired_preference_dataset = Dataset.from_dict({
        "prompt": [
            [{"role": "user", "content": "What is better than ugly?"}],
            [{"role": "user", "content": "What is better than implicit?"}],
            [{"role": "user", "content": "What is better than complex?"}],
            [{"role": "user", "content": "What is better than complicated?"}],
            [{"role": "user", "content": "What is better than nested?"}],
            [{"role": "user", "content": "What is better than dense?"}],
            [{"role": "user", "content": "What counts?"}],
            [{"role": "user", "content": "Are special cases enough to break the rules?"}],
            [{"role": "user", "content": "What beats purity?"}],
            [{"role": "user", "content": "What should never pass silently?"}],
            [{"role": "user", "content": "When can errors pass silently?"}],
            [{"role": "user", "content": "What should you do in the face of ambiguity?"}],
            [{"role": "user", "content": "How many ways should there be to do it?"}],
            [{"role": "user", "content": "For whom may the way not be obvious at first?"}],
            [{"role": "user", "content": "What is better than never?"}],
            [{"role": "user", "content": "Is never better than *right* now?"}],
            [{"role": "user", "content": "What does it mean if the implementation is hard to explain?"}],
            [{"role": "user", "content": "What does it mean if the implementation is easy to explain?"}],
            [{"role": "user", "content": "Any great ideas?"}],
        ],
        "completion": [
            [{"role": "assistant", "content": "Beautiful."}],
            [{"role": "assistant", "content": "Explicit."}],
            [{"role": "assistant", "content": "Simple."}],
            [{"role": "assistant", "content": "Very complicated."}],
            [{"role": "assistant", "content": "Flat."}],
            [{"role": "assistant", "content": "Sparse."}],
            [{"role": "assistant", "content": "Readability."}],
            [{"role": "assistant", "content": "Yes, special cases are special enough to break the rules."}],
            [{"role": "assistant", "content": "Practicality."}],
            [{"role": "assistant", "content": "Warnings."}],
            [{"role": "assistant", "content": "When explicitly silenced."}],
            [{"role": "assistant", "content": "Give up."}],
            [{"role": "assistant", "content": "One, and preferably only one."}],
            [{"role": "assistant", "content": "French."}],
            [{"role": "assistant", "content": "Some day."}],
            [{"role": "assistant", "content": "Yes, often."}],
            [{"role": "assistant", "content": "It means it's a bad idea."}],
            [{"role": "assistant", "content": "It means it may be a good idea."}],
            [{"role": "assistant", "content": "Namespaces are one honking great idea."}],
        ],
        # label is True when the completion is a preferred answer, False when it is a rejected one
        "label": [True, True, True, False, True, True, True, False, True, False, True, False, True, False, False, True, True, True, True],
    })
    conversational_unpaired_preference_dataset = conversational_unpaired_preference_dataset.train_test_split(test_size=test_size)
    if push_to_hub:
        conversational_unpaired_preference_dataset.push_to_hub(repo_id, config_name="conversational_unpaired_preference")
    # fmt: on


if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    args = parser.parse_args_into_dataclasses()[0]
    main(args.test_size, args.push_to_hub, args.repo_id)
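As a quick sanity check of what this script publishes, here is a minimal sketch that loads one of the pushed configs back from the Hub. The repo id `my-org/zen-datasets` is a placeholder for whatever `repo_id` was passed to the script, not a real dataset name:

```python
# Sketch only: assumes the script above was run with repo_id="my-org/zen-datasets" (placeholder).
from datasets import load_dataset

dataset = load_dataset("my-org/zen-datasets", "conversational_preference")
example = dataset["train"][0]
print(example["prompt"])    # e.g. [{"role": "user", "content": "What is better than ugly?"}]
print(example["chosen"])    # the preferred assistant reply
print(example["rejected"])  # the dispreferred assistant reply
```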
54  examples/hello_world.py  Normal file
@@ -0,0 +1,54 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 0. imports
import torch
from transformers import GPT2Tokenizer

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer


# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# 2. initialize trainer
ppo_config = {"mini_batch_size": 1, "batch_size": 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# 3. encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt").to(model.pretrained_model.device)

# 4. generate model response
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 20,
}
response_tensor = ppo_trainer.generate(list(query_tensor), return_prompt=False, **generation_kwargs)
response_txt = tokenizer.decode(response_tensor[0])

# 5. define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0, device=model.pretrained_model.device)]

# 6. train model with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
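The constant reward in step 5 is only a placeholder, as the comment there says. A hedged sketch of what a model-based reward could look like instead, scoring the response with a sentiment classifier; the checkpoint `lvwerra/distilbert-imdb` and the `top_k=None` pipeline call are assumptions about your environment, not part of this example script:

```python
# Hypothetical variant of step 5: use a sentiment classifier's "POSITIVE"
# score as the scalar reward instead of a constant.
from transformers import pipeline

sentiment_pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")
pipe_outputs = sentiment_pipe(query_txt + response_txt, top_k=None)  # scores for all labels
positive_score = next(o["score"] for o in pipe_outputs if o["label"] == "POSITIVE")
reward = [torch.tensor(positive_score, device=model.pretrained_model.device)]
```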
7  examples/notebooks/README.md  Normal file
@@ -0,0 +1,7 @@
# Notebooks

This directory contains a collection of Jupyter notebooks that demonstrate how to use the TRL library in different applications.

- [`best_of_n.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/best_of_n.ipynb): This notebook demonstrates how to use the "Best of N" sampling strategy with TRL when fine-tuning your model with PPO (a minimal sketch of the idea follows below).
- [`gpt2-sentiment.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment.ipynb): This notebook demonstrates how to reproduce the GPT-2 IMDB sentiment tuning example in a Jupyter notebook.
- [`gpt2-control.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/gpt2-sentiment-control.ipynb): This notebook demonstrates how to reproduce the GPT-2 sentiment control example in a Jupyter notebook.
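Since `best_of_n.ipynb` is only linked above, a minimal self-contained sketch of the Best-of-N idea may help: sample several completions per prompt and keep the one a scorer ranks highest. The scorer here is a deliberately trivial stand-in, not TRL's API; in practice it would be a reward model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def best_of_n(prompt: str, n: int = 4, max_new_tokens: int = 20) -> str:
    """Sample n completions and return the one the scorer likes best."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    # Stand-in scorer: prefer the longest completion. A real setup would score
    # each candidate with a reward model (e.g. a sentiment classifier).
    return max(candidates, key=len)


print(best_of_n("This morning I went to the "))
```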