Method comparison: LoRA that targets MLP modules (#2845)

The "LoRA Without Regret" blog post (https://thinkingmachines.ai/blog/lora/) mentions that targeting the MLP part of the transformer is more effective than targeting the attention modules. This experiment tests this by targeting ["gate_proj", "up_proj", "down_proj"] instead of the default layers (["q_proj", "v_proj"]).

I chose a rank to match the parameter count we would get when targeting the attention modules with rank 32, which is rank 10. Testing on my machine, there is indeed a nice improvement in the test score:

| metric               | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy        | 48.2%            | 51.3%      |
| # trainable params   | 9175040          | 9461760    |
| peak memory reserved | 20.74 GB         | 23.02 GB   |

There is, however, also a marked increase in memory usage, despite matching parameter count. Since the operations are different, this may not be a surprise, but let's wait for the final verdict once this experiment runs on our AWS instance.

Note: I also tested higher and lower ranks when targeting the MLP. The effect on memory usage was negligible, but it did improve the score:

| metric             | rank 8  | rank 10 | rank 12  | rank 32  |
|--------------------|---------|---------|----------|----------|
| test accuracy      | 50.3%   | 51.3%   | 52.2%    | 54.8%    |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |

In the end, I chose only to add the rank 10 experiment to match the number of trainable parameters.
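The rank-matching arithmetic in the message above can be sketched as follows. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `rank * (in_features + out_features)` trainable parameters. The model dimensions below are an assumption, since the commit message does not name the base model; they correspond to a Llama-3.2-3B-style architecture and happen to reproduce every parameter count reported in the tables. The helper name `lora_param_count` is hypothetical and only used for illustration.

```python
# Assumed model dimensions (NOT stated in the commit message); chosen because
# they reproduce the reported trainable-parameter counts exactly.
HIDDEN = 3072        # hidden size
INTERMEDIATE = 8192  # MLP intermediate size
KV_DIM = 1024        # output size of v_proj under grouped-query attention
NUM_LAYERS = 28      # number of transformer blocks

# (in_features, out_features) of each linear layer that LoRA may target.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, KV_DIM),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

ATTENTION = ["q_proj", "v_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]


def lora_param_count(target_modules, rank):
    """Trainable LoRA parameters: rank * (in + out) per targeted layer."""
    per_block = sum(rank * (SHAPES[name][0] + SHAPES[name][1]) for name in target_modules)
    return NUM_LAYERS * per_block


print("attention, rank 32:", lora_param_count(ATTENTION, 32))
for rank in (8, 10, 12, 32):
    print(f"MLP, rank {rank}:", lora_param_count(MLP, rank))
```

Under these assumptions, rank 10 on the MLP projections is the closest integer rank to the attention baseline at rank 32. In PEFT, the MLP variant would correspond to something like `LoraConfig(r=10, target_modules=["gate_proj", "up_proj", "down_proj"])`; the remaining settings of the experiment are not shown in the commit message.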
ENH Add RWKV default target modules (#2810)
2025-10-20 15:33:48 +08:00 · 2025-10-16 17:37:02 +02:00 · 2025-10-16 16:30:51 +02:00 · 2025-10-16 14:59:09 +02:00 · 2025-10-15 16:29:32 +02:00 · 2025-10-15 12:07:51 +02:00
617 changed files with 180203 additions and 18299 deletions
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -15,38 +15,22 @@ body:
    attributes:
      label: Who can help?
      description: |
-        Your issue will be replied to more quickly if you can figure out the right person to tag with @
+        Your issue will be replied to more quickly if you can figure out the right person to tag with @.
        If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
-        
+
        All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and
        a core maintainer will ping the right person.
-        
+
        Please tag fewer than 3 people.
-        
-        Library: @pacman100 @younesbelkada @sayakpaul
-        
-        Documentation: @stevhliu and @MKhalusova
+
+        Library: @benjaminbossan @githubnemo
+
+        diffusers integration: @benjaminbossan @sayakpaul
+
+        Documentation: @stevhliu

      placeholder: "@Username ..."

-  - type: checkboxes
-    id: information-scripts-examples
-    attributes:
-      label: Information
-      description: 'The problem arises when using:'
-      options:
-        - label: "The official example scripts"
-        - label: "My own modified scripts"
-
-  - type: checkboxes
-    id: information-tasks
-    attributes:
-      label: Tasks
-      description: "The tasks I am working on are:"
-      options:
-        - label: "An officially supported task in the `examples` folder"
-        - label: "My own task or dataset (give details below)"
-
  - type: textarea
    id: reproduction
    validations:
@@ -55,12 +39,11 @@ body:
      label: Reproduction
      description: |
        Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
-        Please provide the simplest reproducer as possible so that we can quickly fix the issue. 
+        Please provide the simplest reproducer as possible so that we can quickly fix the issue. When you paste
+        the error message, please include the full traceback.

      placeholder: |
-        Reproducer: 
-        
-          
+        Reproducer:

  - type: textarea
    id: expected-behavior
--- a/.github/ISSUE_TEMPLATE/feature-request.yml
+++ b/.github/ISSUE_TEMPLATE/feature-request.yml
@@ -11,15 +11,6 @@ body:
      description: |
        A clear and concise description of the feature proposal. Please provide a link to the paper and code in case they exist.

-  - type: textarea
-    id: motivation
-    validations:
-      required: true
-    attributes:
-      label: Motivation
-      description: |
-        Please outline the motivation for the proposal. Is your feature request related to a problem? 
-
  - type: textarea
    id: contribution
    validations:
@@ -27,4 +18,4 @@ body:
    attributes:
      label: Your contribution
      description: |
-        Is there any way that you could help, e.g. by submitting a PR? 
+        Is there any way that you could help, e.g. by submitting a PR?
--- a/.github/workflows/build_docker_images.yml
+++ b/.github/workflows/build_docker_images.yml
@@ -10,65 +10,141 @@ concurrency:
  group: docker-image-builds
  cancel-in-progress: false

+permissions: {}
+
+env:
+  CI_SLACK_CHANNEL: ${{ secrets.CI_DOCKER_CHANNEL }}
+
 jobs:
  latest-cpu:
    name: "Latest Peft CPU [dev]"
-    runs-on: ubuntu-latest
+    runs-on:
+      group: aws-general-8-plus
    steps:
-      - name: Cleanup disk
-        run: |
-          sudo ls -l /usr/local/lib/
-          sudo ls -l /usr/share/
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
-          sudo rm -rf /usr/local/lib/android
-          sudo rm -rf /usr/share/dotnet
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v1
+        uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2  # v3.10.0
      - name: Check out code
-        uses: actions/checkout@v3
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
      - name: Login to DockerHub
-        uses: docker/login-action@v2
+        uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772  # v3.4.0
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and Push CPU
-        uses: docker/build-push-action@v4
+        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1  # v6.16.0
        with:
          context: ./docker/peft-cpu
          push: true
          tags: huggingface/peft-cpu

+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
+          title: 🤗 Results of the PEFT-CPU docker build
+          status: ${{ job.status }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
  latest-cuda:
    name: "Latest Peft GPU [dev]"
-    runs-on: ubuntu-latest
+    runs-on:
+      group: aws-general-8-plus
    steps:
-      - name: Cleanup disk
-        run: |
-          sudo ls -l /usr/local/lib/
-          sudo ls -l /usr/share/
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
-          sudo rm -rf /usr/local/lib/android
-          sudo rm -rf /usr/share/dotnet
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v1
+        uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2  # v3.10.0
      - name: Check out code
-        uses: actions/checkout@v3
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
      - name: Login to DockerHub
-        uses: docker/login-action@v1
+        uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772  # v3.4.0
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and Push GPU
-        uses: docker/build-push-action@v4
+        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1  # v6.16.0
        with:
          context: ./docker/peft-gpu
          push: true
          tags: huggingface/peft-gpu
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
+          title: 🤗 Results of the PEFT-GPU docker build
+          status: ${{ job.status }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+  latest-cuda-bnb-source:
+    name: "Latest Peft GPU + bnb source [dev]"
+    runs-on:
+      group: aws-general-8-plus
+    steps:
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2  # v3.10.0
+      - name: Check out code
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Login to DockerHub
+        uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772  # v3.4.0
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_PASSWORD }}
+
+      - name: Build and Push GPU
+        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1  # v6.16.0
+        with:
+          context: ./docker/peft-gpu-bnb-source
+          push: true
+          tags: huggingface/peft-gpu-bnb-source
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
+          title: 🤗 Results of the PEFT-GPU (bnb source / HF latest) docker build
+          status: ${{ job.status }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+  latest-cuda-bnb-source-latest:
+    name: "Latest Peft GPU + bnb source [accelerate / peft / transformers latest]"
+    runs-on:
+      group: aws-general-8-plus
+    steps:
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2  # v3.10.0
+      - name: Check out code
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Login to DockerHub
+        uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772  # v3.4.0
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_PASSWORD }}
+
+      - name: Build and Push GPU
+        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1  # v6.16.0
+        with:
+          context: ./docker/peft-gpu-bnb-latest
+          push: true
+          tags: huggingface/peft-gpu-bnb-latest
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
+          title: 🤗 Results of the PEFT-GPU (bnb source / HF source) docker build
+          status: ${{ job.status }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
--- a/.github/workflows/build_documentation.yml
+++ b/.github/workflows/build_documentation.yml
@@ -7,13 +7,16 @@ on:
      - doc-builder*
      - v*-release

+permissions: {}
+
 jobs:
   build:
-    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@ba4b74d11c46d884a4cf6497687c090f55f027d9  # main from 2025-09-05
    with:
      commit_sha: ${{ github.sha }}
      package: peft
      notebook_folder: peft_docs
+      custom_container: huggingface/transformers-doc-builder
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
-      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
--- a/.github/workflows/build_pr_documentation.yml
+++ b/.github/workflows/build_pr_documentation.yml
@@ -7,10 +7,13 @@ concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

+permissions: {}
+
 jobs:
  build:
-    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
+    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@ba4b74d11c46d884a4cf6497687c090f55f027d9  # main from 2025-09-05
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: peft
+      custom_container: huggingface/transformers-doc-builder
--- a/.github/workflows/delete_doc_comment.yml
+++ b/.github/workflows/delete_doc_comment.yml
@@ -1,14 +0,0 @@
-name: Delete doc comment
-
-on:
-  workflow_run:
-    workflows: ["Delete doc comment trigger"]
-    types:
-      - completed
-
-
-jobs:
-  delete:
-    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
-    secrets:
-      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
--- a/.github/workflows/delete_doc_comment_trigger.yml
+++ b/.github/workflows/delete_doc_comment_trigger.yml
@@ -1,12 +0,0 @@
-name: Delete doc comment trigger
-
-on:
-  pull_request:
-    types: [ closed ]
-
-
-jobs:
-  delete:
-    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
-    with:
-      pr_number: ${{ github.event.number }}
--- a/.github/workflows/deploy_method_comparison_app.yml
+++ b/.github/workflows/deploy_method_comparison_app.yml
@@ -0,0 +1,41 @@
+name: Deploy "method_comparison" Gradio to Spaces
+
+on:
+  push:
+    branches: [ main ]
+    paths:
+      - "method_comparison/**"
+  workflow_dispatch:
+
+permissions: {}
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          fetch-depth: 0  # full history needed for subtree
+          persist-credentials: false
+
+      - name: Authenticate via ~/.netrc
+        env:
+          HF_TOKEN: ${{ secrets.PEFT_INTERNAL_REPO_READ_WRITE }}
+        run: |
+          # netrc needs BOTH login and password entries
+          printf "machine huggingface.co\nlogin hf\npassword ${HF_TOKEN}\n" >> ~/.netrc
+          chmod 600 ~/.netrc
+
+      - name: Deploy method_comparison app to HF Spaces
+        run: |
+          cd method_comparison
+          git init
+          # Spaces expect requirements.txt
+          mv requirements-app.txt requirements.txt
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+          git remote add gradio-app https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison
+          git add .
+          git commit -m "🚀 Deploy method comparison app from GH action"
+          git push -f gradio-app HEAD:main
--- a/.github/workflows/integrations_tests.yml
+++ b/.github/workflows/integrations_tests.yml
@@ -7,6 +7,8 @@ on:
        description: 'Branch to test on'
        required: true

+permissions: {}
+
 jobs:
  run_transformers_integration_tests:
    strategy:
@@ -15,20 +17,21 @@ jobs:
        transformers-version: ['main', 'latest']
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
        with:
          ref: ${{ github.event.inputs.branch }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
+          persist-credentials: false
      - name: Set up Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c  # v6.0.0
        with:
          python-version: "3.10"
          cache: "pip"
          cache-dependency-path: "setup.py"
      - name: print environment variables
        run: |
-          echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
-          echo "env.CI_SHA = ${{ env.CI_SHA }}"
+          echo "env.CI_BRANCH = ${CI_BRANCH}"
+          echo "env.CI_SHA = ${CI_SHA}"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
@@ -51,25 +54,26 @@ jobs:
        diffusers-version: ['main']
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
        with:
          ref: ${{ github.event.inputs.branch }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
+          persist-credentials: false
      - name: Set up Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c  # v6.0.0
        with:
          python-version: "3.10"
          cache: "pip"
          cache-dependency-path: "setup.py"
      - name: print environment variables
        run: |
-          echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
-          echo "env.CI_SHA = ${{ env.CI_SHA }}"
+          echo "env.CI_BRANCH = ${CI_BRANCH}"
+          echo "env.CI_SHA = ${CI_SHA}"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install .[test]
-          
+
          if [ "${{ matrix.diffusers-version }}" == "main" ]; then
              pip install -U git+https://github.com/huggingface/diffusers.git
          else
--- a/.github/workflows/nightly-bnb.yml
+++ b/.github/workflows/nightly-bnb.yml
@@ -0,0 +1,249 @@
+name: BNB from source self-hosted runner with slow tests (scheduled)
+
+on:
+  workflow_dispatch:
+  schedule:
+    - cron: "0 2 * * *"
+
+env:
+  RUN_SLOW: "yes"
+  IS_GITHUB_CI: "1"
+  # To be able to run tests on CUDA 12.2
+  NVIDIA_DISABLE_REQUIRE: "1"
+  SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+permissions: {}
+
+jobs:
+  run_all_tests_single_gpu:
+    timeout-minutes: 60
+    strategy:
+      fail-fast: false
+      matrix:
+          docker-image-name: ["huggingface/peft-gpu-bnb-source:latest", "huggingface/peft-gpu-bnb-latest:latest"]
+    runs-on:
+      group: aws-g6-4xlarge-plus
+    env:
+      CUDA_VISIBLE_DEVICES: "0"
+      TEST_TYPE: "single_gpu_${{ matrix.docker-image-name }}"
+    container:
+      image: ${{ matrix.docker-image-name }}
+      options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+    defaults:
+      run:
+        shell: bash
+    steps:
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Pip install
+        run: |
+          source activate peft
+          pip install -e . --no-deps
+          pip install pytest-reportlog pytest-cov parameterized datasets scipy einops
+          pip install "pytest>=7.2.0,<8.0.0" # see: https://github.com/huggingface/transformers/blob/ce4fff0be7f6464d713f7ac3e0bbaafbc6959ae5/setup.py#L148C6-L148C26
+          mkdir transformers-clone && git clone https://github.com/huggingface/transformers.git transformers-clone # rename to transformers clone to avoid modules conflict
+          if [ "${{ matrix.docker-image-name }}" == "huggingface/peft-gpu-bnb-latest:latest" ]; then
+            cd transformers-clone
+            transformers_version=$(pip show transformers | grep '^Version:' | cut -d ' ' -f2 | sed 's/\.dev0//')
+            echo "Checking out tag for Transformers version: v$transformers_version"
+            git fetch --tags
+            git checkout tags/v$transformers_version
+            cd ..
+          fi
+
+      - name: Test bnb import
+        id: import
+        if: always()
+        run: |
+          source activate peft
+          python3 -m bitsandbytes
+          python3 -c "import bitsandbytes as bnb"
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes import
+          status: ${{ steps.import.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Run examples on single GPU
+        id: examples_tests
+        if: always()
+        run: |
+          source activate peft
+          make tests_examples_single_gpu_bnb
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes examples tests - single GPU
+          status: ${{ steps.examples_tests.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Run core tests on single GPU
+        id: core_tests
+        if: always()
+        run: |
+          source activate peft
+          make tests_core_single_gpu_bnb
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes core tests - single GPU
+          status: ${{ steps.core_tests.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      # TODO: this is a test to see if BNB multi-backend single-GPU tests succeed w/o regression tests
+      # - name: Run BNB regression tests on single GPU
+      #   id: regression_tests
+      #   if: always()
+      #   run: |
+      #     source activate peft
+      #     make tests_gpu_bnb_regression
+
+      # - name: Post to Slack
+      #   if: always()
+      #   uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+      #   with:
+      #     slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+      #     title: 🤗 Results of bitsandbytes regression tests - single GPU
+      #     status: ${{ steps.regression_tests.outcome }}
+      #     slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Run transformers tests on single GPU
+        id: transformers_tests
+        if: always()
+        run: |
+          source activate peft
+          make transformers_tests
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes transformers tests - single GPU
+          status: ${{ steps.transformers_tests.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Generate Report
+        if: always()
+        run: |
+          pip install slack_sdk tabulate
+          python scripts/log_reports.py --slack_channel_name bnb-daily-ci-collab >> $GITHUB_STEP_SUMMARY
+
+  run_all_tests_multi_gpu:
+    timeout-minutes: 60
+    strategy:
+      fail-fast: false
+      matrix:
+        docker-image-name: ["huggingface/peft-gpu-bnb-source:latest", "huggingface/peft-gpu-bnb-latest:latest"]
+    runs-on:
+      group: aws-g6-12xlarge-plus
+    env:
+      CUDA_VISIBLE_DEVICES: "0,1"
+      TEST_TYPE: "multi_gpu_${{ matrix.docker-image-name }}"
+    container:
+      image: ${{ matrix.docker-image-name }}
+      options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+    defaults:
+      run:
+        shell: bash
+    steps:
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Pip install
+        run: |
+          source activate peft
+          pip install -e . --no-deps
+          pip install pytest-reportlog pytest-cov parameterized datasets scipy einops
+          pip install "pytest>=7.2.0,<8.0.0" # see: https://github.com/huggingface/transformers/blob/ce4fff0be7f6464d713f7ac3e0bbaafbc6959ae5/setup.py#L148C6-L148C26
+          mkdir transformers-clone && git clone https://github.com/huggingface/transformers.git transformers-clone
+          if [ "${{ matrix.docker-image-name }}" == "huggingface/peft-gpu-bnb-latest:latest" ]; then
+            cd transformers-clone
+            transformers_version=$(pip show transformers | grep '^Version:' | cut -d ' ' -f2 | sed 's/\.dev0//')
+            echo "Checking out tag for Transformers version: v$transformers_version"
+            git fetch --tags
+            git checkout tags/v$transformers_version
+            cd ..
+          fi
+
+      - name: Test bnb import
+        id: import
+        if: always()
+        run: |
+          source activate peft
+          python3 -m bitsandbytes
+          python3 -c "import bitsandbytes as bnb"
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes import
+          status: ${{ steps.import.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Run examples on multi GPU
+        id: examples_tests
+        if: always()
+        run: |
+          source activate peft
+          make tests_examples_multi_gpu_bnb
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes examples tests - multi GPU
+          status: ${{ steps.examples_tests.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Run core tests on multi GPU
+        id: core_tests
+        if: always()
+        run: |
+          source activate peft
+          make tests_core_multi_gpu_bnb
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes core tests - multi GPU
+          status: ${{ steps.core_tests.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Run transformers tests on multi GPU
+        id: transformers_tests
+        if: always()
+        run: |
+          source activate peft
+          make transformers_tests
+
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from Feb 2025-02-24
+        with:
+          slack_channel: ${{ secrets.BNB_SLACK_CHANNEL_ID }}
+          title: 🤗 Results of bitsandbytes transformers tests - multi GPU
+          status: ${{ steps.transformers_tests.outcome }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+
+      - name: Generate Report
+        if: always()
+        run: |
+          pip install slack_sdk tabulate
+          python scripts/log_reports.py --slack_channel_name bnb-daily-ci-collab >> $GITHUB_STEP_SUMMARY
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -10,12 +10,16 @@ env:
  IS_GITHUB_CI: "1"
  # To be able to run tests on CUDA 12.2
  NVIDIA_DISABLE_REQUIRE: "1"
-  SLACK_API_TOKEN: ${{ secrets.SLACK_API_TOKEN }}
+  SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}

+permissions: {}

 jobs:
  run_all_tests_single_gpu:
-    runs-on: [self-hosted, docker-gpu, multi-gpu]
+    strategy:
+      fail-fast: false
+    runs-on:
+      group: aws-g6-4xlarge-plus
    env:
      CUDA_VISIBLE_DEVICES: "0"
      TEST_TYPE: "single_gpu"
@@ -24,17 +28,17 @@ jobs:
      options: --gpus all --shm-size "16gb" -e NVIDIA_DISABLE_REQUIRE=true
    defaults:
      run:
-        working-directory: peft/
        shell: bash
    steps:
-      - name: Update clone & pip install
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Pip install
        run: |
          source activate peft
-          git config --global --add safe.directory '*'
-          git fetch && git checkout ${{ github.sha }} 
          pip install -e . --no-deps
          pip install pytest-reportlog
-      
+
      - name: Run common tests on single GPU
        run: |
          source activate peft
@@ -44,12 +48,17 @@ jobs:
        run: |
          source activate peft
          make tests_examples_single_gpu
-      
+
      - name: Run core tests on single GPU
        run: |
          source activate peft
          make tests_core_single_gpu
-          
+
+      - name: Run regression tests on single GPU
+        run: |
+          source activate peft
+          make tests_regression
+
      - name: Generate Report
        if: always()
        run: |
@@ -57,7 +66,10 @@ jobs:
          python scripts/log_reports.py >> $GITHUB_STEP_SUMMARY

  run_all_tests_multi_gpu:
-    runs-on: [self-hosted, docker-gpu, multi-gpu]
+    strategy:
+      fail-fast: false
+    runs-on:
+      group: aws-g6-12xlarge-plus
    env:
      CUDA_VISIBLE_DEVICES: "0,1"
      TEST_TYPE: "multi_gpu"
@@ -66,36 +78,36 @@ jobs:
      options: --gpus all --shm-size "16gb" -e NVIDIA_DISABLE_REQUIRE=true
    defaults:
      run:
-        working-directory: peft/
        shell: bash
    steps:
-      - name: Update clone
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Pip install
        run: |
          source activate peft
-          git config --global --add safe.directory '*'
-          git fetch && git checkout ${{ github.sha }}
          pip install -e . --no-deps
          pip install pytest-reportlog

      - name: Run core GPU tests on multi-gpu
        run: |
          source activate peft
-          
+
      - name: Run common tests on multi GPU
        run: |
          source activate peft
          make tests_common_gpu
-        
+
      - name: Run examples on multi GPU
        run: |
          source activate peft
          make tests_examples_multi_gpu
-      
+
      - name: Run core tests on multi GPU
        run: |
          source activate peft
          make tests_core_multi_gpu
-          
+
      - name: Generate Report
        if: always()
        run: |
--- a/.github/workflows/stale.yml
+++ b/.github/workflows/stale.yml
@@ -4,24 +4,31 @@ on:
  schedule:
    - cron: "0 15 * * *"

+permissions: {}
+
 jobs:
  close_stale_issues:
    name: Close Stale Issues
    if: github.repository == 'huggingface/peft'
    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+      pull-requests: write
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+      with:
+        persist-credentials: false

    - name: Setup Python
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c  # v6.0.0
      with:
-        python-version: 3.8
+        python-version: 3.11

    - name: Install requirements
      run: |
        pip install PyGithub
    - name: Close stale issues
      run: |
-        python scripts/stale.py
+        python scripts/stale.py
--- a/.github/workflows/test-docker-build.yml
+++ b/.github/workflows/test-docker-build.yml
@ -0,0 +1,66 @@
+name: Test Docker images (on PR)
+
+on:
+  pull_request:
+    paths:
+      # Run only when Dockerfile files are modified
+      - "docker/*/Dockerfile"
+
+permissions: {}
+
+jobs:
+  get_changed_files:
+    name: "Build all modified docker images"
+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    steps:
+      - name: Check out code
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Get changed files
+        id: changed-files
+        uses: tj-actions/changed-files@1c8e6069583811afb28f97afeaf8e7da80c6be5c #v42
+        with:
+          files: docker/*/Dockerfile
+          json: "true"
+      - name: Run step if only the files listed above change
+        if: steps.changed-files.outputs.any_changed == 'true'
+        id: set-matrix
+        env:
+          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
+        run: |
+          echo "matrix=${ALL_CHANGED_FILES}" >> $GITHUB_OUTPUT
+  build_modified_files:
+    needs: get_changed_files
+    name: Build Docker images on modified files
+    runs-on: ubuntu-latest
+    if: ${{ needs.get_changed_files.outputs.matrix != '[]' }}
+    strategy:
+      fail-fast: false
+      matrix:
+        docker-file: ${{ fromJson(needs.get_changed_files.outputs.matrix) }}
+    steps:
+      - name: Cleanup disk
+        run: |
+          sudo ls -l /usr/local/lib/
+          sudo ls -l /usr/share/
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
+          sudo rm -rf /usr/local/lib/android
+          sudo rm -rf /usr/share/dotnet
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2  # v3.10.0
+      - name: Check out code
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Build Docker image
+        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1  # v6.16.0
+        with:
+          file: ${{ matrix.docker-file }}
+          context: .
+          push: False
--- a/.github/workflows/tests-main.yml
+++ b/.github/workflows/tests-main.yml
@ -0,0 +1,43 @@
+name: tests on transformers main
+
+on:
+  push:
+    branches: [main]
+    paths-ignore:
+        - 'docs/**'
+
+permissions: {}
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Set up Python 3.11
+        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c  # v6.0.0
+        with:
+          python-version: 3.11
+          cache: "pip"
+          cache-dependency-path: "setup.py"
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          # cpu version of pytorch
+          pip install -U git+https://github.com/huggingface/transformers.git
+          pip install -e .[test]
+      - name: Test with pytest
+        env:
+          TRANSFORMERS_IS_CI: 1
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: |
+          make test
+      - name: Post to Slack
+        if: always()
+        uses: huggingface/hf-workflows/.github/actions/post-slack@3f88d63d3761558a32e8e46fc2a8536e04bb2aea  # main from 2025-02-24
+        with:
+          slack_channel: ${{ secrets.SLACK_CHANNEL_ID }}
+          title: 🤗 Results of transformers main tests
+          status: ${{ job.status }}
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@ -3,17 +3,28 @@ name: tests
 on:
  push:
    branches: [main]
+    paths-ignore:
+      - 'docs/**'
  pull_request:
+    paths-ignore:
+      - 'docs/**'
+
+env:
+  HF_HOME: .cache/huggingface
+
+permissions: {}

 jobs:
  check_code_quality:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python
-        uses: actions/setup-python@v4
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
        with:
-          python-version: "3.8"
+          persist-credentials: false
+      - name: Set up Python
+        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c  # v6.0.0
+        with:
+          python-version: "3.11"
          cache: "pip"
          cache-dependency-path: "setup.py"
      - name: Install dependencies
@ -27,14 +38,39 @@ jobs:
  tests:
    needs: check_code_quality
    strategy:
+      fail-fast: false
      matrix:
-        python-version: ["3.8", "3.9", "3.10"]
-        os: ["ubuntu-latest", "macos-latest", "windows-latest"]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
+        os: ["ubuntu-latest", "macos-13", "windows-latest"]
+        exclude:
+          - os: macos-13
+            python-version: "3.13"
    runs-on: ${{ matrix.os }}
    steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Model cache
+        uses: actions/cache/restore@0400d5f644dc74513175e3cd8d07132dd4860809  # v4.2.4
+        with:
+          # Avoid caching HF_HOME/modules and Python cache files to prevent interoperability
+          # issues and potential cache poisoning. We also avoid lock files to prevent runs
+          # avoiding re-download because they see a lock file.
+          path: |
+            ${{ env.HF_HOME }}/hub/**
+            !${{ env.HF_HOME }}/**/*.pyc
+          key: model-cache-${{ github.run_id }}
+          restore-keys: model-cache-
+          enableCrossOsArchive: true
+      - name: Dump cache content
+        # TODO: remove this step after 2025-02-15
+        if: matrix.os != 'windows-latest'
+        run: |
+          SHASUM=sha256sum
+          [ -f "$(which shasum)" ] && SHASUM=shasum
+          find "${{ env.HF_HOME }}/hub" -type f -exec "$SHASUM" {} \; > cache_content_initial || true
      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c  # v6.0.0
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"
@ -42,8 +78,56 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
+          pip install setuptools
          # cpu version of pytorch
          pip install -e .[test]
      - name: Test with pytest
+        # macOS tests are currently too flaky and will fail almost every time. Thus, continue (green checkmark) even if
+        # they fail, but add a notice so that the failure is not completely silent
+        continue-on-error: ${{ matrix.os == 'macos-13' }}
+        shell: bash
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+          TRANSFORMERS_IS_CI: 1
        run: |
+          set +e
          make test
+          status=$?
+          # Post a notice only if this is macOS AND tests failed
+          if [ "$status" -ne 0 ] && [ "${{ matrix.os }}" = "macos-13" ]; then
+            {
+              echo "## ⚠️ macOS tests failed"
+              echo ""
+              echo "- OS: ${{ matrix.os }}"
+              echo "- Python: ${{ matrix.python-version }}"
+              echo ""
+              echo "Check the logs from this step for details."
+            } >> "$GITHUB_STEP_SUMMARY"
+          fi
+          # Return the real status. On macOS this won't fail the job because of continue-on-error.
+          exit $status
+      - name: Dump cache content and diff
+        # This is just debug info so that we can monitor if the model cache diverges substantially
+        # over time and what the diverging model is.
+        # TODO: remove after 2025-02-15
+        if: matrix.os != 'windows-latest'
+        run: |
+          SHASUM=sha256sum
+          [ -f "$(which shasum)" ] && SHASUM=shasum
+          find "${{ env.HF_HOME }}/hub" -type f -exec "$SHASUM" {} \; > cache_content_after || true
+          diff -udp cache_content_initial cache_content_after || true
+      - name: Delete old model cache entries
+        run: |
+          # make sure that cache cleaning doesn't break the pipeline
+          python scripts/ci_clean_cache.py -d || true
+      - name: Update model cache
+        uses: actions/cache/save@0400d5f644dc74513175e3cd8d07132dd4860809  # v4.2.4
+        # Only let one runner (preferably the one that covers most tests) update the model cache
+        # after *every* run. This way we make sure that our cache is never outdated and we don't
+        # have to keep track of hashes.
+        if: always() && matrix.os == 'ubuntu-latest' && matrix.python-version == '3.10'
+        with:
+          path: |
+            ${{ env.HF_HOME }}/hub/**
+            !${{ env.HF_HOME }}/**/*.pyc
+          key: model-cache-${{ github.run_id }}
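Reviewer sketch (not part of the patch): the two "Dump cache content" steps above snapshot a checksum per file under `$HF_HOME/hub` before and after the test run and print a unified diff of the two listings. The same idea in Python could look roughly like this. The function names `snapshot` and `cache_diff` are hypothetical, and SHA-256 is assumed here, whereas the workflow uses whichever of `sha256sum` or `shasum` is available on the runner.

```python
import difflib
import hashlib
from pathlib import Path


def snapshot(root: Path) -> list[str]:
    """Return one 'digest  relative/path' line per file under root, sorted by path."""
    if not root.is_dir():
        # Mirrors the '|| true' in the workflow: a missing cache is not an error.
        return []
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return lines


def cache_diff(before: list[str], after: list[str]) -> list[str]:
    """Unified diff of two snapshots; an empty list means the cache did not change."""
    return list(
        difflib.unified_diff(
            before, after, "cache_content_initial", "cache_content_after", lineterm=""
        )
    )
```

Calling `snapshot` once before the tests and once after, then passing both results to `cache_diff`, gives the same kind of debug output as the `diff -udp` call: added lines for newly downloaded files and changed digests for files that were rewritten.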
--- a/.github/workflows/torch_compile_tests.yml
+++ b/.github/workflows/torch_compile_tests.yml
@ -1,7 +1,5 @@
 name: torch compile tests

-# see peft/tests/__init__.py
-
 on:
  workflow_dispatch:
    inputs:
@ -13,31 +11,46 @@ on:
        required: false
        default: false

+env:
+  RUN_SLOW: "yes"
+  IS_GITHUB_CI: "1"
+  # To be able to run tests on CUDA 12.2
+  NVIDIA_DISABLE_REQUIRE: "1"
+
+permissions: {}
+
 jobs:
  run_tests_with_compile:
-    runs-on: ubuntu-latest
+    runs-on:
+      group: aws-g6-4xlarge-plus
    env:
      PEFT_DEBUG_WITH_TORCH_COMPILE: 1
+      CUDA_VISIBLE_DEVICES: "0"
+      TEST_TYPE: "single_gpu_huggingface/peft-gpu-bnb-latest:latest"
+      USE_PYTORCH_NIGHTLY: "${{ github.event.inputs.pytorch_nightly }}"
+    container:
+      image: "huggingface/peft-gpu-bnb-latest:latest"
+      options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+    defaults:
+      run:
+        shell: bash
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
        with:
          ref: ${{ github.event.inputs.branch }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: "3.10"
-          cache: "pip"
-          cache-dependency-path: "setup.py"
-      - name: Install dependencies
+          persist-credentials: false
+      - name: Pip install
        run: |
-          python -m pip install --upgrade pip
-          python -m pip install .[test]
-          if [ "${{ github.event.inputs.pytorch_nightly }}" = "true" ]; then
+          source activate peft
+          pip install -e . --no-deps
+          pip install pytest-cov pytest-reportlog parameterized datasets scipy einops
+          pip install "pytest>=7.2.0,<8.0.0" # see: https://github.com/huggingface/transformers/blob/ce4fff0be7f6464d713f7ac3e0bbaafbc6959ae5/setup.py#L148C6-L148C26
+          if [ "${USE_PYTORCH_NIGHTLY}" = "true" ]; then
            python -m pip install --upgrade --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
          fi
      - name: Test compile with pytest
        run: |
+          source activate peft
          echo "PEFT_DEBUG_WITH_TORCH_COMPILE=$PEFT_DEBUG_WITH_TORCH_COMPILE"
-          git status
-          make test
+          make tests_torch_compile
--- a/.github/workflows/trufflehog.yml
+++ b/.github/workflows/trufflehog.yml
@ -0,0 +1,18 @@
+on:
+  push:
+
+name: Secret Leaks
+
+permissions: {}
+
+jobs:
+  trufflehog:
+    runs-on: ubuntu-latest
+    steps:
+    - name: Checkout code
+      uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+      with:
+        fetch-depth: 0
+        persist-credentials: false
+    - name: Secret Scanning
+      uses: trufflesecurity/trufflehog@0f58ae7c5036094a1e3e750d18772af92821b503  # v3.90.5
--- a/.github/workflows/upload_pr_documentation.yml
+++ b/.github/workflows/upload_pr_documentation.yml
@ -6,11 +6,13 @@ on:
    types:
      - completed

+permissions: {}
+
 jobs:
  build:
-    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
+    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@ba4b74d11c46d884a4cf6497687c090f55f027d9  # main from 2025-09-05
    with:
      package_name: peft
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
-      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
+      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
--- a/.github/workflows/zizmor.yaml
+++ b/.github/workflows/zizmor.yaml
@ -0,0 +1,28 @@
+name: CI security linting
+
+on:
+  push:
+    branches: ["main"]
+  pull_request:
+    branches: ["*"]
+    paths:
+      - '.github/**'
+
+permissions: {}
+
+jobs:
+  zizmor:
+    name: zizmor latest via Cargo
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      security-events: write
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - name: Install zizmor
+        run: cargo install --locked zizmor
+      - name: Run zizmor
+        run: zizmor .github/workflows
--- a/.github/zizmor.yml
+++ b/.github/zizmor.yml
@ -0,0 +1,24 @@
+rules:
+  dangerous-triggers:
+    ignore:
+      # this workflow is only triggered after maintainer approval
+      - upload_pr_documentation.yml:3:1
+  cache-poisoning:
+    ignore:
+      # the docker buildx binary is cached and zizmor warns about a cache poisoning attack.
+      # OTOH this cache would make us more resilient against an intrusion on docker-buildx' side.
+      # There is no obvious benefit so we leave it as it is.
+      - build_docker_images.yml:37:9
+      - build_docker_images.yml:70:9
+      - build_docker_images.yml:103:9
+      - build_docker_images.yml:136:9
+      - build_docker_images.yml:169:9
+  unpinned-images:
+    ignore:
+      # We want to test these images with the latest version and we're not using them
+      # to deploy anything so we deem it safe to use those, even if they are unpinned.
+      - nightly-bnb.yml:30:7
+      - nightly-bnb.yml:155:7
+      - nightly.yml:27:7
+      - nightly.yml:77:7
+      - torch_compile_tests.yml:32:7
--- a/.gitignore
+++ b/.gitignore
@ -138,4 +138,8 @@ dmypy.json
 .DS_Store

 # More test things
-wandb
+wandb
+
+# method_comparison logs
+method_comparison/MetaMathQA/cancelled_results/
+method_comparison/MetaMathQA/temporary_results/
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@ -0,0 +1,13 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.12.8
+    hooks:
+      - id: ruff
+        args:
+          - --fix
+      - id: ruff-format
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
+    hooks:
+      - id: check-merge-conflict
+      - id: check-yaml
--- a/Makefile
+++ b/Makefile
@ -1,19 +1,19 @@
 .PHONY: quality style test docs

-check_dirs := src tests examples docs
+check_dirs := src tests examples docs scripts docker

 # Check that source code meets quality standards

 # this target runs checks on all files
 quality:
-	black --check $(check_dirs)
-	ruff $(check_dirs)
+	ruff check $(check_dirs)
+	ruff format --check $(check_dirs)
 	doc-builder style src/peft tests docs/source --max_len 119 --check_only

 # Format source code automatically and check if there are any problems left that need manual fixing
 style:
-	black $(check_dirs)
-	ruff $(check_dirs) --fix
+	ruff check --fix $(check_dirs)
+	ruff format $(check_dirs)
 	doc-builder style src/peft tests docs/source --max_len 119

 test:
@ -31,6 +31,36 @@ tests_core_multi_gpu:
 tests_core_single_gpu:
 	python -m pytest -m single_gpu_tests tests/test_common_gpu.py $(if $(IS_GITHUB_CI),--report-log "core_single_gpu.log",)

+# exclude gemma tests, as generation fails with torch.compile, these failures
+# trigger side effects that make other tests fail with 'RuntimeError: Offset
+# increment outside graph capture encountered unexpectedly.' 
+# TODO re-enable gemma once/if it is fixed
 tests_common_gpu:
-	python -m pytest tests/test_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_decoder.log",)
+	python -m pytest tests/test_decoder_models.py -k "not gemma" $(if $(IS_GITHUB_CI),--report-log "common_decoder.log",)
 	python -m pytest tests/test_encoder_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_encoder_decoder.log",)
+	python -m pytest tests/test_gptqmodel.py $(if $(IS_GITHUB_CI),--report-log "gptqmodel_gpu.log",)
+
+tests_examples_multi_gpu_bnb:
+	python -m pytest -m "multi_gpu_tests and bitsandbytes" tests/test_gpu_examples.py $(if $(IS_GITHUB_CI),--report-log "multi_gpu_examples.log",)
+
+tests_examples_single_gpu_bnb:
+	python -m pytest -m "single_gpu_tests and bitsandbytes" tests/test_gpu_examples.py $(if $(IS_GITHUB_CI),--report-log "single_gpu_examples.log",)
+
+tests_core_multi_gpu_bnb:
+	python -m pytest -m "multi_gpu_tests and bitsandbytes" tests/test_common_gpu.py $(if $(IS_GITHUB_CI),--report-log "core_multi_gpu.log",)
+
+tests_core_single_gpu_bnb:
+	python -m pytest -m "single_gpu_tests and bitsandbytes" tests/test_common_gpu.py $(if $(IS_GITHUB_CI),--report-log "core_single_gpu.log",)
+
+tests_gpu_bnb_regression:
+	python -m pytest tests/bnb/test_bnb_regression.py $(if $(IS_GITHUB_CI),--report-log "bnb_regression_gpu.log",)
+
+# For testing transformers tests for bnb runners
+transformers_tests:
+	RUN_SLOW=1 python -m pytest transformers-clone/tests/quantization/bnb $(if $(IS_GITHUB_CI),--report-log "transformers_tests.log",)
+
+tests_regression:
+	python -m pytest -s --regression tests/regression/ $(if $(IS_GITHUB_CI),--report-log "regression_tests.log",)
+
+tests_torch_compile:
+	python -m pytest tests/test_torch_compile.py $(if $(IS_GITHUB_CI),--report-log "compile_tests.log",)
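Reviewer sketch (not part of the patch): every test target above follows the same pattern. The `$(if $(IS_GITHUB_CI),--report-log "<name>.log",)` expression adds the `--report-log` flag (provided by the `pytest-reportlog` plugin that the workflows install) only when `IS_GITHUB_CI` expands to a non-empty string, and `tests_common_gpu` additionally deselects gemma tests with `-k "not gemma"`. Expressed in Python, the command assembly could look like this. The helper name `pytest_command` and its parameters are hypothetical and only illustrate the expansion.

```python
import os


def pytest_command(test_path, log_name, keyword=None, marker=None, env=None):
    """Build the argument list that a Makefile test target expands to."""
    env = os.environ if env is None else env
    cmd = ["python", "-m", "pytest"]
    if marker:
        # e.g. "single_gpu_tests and bitsandbytes"
        cmd += ["-m", marker]
    cmd.append(test_path)
    if keyword:
        # e.g. "not gemma" to deselect tests whose name contains "gemma"
        cmd += ["-k", keyword]
    # Make's $(if ...) is true for any non-empty expansion of the variable
    if env.get("IS_GITHUB_CI"):
        cmd += ["--report-log", log_name]
    return cmd
```

Locally, with `IS_GITHUB_CI` unset, the targets therefore run plain pytest without writing a report log. In the workflows that set `IS_GITHUB_CI: "1"`, each target writes its own log file.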
--- a/README.md
+++ b/README.md
@ -19,48 +19,72 @@ limitations under the License.
    <p>State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods</p>
 </h3>

-Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly. In this regard, PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs. Recent State-of-the-Art PEFT techniques achieve performance comparable to that of full fine-tuning. 
+Fine-tuning large pretrained models is often prohibitively costly due to their scale. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters. This significantly decreases the computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models.

-Seamlessly integrated with 🤗 Accelerate for large scale models leveraging DeepSpeed and Big Model Inference. 
+PEFT is integrated with Transformers for easy model training and inference, Diffusers for conveniently managing different adapters, and Accelerate for distributed training and inference for really big models.

-Supported methods:
+> [!TIP]
+> Visit the [PEFT](https://huggingface.co/PEFT) organization to read about the PEFT methods implemented in the library and to see notebooks demonstrating how to apply these methods to a variety of downstream tasks. Click the "Watch repos" button on the organization page to be notified of newly implemented methods and notebooks!

-1. LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/abs/2106.09685)
-2. Prefix Tuning: [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://aclanthology.org/2021.acl-long.353/), [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
-3. P-Tuning: [GPT Understands, Too](https://arxiv.org/abs/2103.10385)
-4. Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691)
-5. AdaLoRA: [Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.10512)  
-6. $(IA)^3$: [Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning](https://arxiv.org/abs/2205.05638)
-7. MultiTask Prompt Tuning: [Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning](https://arxiv.org/abs/2303.02861)
-8. LoHa: [FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning](https://arxiv.org/abs/2108.06098)
-9. LoKr: [KronA: Parameter Efficient Tuning with Kronecker Adapter](https://arxiv.org/abs/2212.10650) based on [Navigating Text-To-Image Customization:From LyCORIS Fine-Tuning to Model Evaluation](https://arxiv.org/abs/2309.14859) implementation
+Check the PEFT Adapters API Reference section for a list of supported PEFT methods, and read the [Adapters](https://huggingface.co/docs/peft/en/conceptual_guides/adapter), [Soft prompts](https://huggingface.co/docs/peft/en/conceptual_guides/prompting), and [IA3](https://huggingface.co/docs/peft/en/conceptual_guides/ia3) conceptual guides to learn more about how these methods work.

-## Getting started
+## Quickstart

-```python
-from transformers import AutoModelForSeq2SeqLM
-from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
-model_name_or_path = "bigscience/mt0-large"
-tokenizer_name_or_path = "bigscience/mt0-large"
+Install PEFT from pip:

-peft_config = LoraConfig(
-    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
-)
-
-model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
-model = get_peft_model(model, peft_config)
-model.print_trainable_parameters()
-# output: trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282
+```bash
+pip install peft
 ```

-## Use Cases
+Prepare a model for training with a PEFT method such as LoRA by wrapping the base model and PEFT configuration with `get_peft_model`. For the Qwen/Qwen2.5-3B-Instruct model, you're only training 0.12% of the parameters!

-### Get comparable performance to full finetuning by adapting LLMs to downstream tasks using consumer hardware
+```python
+import torch
+from transformers import AutoModelForCausalLM
+from peft import LoraConfig, TaskType, get_peft_model

-GPU memory required for adapting LLMs on the few-shot dataset [`ought/raft/twitter_complaints`](https://huggingface.co/datasets/ought/raft/viewer/twitter_complaints). Here, settings considered
-are full finetuning, PEFT-LoRA using plain PyTorch and PEFT-LoRA using DeepSpeed with CPU Offloading. 
+device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
+model_id = "Qwen/Qwen2.5-3B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
+peft_config = LoraConfig(
+    r=16,
+    lora_alpha=32,
+    task_type=TaskType.CAUSAL_LM,
+    # target_modules=["q_proj", "v_proj", ...]  # optionally indicate target modules
+)
+model = get_peft_model(model, peft_config)
+model.print_trainable_parameters()
+# prints: trainable params: 3,686,400 || all params: 3,089,625,088 || trainable%: 0.1193

-Hardware: Single A100 80GB GPU with CPU RAM above 64GB
+# now perform training on your dataset, e.g. using transformers Trainer, then save the model
+model.save_pretrained("qwen2.5-3b-lora")
+```
+
+To load a PEFT model for inference:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+
+device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
+model_id = "Qwen/Qwen2.5-3B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
+model = PeftModel.from_pretrained(model, "qwen2.5-3b-lora")
+
+inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")
+outputs = model.generate(**inputs.to(device), max_new_tokens=50)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+
+# prints something like: Preheat the oven to 350 degrees and place the cookie dough in a baking dish [...]
+```
+
+## Why you should use PEFT
+
+There are many benefits of using PEFT but the main one is the huge savings in compute and storage, making PEFT applicable to many different use cases.
+
+### High performance on consumer hardware
+
+Consider the memory requirements for training the following models on the [ought/raft/twitter_complaints](https://huggingface.co/datasets/ought/raft/viewer/twitter_complaints) dataset with an A100 80GB GPU with more than 64GB of CPU RAM.

 |   Model         | Full Finetuning | PEFT-LoRA PyTorch  | PEFT-LoRA DeepSpeed with CPU Offloading |
 | --------- | ---- | ---- | ---- |
@ -68,9 +92,7 @@ Hardware: Single A100 80GB GPU with CPU RAM above 64GB
 | bigscience/mt0-xxl (12B params) | OOM GPU | 56GB GPU / 3GB CPU | 22GB GPU / 52GB CPU |
 | bigscience/bloomz-7b1 (7B params) | OOM GPU | 32GB GPU / 3.8GB CPU | 18.1GB GPU / 35GB CPU |

-Performance of PEFT-LoRA tuned [`bigscience/T0_3B`](https://huggingface.co/bigscience/T0_3B) on [`ought/raft/twitter_complaints`](https://huggingface.co/datasets/ought/raft/viewer/twitter_complaints) leaderboard. 
-A point to note is that we didn't try to squeeze performance by playing around with input instruction templates, LoRA hyperparams and other training related hyperparams. Also, we didn't use the larger 13B [mt0-xxl](https://huggingface.co/bigscience/mt0-xxl) model.
-So, we are already seeing comparable performance to SoTA with parameter efficient tuning. Also, the final additional checkpoint size is just `19MB` in comparison to `11GB` size of the backbone [`bigscience/T0_3B`](https://huggingface.co/bigscience/T0_3B) model, but one still has to load the original full size model.
+With LoRA you can finetune a 12B parameter model that would've otherwise run out of memory on the 80GB GPU, and comfortably fit and train a 3B parameter model. When you look at the 3B parameter model's performance, it is comparable to a fully finetuned model at a fraction of the GPU memory.

 |   Submission Name        | Accuracy |
 | --------- | ---- |
@ -78,335 +100,88 @@ So, we are already seeing comparable performance to SoTA with parameter efficien
 | Flan-T5 | 0.892 |
 | lora-t0-3b | 0.863 |

-**Therefore, we can see that performance comparable to SoTA is achievable by PEFT methods with consumer hardware such as 16GB and 24GB GPUs.**
+> [!TIP]
+> The bigscience/T0_3B model performance isn't optimized in the table above. You can squeeze even more performance out of it by playing around with the input instruction templates, LoRA hyperparameters, and other training related hyperparameters. The final checkpoint size of this model is just 19MB compared to 11GB of the full bigscience/T0_3B model. Learn more about the advantages of finetuning with PEFT in this [blog post](https://www.philschmid.de/fine-tune-flan-t5-peft).

-An insightful blogpost explaining the advantages of using PEFT for fine-tuning FlanT5-XXL: [https://www.philschmid.de/fine-tune-flan-t5-peft](https://www.philschmid.de/fine-tune-flan-t5-peft)
+### Quantization

-### Parameter Efficient Tuning of Diffusion Models
+Quantization is another method for reducing the memory requirements of a model by representing the data in a lower precision. It can be combined with PEFT methods to make it even easier to train and load LLMs for inference.

-GPU memory required by different settings during training is given below. The final checkpoint size is `8.8 MB`.
+* Learn how to finetune [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) with QLoRA and the [TRL](https://huggingface.co/docs/trl/index) library on a 16GB GPU in the [Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem](https://pytorch.org/blog/finetune-llms/) blog post.
+* Learn how to finetune a [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) model for multilingual automatic speech recognition with LoRA and 8-bit quantization in this [notebook](https://colab.research.google.com/drive/1DOkD_5OUjFa0r5Ik3SgywJLJtEo2qLxO?usp=sharing) (see this [notebook](https://colab.research.google.com/drive/1vhF8yueFqha3Y3CpTHN6q9EVcII9EYzs?usp=sharing) instead for an example of streaming a dataset).

-Hardware: Single A100 80GB GPU with CPU RAM above 64GB
+### Save compute and storage
+
+PEFT can help you save storage by avoiding full finetuning of models on each downstream task or dataset. In many cases, you're only finetuning a very small fraction of a model's parameters and each checkpoint is only a few MBs in size (instead of GBs). These smaller PEFT adapters demonstrate performance comparable to a fully finetuned model. If you have many datasets, you can save a lot of storage with a PEFT model and not have to worry about catastrophic forgetting or overfitting the backbone or base model.
+
+## PEFT integrations
+
+PEFT is widely supported across the Hugging Face ecosystem because of the massive efficiency it brings to training and inference.
+
+### Diffusers
+
+The iterative diffusion process consumes a lot of memory which can make it difficult to train. PEFT can help reduce the memory requirements and reduce the storage size of the final model checkpoint. For example, consider the memory required for training a Stable Diffusion model with LoRA on an A100 80GB GPU with more than 64GB of CPU RAM. The final model checkpoint size is only 8.8MB!

 |   Model         | Full Finetuning | PEFT-LoRA  | PEFT-LoRA with Gradient Checkpointing  |
 | --------- | ---- | ---- | ---- |
 | CompVis/stable-diffusion-v1-4 | 27.5GB GPU / 3.97GB CPU | 15.5GB GPU / 3.84GB CPU | 8.12GB GPU / 3.77GB CPU | 

+> [!TIP]
+> Take a look at the [examples/lora_dreambooth/train_dreambooth.py](examples/lora_dreambooth/train_dreambooth.py) training script to try training your own Stable Diffusion model with LoRA, and play around with the [smangrul/peft-lora-sd-dreambooth](https://huggingface.co/spaces/smangrul/peft-lora-sd-dreambooth) Space which is running on a T4 instance. Learn more about the PEFT integration in Diffusers in this [tutorial](https://huggingface.co/docs/peft/main/en/tutorial/peft_integrations#diffusers).

-**Training**
-An example of using LoRA for parameter efficient dreambooth training is given in [`examples/lora_dreambooth/train_dreambooth.py`](examples/lora_dreambooth/train_dreambooth.py)
+### Transformers

-```bash
-export MODEL_NAME= "CompVis/stable-diffusion-v1-4" #"stabilityai/stable-diffusion-2-1"
-export INSTANCE_DIR="path-to-instance-images"
-export CLASS_DIR="path-to-class-images"
-export OUTPUT_DIR="path-to-save-model"
-
-accelerate launch train_dreambooth.py \
-  --pretrained_model_name_or_path=$MODEL_NAME  \
-  --instance_data_dir=$INSTANCE_DIR \
-  --class_data_dir=$CLASS_DIR \
-  --output_dir=$OUTPUT_DIR \
-  --train_text_encoder \
-  --with_prior_preservation --prior_loss_weight=1.0 \
-  --instance_prompt="a photo of sks dog" \
-  --class_prompt="a photo of dog" \
-  --resolution=512 \
-  --train_batch_size=1 \
-  --lr_scheduler="constant" \
-  --lr_warmup_steps=0 \
-  --num_class_images=200 \
-  --use_lora \
-  --lora_r 16 \
-  --lora_alpha 27 \
-  --lora_text_encoder_r 16 \
-  --lora_text_encoder_alpha 17 \
-  --learning_rate=1e-4 \
-  --gradient_accumulation_steps=1 \
-  --gradient_checkpointing \
-  --max_train_steps=800
-```
-
-Try out the 🤗 Gradio Space which should run seamlessly on a T4 instance:
-[smangrul/peft-lora-sd-dreambooth](https://huggingface.co/spaces/smangrul/peft-lora-sd-dreambooth).
-
-![peft lora dreambooth gradio space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/peft_lora_dreambooth_gradio_space.png)
-
-**NEW** ✨ Multi Adapter support and combining multiple LoRA adapters in a weighted combination 
-![peft lora dreambooth weighted adapter](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/weighted_adapter_dreambooth_lora.png)
-
-**NEW** ✨ Dreambooth training for Stable Diffusion using LoHa and LoKr adapters [`examples/stable_diffusion/train_dreambooth.py`](examples/stable_diffusion/train_dreambooth.py)
-
-### Parameter Efficient Tuning of LLMs for RLHF components such as Ranker and Policy
- Here is an example in [trl](https://github.com/lvwerra/trl) library using PEFT+INT8 for tuning policy model: [gpt2-sentiment_peft.py](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment_peft.py) and corresponding [Blog](https://huggingface.co/blog/trl-peft)
- Example using PEFT for Instruction finetuning, reward model and policy : [stack_llama](https://github.com/lvwerra/trl/tree/main/examples/research_projects/stack_llama/scripts) and corresponding [Blog](https://huggingface.co/blog/stackllama) 
-
-### INT8 training of large models in Colab using PEFT LoRA and bits_and_bytes
-
- Here is now a demo on how to fine tune [OPT-6.7b](https://huggingface.co/facebook/opt-6.7b) (14GB in fp16) in a Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing)
-
- Here is now a demo on how to fine tune [whisper-large](https://huggingface.co/openai/whisper-large-v2) (1.5B params) (14GB in fp16) in a Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1DOkD_5OUjFa0r5Ik3SgywJLJtEo2qLxO?usp=sharing) and [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1vhF8yueFqha3Y3CpTHN6q9EVcII9EYzs?usp=sharing)
-
-### Save compute and storage even for medium and small models
-
-Save storage by avoiding full finetuning of models on each of the downstream tasks/datasets,
-With PEFT methods, users only need to store tiny checkpoints in the order of `MBs` all the while retaining 
-performance comparable to full finetuning.
-
-An example of using LoRA for the task of adapting `LayoutLMForTokenClassification` on `FUNSD` dataset is given in `~examples/token_classification/PEFT_LoRA_LayoutLMForTokenClassification_on_FUNSD.py`. We can observe that with only `0.62 %` of parameters being trainable, we achieve performance (F1 0.777) comparable to full finetuning (F1 0.786) (without any hyperparam tuning runs for extracting more performance), and the checkpoint of this is only `2.8MB`. Now, if there are `N` such datasets, just have these PEFT models one for each dataset and save a lot of storage without having to worry about the problem of catastrophic forgetting or overfitting of backbone/base model.
-
-Another example is fine-tuning [`roberta-large`](https://huggingface.co/roberta-large) on [`MRPC` GLUE](https://huggingface.co/datasets/glue/viewer/mrpc) dataset using different PEFT methods. The notebooks are given in `~examples/sequence_classification`. 
-
-
-## PEFT + 🤗 Accelerate
-
-PEFT models work with 🤗 Accelerate out of the box. Use 🤗 Accelerate for Distributed training on various hardware such as GPUs, Apple Silicon devices, etc during training.
-Use 🤗 Accelerate for inferencing on consumer hardware with small resources.
-
-### Example of PEFT model training using 🤗 Accelerate's DeepSpeed integration
-
-DeepSpeed version required `v0.8.0`. An example is provided in `~examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py`. 
-  a. First, run `accelerate config --config_file ds_zero3_cpu.yaml` and answer the questionnaire. 
-  Below are the contents of the config file.
-  ```yaml
-  compute_environment: LOCAL_MACHINE
-  deepspeed_config:
-    gradient_accumulation_steps: 1
-    gradient_clipping: 1.0
-    offload_optimizer_device: cpu
-    offload_param_device: cpu
-    zero3_init_flag: true
-    zero3_save_16bit_model: true
-    zero_stage: 3
-  distributed_type: DEEPSPEED
-  downcast_bf16: 'no'
-  dynamo_backend: 'NO'
-  fsdp_config: {}
-  machine_rank: 0
-  main_training_function: main
-  megatron_lm_config: {}
-  mixed_precision: 'no'
-  num_machines: 1
-  num_processes: 1
-  rdzv_backend: static
-  same_network: true
-  use_cpu: false
-  ```
-  b. run the below command to launch the example script
-  ```bash
-  accelerate launch --config_file ds_zero3_cpu.yaml examples/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
-  ```
-
-  c. output logs:
-  ```bash
-  GPU Memory before entering the train : 1916
-  GPU Memory consumed at the end of the train (end-begin): 66
-  GPU Peak Memory consumed during the train (max-begin): 7488
-  GPU Total Peak Memory consumed during the train (max): 9404
-  CPU Memory before entering the train : 19411
-  CPU Memory consumed at the end of the train (end-begin): 0
-  CPU Peak Memory consumed during the train (max-begin): 0
-  CPU Total Peak Memory consumed during the train (max): 19411
-  epoch=4: train_ppl=tensor(1.0705, device='cuda:0') train_epoch_loss=tensor(0.0681, device='cuda:0')
-  100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:27<00:00,  3.92s/it]
-  GPU Memory before entering the eval : 1982
-  GPU Memory consumed at the end of the eval (end-begin): -66
-  GPU Peak Memory consumed during the eval (max-begin): 672
-  GPU Total Peak Memory consumed during the eval (max): 2654
-  CPU Memory before entering the eval : 19411
-  CPU Memory consumed at the end of the eval (end-begin): 0
-  CPU Peak Memory consumed during the eval (max-begin): 0
-  CPU Total Peak Memory consumed during the eval (max): 19411
-  accuracy=100.0
-  eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
-  dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
-  ```
-
-### Example of PEFT model inference using 🤗 Accelerate's Big Model Inferencing capabilities
-An example is provided in `~examples/causal_language_modeling/peft_lora_clm_accelerate_big_model_inference.ipynb`. 
-
-
-## Models support matrix
-
-### Causal Language Modeling
-| Model        | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-|--------------| ---- | ---- | ---- | ----  | ----  |
-| GPT-2        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| Bloom        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| OPT          | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-Neo      | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-J        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-NeoX-20B | ✅  | ✅  | ✅  | ✅  | ✅  |
-| LLaMA        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| ChatGLM      | ✅  | ✅  | ✅  | ✅  | ✅  |
-
-### Conditional Generation
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ---- | ---- |
-| T5        | ✅   | ✅   | ✅   | ✅   | ✅   |
-| BART      | ✅   | ✅   | ✅   | ✅   | ✅   |
-
-### Sequence Classification
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| BERT           | ✅  | ✅  | ✅  | ✅  |  ✅  |  
-| RoBERTa        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-2          | ✅  | ✅  | ✅  | ✅  |   |
-| Bloom          | ✅  | ✅  | ✅  | ✅  |   |
-| OPT            | ✅  | ✅  | ✅  | ✅  |   |
-| GPT-Neo        | ✅  | ✅  | ✅  | ✅  |   |
-| GPT-J          | ✅  | ✅  | ✅  | ✅  |   |
-| Deberta        | ✅  |     | ✅  | ✅  |   | 
-| Deberta-v2     | ✅  |     | ✅  | ✅  |   |
-
-### Token Classification
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| BERT           | ✅  | ✅  |   |   |   |  
-| RoBERTa        | ✅  | ✅  |   |   |   |
-| GPT-2          | ✅  | ✅  |   |   |   |
-| Bloom          | ✅  | ✅  |   |   |   |
-| OPT            | ✅  | ✅  |   |   |   |
-| GPT-Neo        | ✅  | ✅  |   |   |   |
-| GPT-J          | ✅  | ✅  |   |   |   |
-| Deberta        | ✅  |     |   |   |   |
-| Deberta-v2     | ✅  |     |   |   |   |
-
-### Text-to-Image Generation
-
-|   Model         | LoRA | LoHa | LoKr | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ---- | ---- | ----  | ----  |
-| Stable Diffusion           | ✅  | ✅  | ✅  |  |   |   |
-
-
-### Image Classification
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| ViT           | ✅  |   |   |   |    | 
-| Swin           | ✅  |   |   |   |   |  
-
-### Image to text (Multi-modal models)
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| Blip-2           | ✅  |   |   |   |   |
-
-___Note that we have tested LoRA for [ViT](https://huggingface.co/docs/transformers/model_doc/vit) and [Swin](https://huggingface.co/docs/transformers/model_doc/swin) for fine-tuning on image classification. However, it should be possible to use LoRA for any compatible model [provided](https://huggingface.co/models?pipeline_tag=image-classification&sort=downloads&search=vit) by 🤗 Transformers. Check out the respective
-examples to learn more. If you run into problems, please open an issue.___
-
-The same principle applies to our [segmentation models](https://huggingface.co/models?pipeline_tag=image-segmentation&sort=downloads) as well. 
-
-### Semantic Segmentation
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| SegFormer           | ✅  |   |   |   |   | 
-
-
-## Caveats:
-
-1. Below is an example of using PyTorch FSDP for training. However, it doesn't lead to 
-any GPU memory savings. Please refer issue [[FSDP] FSDP with CPU offload consumes 1.65X more GPU memory when training models with most of the params frozen](https://github.com/pytorch/pytorch/issues/91165). 
-
-  ```python
-  from peft.utils.other import fsdp_auto_wrap_policy
-
-  ...
-
-  if os.environ.get("ACCELERATE_USE_FSDP", None) is not None:
-      accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)
-
-  model = accelerator.prepare(model)
-  ```
-
-  Example of parameter efficient tuning with [`mt0-xxl`](https://huggingface.co/bigscience/mt0-xxl) base model using 🤗 Accelerate is provided in `~examples/conditional_generation/peft_lora_seq2seq_accelerate_fsdp.py`. 
-  a. First, run `accelerate config --config_file fsdp_config.yaml` and answer the questionnaire. 
-  Below are the contents of the config file.
-  ```yaml
-  command_file: null
-  commands: null
-  compute_environment: LOCAL_MACHINE
-  deepspeed_config: {}
-  distributed_type: FSDP
-  downcast_bf16: 'no'
-  dynamo_backend: 'NO'
-  fsdp_config:
-    fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-    fsdp_backward_prefetch_policy: BACKWARD_PRE
-    fsdp_offload_params: true
-    fsdp_sharding_strategy: 1
-    fsdp_state_dict_type: FULL_STATE_DICT
-    fsdp_transformer_layer_cls_to_wrap: T5Block
-  gpu_ids: null
-  machine_rank: 0
-  main_process_ip: null
-  main_process_port: null
-  main_training_function: main
-  megatron_lm_config: {}
-  mixed_precision: 'no'
-  num_machines: 1
-  num_processes: 2
-  rdzv_backend: static
-  same_network: true
-  tpu_name: null
-  tpu_zone: null
-  use_cpu: false
-  ```
-  b. run the below command to launch the example script
-  ```bash
-  accelerate launch --config_file fsdp_config.yaml examples/peft_lora_seq2seq_accelerate_fsdp.py
-  ```
-
-2. When using ZeRO3 with zero3_init_flag=True, if you find the gpu memory increase with training steps. we might need to update deepspeed after [deepspeed commit 42858a9891422abc](https://github.com/microsoft/DeepSpeed/commit/42858a9891422abcecaa12c1bd432d28d33eb0d4) . The related issue is [[BUG] Peft Training with Zero.Init() and Zero3 will increase GPU memory every forward step ](https://github.com/microsoft/DeepSpeed/issues/3002)
-
-## 🤗 PEFT as a utility library
-
-Inject trainable adapters on any `torch` model using `inject_adapter_in_model` method. Note the method will make no further change to the model.
+PEFT is directly integrated with [Transformers](https://huggingface.co/docs/transformers/main/en/peft). After loading a model, call `add_adapter` to add a new PEFT adapter to the model:

 ```python
-import torch 
-from peft import inject_adapter_in_model, LoraConfig
-
-class DummyModel(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.embedding = torch.nn.Embedding(10, 10)
-        self.linear = torch.nn.Linear(10, 10)
-        self.lm_head = torch.nn.Linear(10, 10)
-    
-    def forward(self, input_ids):
-        x = self.embedding(input_ids)
-        x = self.linear(x)
-        x = self.lm_head(x)
-        return x
-
-lora_config = LoraConfig(
-    lora_alpha=16,
-    lora_dropout=0.1,
-    r=64,
-    bias="none",
-    target_modules=["linear"],
-)
-
-model = DummyModel()
-model = inject_adapter_in_model(lora_config, model)
-
-dummy_inputs = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7]])
-dummy_outputs = model(dummy_inputs)
+from peft import LoraConfig
+model = ...  # transformers model
+peft_config = LoraConfig(...)
+model.add_adapter(peft_config, adapter_name="lora_1")
 ```

-## Contributing
+To load a trained PEFT adapter, call `load_adapter`:

-If you would like to contribute to PEFT, please check out our [contributing guide](https://huggingface.co/docs/peft/developer_guides/contributing).
+```python
+model = ...  # transformers model
+model.load_adapter(<path-to-adapter>, adapter_name="lora_1")
+```
+
+And to switch between different adapters, call `set_adapter`:
+
+```python
+model.set_adapter("lora_2")
+```
+
+The Transformers integration doesn't include all the functionalities offered in PEFT, such as methods for merging the adapter into the base model.
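
For context, "merging" a LoRA adapter means folding the low-rank update into the base weight, i.e. `W_merged = W + (lora_alpha / r) * B @ A`, so that inference no longer needs the separate adapter matrices. The sketch below illustrates this idea in plain Python with toy matrices. It is not PEFT's implementation: the helpers `matmul`, `linear`, and `merge_lora` are hypothetical names introduced only for this illustration. In PEFT itself, merging is exposed on LoRA models via `merge_and_unload()`.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear(weight, x):
    """Apply a bias-free linear layer: weight (out x in) times vector x (in)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def merge_lora(weight, lora_A, lora_B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A (illustrative helper, not a PEFT API)."""
    scaling = lora_alpha / r
    delta = matmul(lora_B, lora_A)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 3 input features, 2 output features, rank 1.
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight W (2 x 3)
lora_A = [[0.5, -0.5, 1.0]]                   # A (1 x 3)
lora_B = [[2.0], [1.0]]                       # B (2 x 1)
lora_alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

# Unmerged forward pass: base output plus the scaled low-rank path B(Ax).
base_out = linear(weight, x)
lora_out = linear(lora_B, linear(lora_A, x))
unmerged_out = [b + (lora_alpha / r) * l for b, l in zip(base_out, lora_out)]

# Merged forward pass: a single linear layer with the folded weight.
merged_out = linear(merge_lora(weight, lora_A, lora_B, lora_alpha, r), x)

print(unmerged_out, merged_out)
```

Both paths produce the same output, which is why a merged model can be served without the adapter's extra matrix multiplications.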
+
+### Accelerate
+
+[Accelerate](https://huggingface.co/docs/accelerate/index) is a library for distributed training and inference on various training setups and hardware (GPUs, TPUs, Apple Silicon, etc.). PEFT models work with Accelerate out of the box, making it convenient to train very large models or use them for inference on consumer hardware with limited resources.
+
+### TRL
+
+PEFT can also be applied to training LLMs with RLHF components such as the ranker and policy. Get started by reading:
+
+* [Fine-tune a Mistral-7b model with Direct Preference Optimization](https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac) with PEFT and the [TRL](https://huggingface.co/docs/trl/index) library to learn more about the Direct Preference Optimization (DPO) method and how to apply it to an LLM.
+* [Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU](https://huggingface.co/blog/trl-peft) with PEFT and the [TRL](https://huggingface.co/docs/trl/index) library, and then try out the [gpt2-sentiment_peft.ipynb](https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) notebook to optimize GPT2 to generate positive movie reviews.
+* [StackLLaMA: A hands-on guide to train LLaMA with RLHF](https://huggingface.co/blog/stackllama) with PEFT, and then try out the [stack_llama/scripts](https://github.com/huggingface/trl/tree/main/examples/research_projects/stack_llama/scripts) for supervised finetuning, reward modeling, and RL finetuning.
+
+## Model support
+
+Use this [Space](https://stevhliu-peft-methods.hf.space) or check out the [docs](https://huggingface.co/docs/peft/main/en/index) to find which models officially support a PEFT method out of the box. Even if you don't see a model listed there, you can manually configure the model config to enable PEFT for a model. Read the [New transformers architecture](https://huggingface.co/docs/peft/main/en/developer_guides/custom_models#new-transformers-architectures) guide to learn how.
+
+## Contribute
+
+If you would like to contribute to PEFT, please check out our [contribution guide](https://huggingface.co/docs/peft/developer_guides/contributing).

 ## Citing 🤗 PEFT

-If you use 🤗 PEFT in your publication, please cite it by using the following BibTeX entry.
+To use 🤗 PEFT in your publication, please cite it by using the following BibTeX entry.

 ```bibtex
@Misc{peft,
-  title =        {PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods},
+  title =        {{PEFT}: State-of-the-art Parameter-Efficient Fine-Tuning methods},
  author =       {Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan},
  howpublished = {\url{https://github.com/huggingface/peft}},
  year =         {2022}
--- a/docker/README.md
+++ b/docker/README.md
@ -0,0 +1,8 @@
+# PEFT Docker images
+
+Here we store all PEFT Docker images used in our testing infrastructure. For now, all our images use Python 3.11.
+
+- `peft-cpu`: PEFT compiled on CPU with all other HF libraries installed on main branch
+- `peft-gpu`: PEFT compiled for NVIDIA GPUs with all other HF libraries installed on main branch
+- `peft-gpu-bnb-source`: PEFT compiled for NVIDIA GPUs with `bitsandbytes` and all other HF libraries installed from main branch
+- `peft-gpu-bnb-latest`: PEFT compiled for NVIDIA GPUs with `bitsandbytes` compiled from main and all other HF libraries installed from latest PyPI
--- a/docker/peft-cpu/Dockerfile
+++ b/docker/peft-cpu/Dockerfile
@ -4,13 +4,14 @@
 # Use base conda image to reduce time
 FROM continuumio/miniconda3:latest AS compile-image
 # Specify py version
-ENV PYTHON_VERSION=3.8
+ENV PYTHON_VERSION=3.11
 # Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
 RUN apt-get update && \
    apt-get install -y curl git wget software-properties-common git-lfs && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists*

+
 # Install audio-related libraries 
 RUN apt-get update && \
    apt install -y ffmpeg
@ -48,4 +49,4 @@ RUN apt-get update && \
 RUN echo "source activate peft" >> ~/.profile

 # Activate the virtualenv
-CMD ["/bin/bash"]
+CMD ["/bin/bash"]
--- a/docker/peft-gpu-bnb-latest/Dockerfile
+++ b/docker/peft-gpu-bnb-latest/Dockerfile
@ -0,0 +1,68 @@
+# Builds GPU docker image of PyTorch
+# Uses multi-staged approach to reduce size
+# Stage 1
+# Use base conda image to reduce time
+FROM continuumio/miniconda3:latest AS compile-image
+# Specify py version
+ENV PYTHON_VERSION=3.11
+# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
+RUN apt-get update && \
+    apt-get install -y curl git wget software-properties-common git-lfs && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists*
+
+# Install audio-related libraries 
+RUN apt-get update && \
+    apt install -y ffmpeg
+
+RUN apt install -y libsndfile1-dev
+RUN git lfs install
+
+# Create our conda env - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
+RUN conda create --name peft python=${PYTHON_VERSION} ipython jupyter pip
+RUN python3 -m pip install --no-cache-dir --upgrade pip
+
+# Below is copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
+# We don't install pytorch here yet since CUDA isn't available
+# instead we use the direct torch wheel
+ENV PATH /opt/conda/envs/peft/bin:$PATH
+# Activate our bash shell
+RUN chsh -s /bin/bash
+SHELL ["/bin/bash", "-c"]
+
+# Stage 2
+FROM nvidia/cuda:12.6.3-devel-ubuntu22.04 AS build-image
+COPY --from=compile-image /opt/conda /opt/conda
+ENV PATH /opt/conda/bin:$PATH
+
+RUN chsh -s /bin/bash
+SHELL ["/bin/bash", "-c"]
+
+# Install apt libs
+RUN apt-get update && \
+    apt-get install -y curl git wget cmake && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists*
+
+# Activate the conda env and install transformers + accelerate from latest pypi
+# Also clone BNB and build it from source.
+RUN source activate peft && \
+    python3 -m pip install -U --no-cache-dir \
+    librosa \
+    "soundfile>=0.12.1" \
+    scipy \
+    transformers \
+    accelerate \
+    peft \
+    optimum \
+    auto-gptq && \
+    git clone https://github.com/bitsandbytes-foundation/bitsandbytes && cd bitsandbytes && \
+    cmake -B . -DCOMPUTE_BACKEND=cuda -S . && \
+    cmake --build . && \
+    pip install -e . && \ 
+    pip freeze | grep bitsandbytes
+
+RUN echo "source activate peft" >> ~/.profile
+
+# Activate the virtualenv
+CMD ["/bin/bash"]
--- a/docker/peft-gpu-bnb-source/Dockerfile
+++ b/docker/peft-gpu-bnb-source/Dockerfile
@ -0,0 +1,68 @@
+# Builds GPU docker image of PyTorch
+# Uses multi-staged approach to reduce size
+# Stage 1
+# Use base conda image to reduce time
+FROM continuumio/miniconda3:latest AS compile-image
+# Specify py version
+ENV PYTHON_VERSION=3.11
+# Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
+RUN apt-get update && \
+    apt-get install -y curl git wget software-properties-common git-lfs && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists*
+
+# Install audio-related libraries 
+RUN apt-get update && \
+    apt install -y ffmpeg
+
+RUN apt install -y libsndfile1-dev
+RUN git lfs install
+
+# Create our conda env - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
+RUN conda create --name peft python=${PYTHON_VERSION} ipython jupyter pip
+RUN python3 -m pip install --no-cache-dir --upgrade pip
+
+# Below is copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
+# We don't install pytorch here yet since CUDA isn't available
+# instead we use the direct torch wheel
+ENV PATH /opt/conda/envs/peft/bin:$PATH
+# Activate our bash shell
+RUN chsh -s /bin/bash
+SHELL ["/bin/bash", "-c"]
+
+# Stage 2
+FROM nvidia/cuda:12.6.3-devel-ubuntu22.04 AS build-image
+COPY --from=compile-image /opt/conda /opt/conda
+ENV PATH /opt/conda/bin:$PATH
+
+RUN chsh -s /bin/bash
+SHELL ["/bin/bash", "-c"]
+
+# Install apt libs
+RUN apt-get update && \
+    apt-get install -y curl git wget cmake && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists*
+
+# Activate the conda env and install transformers + accelerate from source
+# Also clone BNB and build it from source.
+RUN source activate peft && \
+    python3 -m pip install -U --no-cache-dir \
+    librosa \
+    "soundfile>=0.12.1" \
+    scipy \
+    git+https://github.com/huggingface/transformers \
+    git+https://github.com/huggingface/accelerate \
+    peft[test]@git+https://github.com/huggingface/peft \
+    optimum \
+    auto-gptq && \
+    git clone https://github.com/bitsandbytes-foundation/bitsandbytes && cd bitsandbytes && \
+    cmake -B . -DCOMPUTE_BACKEND=cuda -S . && \
+    cmake --build . && \
+    pip install -e . && \ 
+    pip freeze | grep bitsandbytes
+
+RUN echo "source activate peft" >> ~/.profile
+
+# Activate the virtualenv
+CMD ["/bin/bash"]
--- a/docker/peft-gpu/Dockerfile
+++ b/docker/peft-gpu/Dockerfile
@ -4,23 +4,18 @@
 # Use base conda image to reduce time
 FROM continuumio/miniconda3:latest AS compile-image
 # Specify py version
-ENV PYTHON_VERSION=3.8
+ENV PYTHON_VERSION=3.11
 # Install apt libs - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
+# Install audio-related libraries
 RUN apt-get update && \
-    apt-get install -y curl git wget software-properties-common git-lfs && \
+    apt-get install -y curl git wget software-properties-common git-lfs ffmpeg libsndfile1-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists*

-# Install audio-related libraries 
-RUN apt-get update && \
-    apt install -y ffmpeg
-
-RUN apt install -y libsndfile1-dev
 RUN git lfs install

 # Create our conda env - copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
 RUN conda create --name peft python=${PYTHON_VERSION} ipython jupyter pip
-RUN python3 -m pip install --no-cache-dir --upgrade pip

 # Below is copied from https://github.com/huggingface/accelerate/blob/main/docker/accelerate-gpu/Dockerfile
 # We don't install pytorch here yet since CUDA isn't available
@ -29,32 +24,46 @@ ENV PATH /opt/conda/envs/peft/bin:$PATH
 # Activate our bash shell
 RUN chsh -s /bin/bash
 SHELL ["/bin/bash", "-c"]
-# Activate the conda env and install transformers + accelerate from source
-RUN source activate peft && \
-    python3 -m pip install --no-cache-dir \
-    librosa \
-    "soundfile>=0.12.1" \
-    scipy \
-    git+https://github.com/huggingface/transformers \
-    git+https://github.com/huggingface/accelerate \
-    peft[test]@git+https://github.com/huggingface/peft

 # Stage 2
-FROM nvidia/cuda:12.2.2-devel-ubuntu22.04 AS build-image
+FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build-image
 COPY --from=compile-image /opt/conda /opt/conda
 ENV PATH /opt/conda/bin:$PATH

-RUN chsh -s /bin/bash
-SHELL ["/bin/bash", "-c"]
-RUN source activate peft && \ 
-    python3 -m pip install --no-cache-dir bitsandbytes optimum auto-gptq
-
 # Install apt libs
 RUN apt-get update && \
    apt-get install -y curl git wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists*

+RUN chsh -s /bin/bash
+SHELL ["/bin/bash", "-c"]
+RUN source activate peft && \ 
+    python3 -m pip install --no-cache-dir bitsandbytes optimum auto-gptq && \
+    # Add autoawq for quantization testing
+    python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.7.post2/autoawq-0.2.7.post2-py3-none-any.whl && \
+    python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ_kernels/releases/download/v0.0.9/autoawq_kernels-0.0.9-cp311-cp311-linux_x86_64.whl && \
+    # Add eetq for quantization testing
+    python3 -m pip install git+https://github.com/NetEase-FuXi/EETQ.git
+
+# Activate the conda env and install transformers + accelerate from source
+RUN source activate peft && \
+    python3 -m pip install -U --no-cache-dir \
+    librosa \
+    "soundfile>=0.12.1" \
+    scipy \
+    torchao \
+    git+https://github.com/huggingface/transformers \
+    git+https://github.com/huggingface/accelerate \
+    peft[test]@git+https://github.com/huggingface/peft \
+    # Add aqlm for quantization testing
+    "aqlm[gpu]>=1.0.2" \
+    # Add HQQ for quantization testing
+    hqq
+
+RUN source activate peft && \
+    pip freeze | grep transformers
+
 RUN echo "source activate peft" >> ~/.profile

 # Activate the virtualenv
--- a/docs/README.md
+++ b/docs/README.md
@ -33,7 +33,7 @@ pip install git+https://github.com/huggingface/doc-builder
 **NOTE**

 You only need to generate the documentation to inspect it locally (if you're planning changes and want to
-check how they look before committing for instance). You don't have to commit the built documentation.
+check how they look before committing, for instance). You don't have to commit the built documentation.

 ---

@ -46,7 +46,7 @@ typing the following command:
 doc-builder build peft docs/source/ --build_dir ~/tmp/test-build
 ```

-You can adapt the `--build_dir` to set any temporary folder that you prefer. This command will create it and generate
+You can adapt the `--build_dir` to set any temporary folder you prefer. This command will create it and generate
 the MDX files that will be rendered as the documentation on the main website. You can inspect them in your favorite
 Markdown editor.

@ -124,7 +124,7 @@ Adding a new tutorial or section is done in two steps:
 - Link that file in `./source/_toctree.yml` on the correct toc-tree.

 Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so
-depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or
+depending on the intended targets (beginners, more advanced users, or researchers) it should go into sections two, three, or
 four.

 ### Writing source documentation
@ -188,7 +188,7 @@ then its documentation should look like this:
 ```

 Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even
-if the first line describing your argument type and its default gets long, you can't break it on several lines. You can
+if the first line describing your argument type and its default gets long, you can't break it into several lines. You can
 however write as many lines as you want in the indented description (see the example above with `input_ids`).

 #### Writing a multi-line code block
@ -234,13 +234,13 @@ We have an automatic script running with the `make style` comment that will make
 - the docstrings fully take advantage of the line width
 - all code examples are formatted using black, like the code of the Transformers library

-This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's
+This script may have some weird failures if you make a syntax mistake or if you uncover a bug. Therefore, it's
 recommended to commit your changes before running `make style`, so you can revert the changes done by that script
 easily.

 ## Writing documentation examples

-The syntax for Example docstrings can look as follows:
+The syntax for example docstrings can look as follows:

 ```
    Example:
@ -264,4 +264,4 @@ is to be used in inference and also include the expected (ideally sensible)
 output.
 Often, readers will try out the example before even going through the function 
 or class definitions. Therefore, it is of utmost importance that the example 
-works as expected.
+works as expected.
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@ -7,59 +7,145 @@
  - local: install
    title: Installation

- title: Task guides
+- title: Tutorial
  sections:
-  - local: task_guides/image_classification_lora
-    title: Image classification using LoRA
-  - local: task_guides/seq2seq-prefix-tuning
-    title: Prefix tuning for conditional generation
-  - local: task_guides/clm-prompt-tuning
-    title: Prompt tuning for causal language modeling
-  - local: task_guides/semantic_segmentation_lora
-    title: Semantic segmentation using LoRA
-  - local: task_guides/ptuning-seq-classification
-    title: P-tuning for sequence classification
-  - local: task_guides/dreambooth_lora
-    title: Dreambooth fine-tuning with LoRA
-  - local: task_guides/token-classification-lora
-    title: LoRA for token classification
-  - local: task_guides/int8-asr
-    title: int8 training for automatic speech recognition
-  - local: task_guides/semantic-similarity-lora
-    title: Semantic similarity with LoRA
+  - local: tutorial/peft_model_config
+    title: Configurations and models
+  - local: tutorial/peft_integrations
+    title: Integrations
+
+- title: PEFT method guides
+  sections:
+  - local: task_guides/prompt_based_methods
+    title: Prompt-based methods
+  - local: task_guides/lora_based_methods
+    title: LoRA methods
+  - local: task_guides/ia3
+    title: IA3

 - title: Developer guides
  sections:
+  - local: developer_guides/model_merging
+    title: Model merging
+  - local: developer_guides/quantization
+    title: Quantization
+  - local: developer_guides/lora
+    title: LoRA
  - local: developer_guides/custom_models
-    title: Working with custom models
+    title: Custom models
  - local: developer_guides/low_level_api
-    title: PEFT low level API
+    title: Adapter injection
+  - local: developer_guides/mixed_models
+    title: Mixed adapter types
+  - local: developer_guides/torch_compile
+    title: torch.compile
  - local: developer_guides/contributing
-    title: Contributing to PEFT
+    title: Contribute to PEFT
  - local: developer_guides/troubleshooting
    title: Troubleshooting
+  - local: developer_guides/checkpoint
+    title: PEFT checkpoint format

 - title: 🤗 Accelerate integrations
  sections:
-  - local: accelerate/deepspeed-zero3-offload
+  - local: accelerate/deepspeed
    title: DeepSpeed
  - local: accelerate/fsdp
    title: Fully Sharded Data Parallel

 - title: Conceptual guides
  sections:
-  - local: conceptual_guides/lora
-    title: LoRA
+  - local: conceptual_guides/adapter
+    title: Adapters
  - local: conceptual_guides/prompting
-    title: Prompting
+    title: Soft prompts
  - local: conceptual_guides/ia3
    title: IA3
+  - local: conceptual_guides/oft
+    title: OFT/BOFT

- title: Reference
-  sections:
-  - local: package_reference/peft_model
-    title: PEFT model
-  - local: package_reference/config
-    title: Configuration
-  - local: package_reference/tuners
-    title: Tuners
+- sections:
+  - sections:
+    - local: package_reference/auto_class
+      title: AutoPeftModel
+    - local: package_reference/peft_model
+      title: PEFT model
+    - local: package_reference/peft_types
+      title: PEFT types
+    - local: package_reference/config
+      title: Configuration
+    - local: package_reference/tuners
+      title: Tuner
+    title: Main classes
+  - sections:
+    - local: package_reference/adalora
+      title: AdaLoRA
+    - local: package_reference/ia3
+      title: IA3
+    - local: package_reference/llama_adapter
+      title: Llama-Adapter
+    - local: package_reference/loha
+      title: LoHa
+    - local: package_reference/lokr
+      title: LoKr
+    - local: package_reference/lora
+      title: LoRA
+    - local: package_reference/xlora
+      title: X-LoRA
+    - local: package_reference/adapter_utils
+      title: LyCORIS
+    - local: package_reference/multitask_prompt_tuning
+      title: Multitask Prompt Tuning
+    - local: package_reference/oft
+      title: OFT
+    - local: package_reference/boft
+      title: BOFT
+    - local: package_reference/poly
+      title: Polytropon
+    - local: package_reference/p_tuning
+      title: P-tuning
+    - local: package_reference/prefix_tuning
+      title: Prefix tuning
+    - local: package_reference/prompt_tuning
+      title: Prompt tuning
+    - local: package_reference/layernorm_tuning
+      title: Layernorm tuning
+    - local: package_reference/vera
+      title: VeRA
+    - local: package_reference/fourierft
+      title: FourierFT
+    - local: package_reference/vblora
+      title: VB-LoRA
+    - local: package_reference/hra
+      title: HRA
+    - local: package_reference/cpt
+      title: CPT
+    - local: package_reference/bone
+      title: Bone
+    - local: package_reference/trainable_tokens
+      title: Trainable Tokens
+    - local: package_reference/randlora
+      title: RandLora
+    - local: package_reference/shira
+      title: SHiRA
+    - local: package_reference/c3a
+      title: C3A
+    - local: package_reference/miss
+      title: MiSS
+    - local: package_reference/road
+      title: RoAd
+    - local: package_reference/waveft
+      title: WaveFT
+
+    title: Adapters
+  - sections:
+    - local: package_reference/merge_utils
+      title: Model merge
+    - local: package_reference/helpers
+      title: Helpers
+    - local: package_reference/hotswap
+      title: Hotswapping adapters
+    - local: package_reference/functional
+      title: Functions for PEFT integration
+    title: Utilities
+  title: API reference
--- a/docs/source/accelerate/deepspeed-zero3-offload.mdx
+++ b/docs/source/accelerate/deepspeed-zero3-offload.mdx
@ -1,163 +0,0 @@
-# DeepSpeed
-
-[DeepSpeed](https://www.deepspeed.ai/) is a library designed for speed and scale for distributed training of large models with billions of parameters. At its core is the Zero Redundancy Optimizer (ZeRO) that shards optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across data parallel processes. This drastically reduces memory usage, allowing you to scale your training to billion parameter models. To unlock even more memory efficiency, ZeRO-Offload reduces GPU compute and memory by leveraging CPU resources during optimization.
-
-Both of these features are supported in 🤗 Accelerate, and you can use them with 🤗 PEFT. This guide will help you learn how to use our DeepSpeed [training script](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You'll configure the script to train a large model for conditional generation with ZeRO-3 and ZeRO-Offload.
-
-<Tip>
-
-💡 To help you get started, check out our example training scripts for [causal language modeling](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py) and [conditional generation](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You can adapt these scripts for your own applications or even use them out of the box if your task is similar to the one in the scripts.
-
-</Tip>
-
-## Configuration
-
-Start by running the following command to [create a DeepSpeed configuration file](https://huggingface.co/docs/accelerate/quicktour#launching-your-distributed-script) with 🤗 Accelerate. The `--config_file` flag allows you to save the configuration file to a specific location, otherwise it is saved as a `default_config.yaml` file in the 🤗 Accelerate cache.
-
-The configuration file is used to set the default options when you launch the training script.
-
-```bash
-accelerate config --config_file ds_zero3_cpu.yaml
-```
-
-You'll be asked a few questions about your setup, and configure the following arguments. In this example, you'll use ZeRO-3 and ZeRO-Offload so make sure you pick those options.
-
-```bash
-`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
-`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
-`gradient_clipping`: Enable gradient clipping with value.
-`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
-`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
-`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
-`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
-`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training. 
-```
-
-An example [configuration file](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/accelerate_ds_zero3_cpu_offload_config.yaml) might look like the following. The most important thing to notice is that `zero_stage` is set to `3`, and `offload_optimizer_device` and `offload_param_device` are set to the `cpu`.
-
-```yml
-compute_environment: LOCAL_MACHINE
-deepspeed_config:
-  gradient_accumulation_steps: 1
-  gradient_clipping: 1.0
-  offload_optimizer_device: cpu
-  offload_param_device: cpu
-  zero3_init_flag: true
-  zero3_save_16bit_model: true
-  zero_stage: 3
-distributed_type: DEEPSPEED
-downcast_bf16: 'no'
-dynamo_backend: 'NO'
-fsdp_config: {}
-machine_rank: 0
-main_training_function: main
-megatron_lm_config: {}
-mixed_precision: 'no'
-num_machines: 1
-num_processes: 1
-rdzv_backend: static
-same_network: true
-use_cpu: false
-```
-
-## The important parts
-
-Let's dive a little deeper into the script so you can see what's going on, and understand how it works.
-
-Within the [`main`](https://github.com/huggingface/peft/blob/2822398fbe896f25d4dac5e468624dc5fd65a51b/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py#L103) function, the script creates an [`~accelerate.Accelerator`] class to initialize all the necessary requirements for distributed training.
-
-<Tip>
-
-💡 Feel free to change the model and dataset inside the `main` function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function. 
-
-</Tip>
-
-The script also creates a configuration for the 🤗 PEFT method you're using, which in this case, is LoRA. The [`LoraConfig`] specifies the task type and important parameters such as the dimension of the low-rank matrices, the matrices scaling factor, and the dropout probability of the LoRA layers. If you want to use a different 🤗 PEFT method, make sure you replace `LoraConfig` with the appropriate [class](../package_reference/tuners).
-
-```diff
- def main():
-+    accelerator = Accelerator()
-     model_name_or_path = "facebook/bart-large"
-     dataset_name = "twitter_complaints"
-+    peft_config = LoraConfig(
-         task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
-     )
-```
-
-Throughout the script, you'll see the [`~accelerate.Accelerator.main_process_first`] and [`~accelerate.Accelerator.wait_for_everyone`] functions which help control and synchronize when processes are executed.
-
-The [`get_peft_model`] function takes a base model and the [`peft_config`] you prepared earlier to create a [`PeftModel`]:
-
-```diff
-  model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
-+ model = get_peft_model(model, peft_config)
-```
-
-Pass all the relevant training objects to 🤗 Accelerate's [`~accelerate.Accelerator.prepare`] which makes sure everything is ready for training:
-
-```py
-model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare(
-    model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler
-)
-```
-
-The next bit of code checks whether the DeepSpeed plugin is used in the `Accelerator`, and if the plugin exists, then the `Accelerator` uses ZeRO-3 as specified in the configuration file:
-
-```py
-is_ds_zero_3 = False
-if getattr(accelerator.state, "deepspeed_plugin", None):
-    is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3
-```
-
-Inside the training loop, the usual `loss.backward()` is replaced by 🤗 Accelerate's [`~accelerate.Accelerator.backward`] which uses the correct `backward()` method based on your configuration:
-
-```diff
-  for epoch in range(num_epochs):
-      with TorchTracemalloc() as tracemalloc:
-          model.train()
-          total_loss = 0
-          for step, batch in enumerate(tqdm(train_dataloader)):
-              outputs = model(**batch)
-              loss = outputs.loss
-              total_loss += loss.detach().float()
-+             accelerator.backward(loss)
-              optimizer.step()
-              lr_scheduler.step()
-              optimizer.zero_grad()
-```
-
-That is all! The rest of the script handles the training loop, evaluation, and even pushes it to the Hub for you.
-
-## Train
-
-Run the following command to launch the training script. Earlier, you saved the configuration file to `ds_zero3_cpu.yaml`, so you'll need to pass the path to the launcher with the `--config_file` argument like this:
-
-```bash
-accelerate launch --config_file ds_zero3_cpu.yaml examples/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
-```
-
-You'll see some output logs that track memory usage during training, and once it's completed, the script returns the accuracy and compares the predictions to the labels:
-
-```bash
-GPU Memory before entering the train : 1916
-GPU Memory consumed at the end of the train (end-begin): 66
-GPU Peak Memory consumed during the train (max-begin): 7488
-GPU Total Peak Memory consumed during the train (max): 9404
-CPU Memory before entering the train : 19411
-CPU Memory consumed at the end of the train (end-begin): 0
-CPU Peak Memory consumed during the train (max-begin): 0
-CPU Total Peak Memory consumed during the train (max): 19411
-epoch=4: train_ppl=tensor(1.0705, device='cuda:0') train_epoch_loss=tensor(0.0681, device='cuda:0')
-100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:27<00:00,  3.92s/it]
-GPU Memory before entering the eval : 1982
-GPU Memory consumed at the end of the eval (end-begin): -66
-GPU Peak Memory consumed during the eval (max-begin): 672
-GPU Total Peak Memory consumed during the eval (max): 2654
-CPU Memory before entering the eval : 19411
-CPU Memory consumed at the end of the eval (end-begin): 0
-CPU Peak Memory consumed during the eval (max-begin): 0
-CPU Total Peak Memory consumed during the eval (max): 19411
-accuracy=100.0
-eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
-dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
-```
--- a/docs/source/accelerate/deepspeed.md
+++ b/docs/source/accelerate/deepspeed.md
@ -0,0 +1,449 @@
+<!--⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# DeepSpeed
+
+[DeepSpeed](https://www.deepspeed.ai/) is a library designed for speed and scale for distributed training of large models with billions of parameters. At its core is the Zero Redundancy Optimizer (ZeRO) that shards optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across data parallel processes. This drastically reduces memory usage, allowing you to scale your training to billion parameter models. To unlock even more memory efficiency, ZeRO-Offload reduces GPU compute and memory by leveraging CPU resources during optimization.
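
To make the three stages concrete, here is a rough back-of-the-envelope sketch of per-GPU memory for model states. It assumes the accounting commonly quoted for mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states) and full fine-tuning. With PEFT, only the adapter parameters carry gradients and optimizer states, so real numbers will be much lower. The function name and the 7.5B-parameter, 64-GPU figures are illustrative assumptions, not measurements from this guide.

```python
def zero_model_state_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for model states under a given ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for 16-bit weights,
    2 bytes/param for 16-bit gradients, 12 bytes/param for optimizer states.
    Activations and temporary buffers are ignored.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:  # ZeRO-1 shards optimizer states
        optim /= num_gpus
    if stage >= 2:  # ZeRO-2 additionally shards gradients
        grads /= num_gpus
    if stage >= 3:  # ZeRO-3 additionally shards parameters
        weights /= num_gpus
    return num_params * (weights + grads + optim)


for stage in range(4):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"ZeRO-{stage}: {gb:.1f} GB per GPU")
```

Each stage divides one more term by the number of GPUs, which is why ZeRO-3 is the only stage whose footprint keeps shrinking as you add devices.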
+
+Both of these features are supported in 🤗 Accelerate, and you can use them with 🤗 PEFT. 
+
+## Compatibility with `bitsandbytes` quantization + LoRA
+
+Below is a table that summarizes the compatibility between PEFT's LoRA, the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) library and DeepSpeed Zero stages with respect to fine-tuning. DeepSpeed Zero-1 and 2 have no effect at inference, because stage 1 shards only the optimizer states and stage 2 shards the optimizer states and gradients, neither of which exists at inference time:
+
+| DeepSpeed stage   | Is compatible? |
+|---|---|
+| Zero-1 |  🟢 |
+| Zero-2   |  🟢 |
+| Zero-3  |  🟢 |
+
+For DeepSpeed Stage 3 + QLoRA, please refer to the section [Use PEFT QLoRA and DeepSpeed with ZeRO3 for finetuning large models on multiple GPUs](#use-peft-qlora-and-deepspeed-with-zero3-for-finetuning-large-models-on-multiple-gpus) below.
+
+To confirm these observations, we ran the SFT (Supervised Fine-tuning) [official example scripts](https://github.com/huggingface/trl/tree/main/examples) of the [Transformers Reinforcement Learning (TRL) library](https://github.com/huggingface/trl) using QLoRA + PEFT and the accelerate configs available [here](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs). We ran these experiments on 2x NVIDIA T4 GPUs.
+
+# Use PEFT and DeepSpeed with ZeRO3 for finetuning large models on multiple devices and multiple nodes
+
+This section of the guide will help you learn how to use our DeepSpeed [training script](https://github.com/huggingface/peft/blob/main/examples/sft/train.py) for performing SFT. You'll configure the script to do SFT (supervised fine-tuning) of the Llama-70B model with LoRA and ZeRO-3 on 8xH100 80GB GPUs on a single machine. You can configure it to scale to multiple machines by changing the accelerate config.
+
+## Configuration
+
+Start by running the following command to [create a DeepSpeed configuration file](https://huggingface.co/docs/accelerate/quicktour#launching-your-distributed-script) with 🤗 Accelerate. The `--config_file` flag allows you to save the configuration file to a specific location, otherwise it is saved as a `default_config.yaml` file in the 🤗 Accelerate cache.
+
+The configuration file is used to set the default options when you launch the training script.
+
+```bash
+accelerate config --config_file deepspeed_config.yaml
+```
+
+You'll be asked a few questions about your setup, and configure the following arguments. In this example, you'll use ZeRO-3 so make sure you pick those options.
+
+```bash
+`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
+`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them. Pass the same value that you pass via the command-line argument; otherwise you will encounter a mismatch error.
+`gradient_clipping`: Enable gradient clipping with value. Don't set this, as you will pass it via command-line arguments.
+`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2. Set this to `none`, as we don't want to enable offloading.
+`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3. Set this to `none`, as we don't want to enable offloading.
+`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3. Set this to `True`.
+`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3. Set this to `True`.
+`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training. Set this to `bf16`.
+```
+
+Once this is done, the corresponding config should look like the one below. You can find it in the config folder at [deepspeed_config.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/deepspeed_config.yaml):
+
+```yml
+compute_environment: LOCAL_MACHINE
+debug: false
+deepspeed_config:
+  deepspeed_multinode_launcher: standard
+  gradient_accumulation_steps: 4
+  offload_optimizer_device: none
+  offload_param_device: none
+  zero3_init_flag: true
+  zero3_save_16bit_model: true
+  zero_stage: 3
+distributed_type: DEEPSPEED
+downcast_bf16: 'no'
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 8
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+## Launch command
+
+The launch command is available at [run_peft_deepspeed.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_deepspeed.sh) and it is also shown below:
+```bash
+accelerate launch --config_file "configs/deepspeed_config.yaml"  train.py \
+--seed 100 \
+--model_name_or_path "meta-llama/Llama-2-70b-hf" \
+--dataset_name "smangrul/ultrachat-10k-chatml" \
+--chat_template_format "chatml" \
+--add_special_tokens False \
+--append_concat_token False \
+--splits "train,test" \
+--max_seq_len 2048 \
+--num_train_epochs 1 \
+--logging_steps 5 \
+--log_level "info" \
+--logging_strategy "steps" \
+--eval_strategy "epoch" \
+--save_strategy "epoch" \
+--push_to_hub \
+--hub_private_repo True \
+--hub_strategy "every_save" \
+--bf16 True \
+--packing True \
+--learning_rate 1e-4 \
+--lr_scheduler_type "cosine" \
+--weight_decay 1e-4 \
+--warmup_ratio 0.0 \
+--max_grad_norm 1.0 \
+--output_dir "llama-sft-lora-deepspeed" \
+--per_device_train_batch_size 8 \
+--per_device_eval_batch_size 8 \
+--gradient_accumulation_steps 4 \
+--gradient_checkpointing True \
+--use_reentrant False \
+--dataset_text_field "content" \
+--use_flash_attn True \
+--use_peft_lora True \
+--lora_r 8 \
+--lora_alpha 16 \
+--lora_dropout 0.1 \
+--lora_target_modules "all-linear" \
+--use_4bit_quantization False
+```
+
+Notice that we are using LoRA with rank=8 and alpha=16, targeting all linear layers. We are passing the DeepSpeed config file and finetuning the 70B Llama model on a subset of the ultrachat dataset.
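
As a quick sanity check on these hyperparameters, the sketch below derives two quantities implied by the launch command and the Accelerate config: the LoRA scaling factor and the number of sequences per optimizer step. It assumes the default LoRA scaling of `lora_alpha / r` (not the rank-stabilized variant), and the helper names are hypothetical, chosen only for illustration.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Default LoRA scaling factor applied to the low-rank update (alpha / r)."""
    return lora_alpha / r


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_processes: int) -> int:
    """Number of sequences contributing to one optimizer step across all GPUs."""
    return per_device_batch * grad_accum_steps * num_processes


# Values taken from the launch command (--lora_r, --lora_alpha,
# --per_device_train_batch_size, --gradient_accumulation_steps)
# and from `num_processes` in the Accelerate config.
scale = lora_scaling(lora_alpha=16, r=8)
batch = effective_batch_size(per_device_batch=8, grad_accum_steps=4, num_processes=8)
print(f"LoRA scaling: {scale}, sequences per optimizer step: {batch}")
```

Note that `gradient_accumulation_steps` appears both in the config file and on the command line, so the same value (4 here) has to be used in both places.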
+
+## The important parts
+
+Let's dive a little deeper into the script so you can see what's going on, and understand how it works.
+
+The first thing to know is that the script uses DeepSpeed for distributed training because the DeepSpeed config has been passed. The [`~trl.SFTTrainer`] class handles all the heavy lifting of creating the PEFT model using the PEFT config that is passed. After that, when you call `trainer.train()`, [`~trl.SFTTrainer`] internally uses 🤗 Accelerate to prepare the model, optimizer and trainer using the DeepSpeed config, creating a DeepSpeed engine which is then trained. The main code snippet is below:
+
+```python
+# trainer
+trainer = SFTTrainer(
+    model=model,
+    processing_class=tokenizer,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    peft_config=peft_config,
+)
+trainer.accelerator.print(f"{trainer.model}")
+
+# train
+checkpoint = None
+if training_args.resume_from_checkpoint is not None:
+    checkpoint = training_args.resume_from_checkpoint
+trainer.train(resume_from_checkpoint=checkpoint)
+
+# saving final model
+trainer.save_model()
+```
+
+## Memory usage
+
+In the above example, the memory consumed per GPU is 64 GB (80%) as seen in the screenshot below:
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/peft_deepspeed_mem_usage.png"/>
+</div>
+<small>GPU memory usage for the training run</small>
+
+## More resources
+You can also refer to the blog post [Falcon 180B Finetuning using 🤗 PEFT and DeepSpeed](https://medium.com/@sourabmangrulkar/falcon-180b-finetuning-using-peft-and-deepspeed-b92643091d99) on how to finetune the 180B Falcon model on 16 A100 GPUs on 2 machines.
+
+
+# Use PEFT QLoRA and DeepSpeed with ZeRO3 for finetuning large models on multiple GPUs
+
+In this section, we will look at how to use QLoRA and DeepSpeed Stage-3 for finetuning the 70B Llama model on 2X40GB GPUs.
+For this, we first need `bitsandbytes>=0.43.3`, `accelerate>=1.0.1`, `transformers>4.44.2`, `trl>0.11.4` and `peft>0.13.0`. We need to set `zero3_init_flag` to true when using Accelerate config. Below is the config which can be found at [deepspeed_config_z3_qlora.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/deepspeed_config_z3_qlora.yaml):
+
+```yml
+compute_environment: LOCAL_MACHINE
+debug: false
+deepspeed_config:
+  deepspeed_multinode_launcher: standard
+  offload_optimizer_device: none
+  offload_param_device: none
+  zero3_init_flag: true
+  zero3_save_16bit_model: true
+  zero_stage: 3
+distributed_type: DEEPSPEED
+downcast_bf16: 'no'
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 2
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+The launch command is available at [run_peft_qlora_deepspeed_stage3.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_qlora_deepspeed_stage3.sh) and is also shown below:
+```bash
+accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  train.py \
+--seed 100 \
+--model_name_or_path "meta-llama/Llama-2-70b-hf" \
+--dataset_name "smangrul/ultrachat-10k-chatml" \
+--chat_template_format "chatml" \
+--add_special_tokens False \
+--append_concat_token False \
+--splits "train,test" \
+--max_seq_len 2048 \
+--num_train_epochs 1 \
+--logging_steps 5 \
+--log_level "info" \
+--logging_strategy "steps" \
+--eval_strategy "epoch" \
+--save_strategy "epoch" \
+--push_to_hub \
+--hub_private_repo True \
+--hub_strategy "every_save" \
+--bf16 True \
+--packing True \
+--learning_rate 1e-4 \
+--lr_scheduler_type "cosine" \
+--weight_decay 1e-4 \
+--warmup_ratio 0.0 \
+--max_grad_norm 1.0 \
+--output_dir "llama-sft-qlora-dsz3" \
+--per_device_train_batch_size 2 \
+--per_device_eval_batch_size 2 \
+--gradient_accumulation_steps 2 \
+--gradient_checkpointing True \
+--use_reentrant True \
+--dataset_text_field "content" \
+--use_flash_attn True \
+--use_peft_lora True \
+--lora_r 8 \
+--lora_alpha 16 \
+--lora_dropout 0.1 \
+--lora_target_modules "all-linear" \
+--use_4bit_quantization True \
+--use_nested_quant True \
+--bnb_4bit_compute_dtype "bfloat16" \
+--bnb_4bit_quant_storage_dtype "bfloat16"
+```
+
+Notice the new argument `bnb_4bit_quant_storage_dtype`, which denotes the data type used for packing the 4-bit parameters. For example, when it is set to `bfloat16`, which is 16 bits wide, **16/4 = 4** 4-bit params are packed together post quantization.
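
The packing factor is simply the bit width of the storage dtype divided by the 4 bits of each quantized parameter. The sketch below spells out that arithmetic for a few dtypes. The lookup table and function name are illustrative assumptions, not part of the `bitsandbytes` API.

```python
# Bit widths of common storage dtypes (illustrative table, not a bitsandbytes API).
STORAGE_BITS = {"uint8": 8, "float16": 16, "bfloat16": 16, "float32": 32}


def packed_params_per_element(storage_dtype: str, quant_bits: int = 4) -> int:
    """How many quantized values fit into one element of the storage dtype."""
    return STORAGE_BITS[storage_dtype] // quant_bits


for name in STORAGE_BITS:
    print(f"{name}: {packed_params_per_element(name)} x 4-bit params per element")
```

The packing factor does not change the total memory used by the quantized weights; it only changes the dtype in which they are stored, which is why the model `dtype` has to match it for ZeRO-3 sharding.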
+
+In terms of training code, the important code changes are: 
+
+```diff
+...
+
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=args.use_4bit_quantization,
+    bnb_4bit_quant_type=args.bnb_4bit_quant_type,
+    bnb_4bit_compute_dtype=compute_dtype,
+    bnb_4bit_use_double_quant=args.use_nested_quant,
++   bnb_4bit_quant_storage=quant_storage_dtype,
+)
+
+...
+
+model = AutoModelForCausalLM.from_pretrained(
+    args.model_name_or_path,
+    quantization_config=bnb_config,
+    trust_remote_code=True,
+    attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
++   dtype=quant_storage_dtype or torch.float32,
+)
+```
+
+Notice that `dtype` for `AutoModelForCausalLM` is the same as the `bnb_4bit_quant_storage` data type. That's it. Everything else is handled by Trainer and TRL.
+
+## Memory usage
+
+In the above example, the memory consumed per GPU is **36.6 GB**. Therefore, what took 8X80GB GPUs with DeepSpeed Stage 3+LoRA and a couple of 80GB GPUs with DDP+QLoRA now requires 2X40GB GPUs. This makes finetuning of large models more accessible.
+
+# Use PEFT and DeepSpeed with ZeRO3 and CPU Offloading for finetuning large models on a single GPU
+This section of the guide will help you learn how to use our DeepSpeed [training script](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You'll configure the script to train a large model for conditional generation with ZeRO-3 and CPU Offload.
+
+> [!TIP]
+> 💡 To help you get started, check out our example training scripts for [causal language modeling](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py) and [conditional generation](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You can adapt these scripts for your own applications or even use them out of the box if your task is similar to the one in the scripts.
+
+## Configuration
+
+Start by running the following command to [create a DeepSpeed configuration file](https://huggingface.co/docs/accelerate/quicktour#launching-your-distributed-script) with 🤗 Accelerate. The `--config_file` flag allows you to save the configuration file to a specific location, otherwise it is saved as a `default_config.yaml` file in the 🤗 Accelerate cache.
+
+The configuration file is used to set the default options when you launch the training script.
+
+```bash
+accelerate config --config_file ds_zero3_cpu.yaml
+```
+
+You'll be asked a few questions about your setup, and configure the following arguments. In this example, you'll use ZeRO-3 along with CPU-Offload so make sure you pick those options.
+
+```bash
+`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
+`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
+`gradient_clipping`: Enable gradient clipping with value.
+`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
+`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
+`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
+`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
+`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training. 
+```
+
+An example [configuration file](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/accelerate_ds_zero3_cpu_offload_config.yaml) might look like the following. The most important thing to notice is that `zero_stage` is set to `3`, and `offload_optimizer_device` and `offload_param_device` are set to the `cpu`.
+
+```yml
+compute_environment: LOCAL_MACHINE
+deepspeed_config:
+  gradient_accumulation_steps: 1
+  gradient_clipping: 1.0
+  offload_optimizer_device: cpu
+  offload_param_device: cpu
+  zero3_init_flag: true
+  zero3_save_16bit_model: true
+  zero_stage: 3
+distributed_type: DEEPSPEED
+downcast_bf16: 'no'
+dynamo_backend: 'NO'
+fsdp_config: {}
+machine_rank: 0
+main_training_function: main
+megatron_lm_config: {}
+mixed_precision: 'no'
+num_machines: 1
+num_processes: 1
+rdzv_backend: static
+same_network: true
+use_cpu: false
+```
+
+## The important parts
+
+Let's dive a little deeper into the script so you can see what's going on, and understand how it works.
+
+Within the [`main`](https://github.com/huggingface/peft/blob/2822398fbe896f25d4dac5e468624dc5fd65a51b/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py#L103) function, the script creates an [`~accelerate.Accelerator`] class to initialize all the necessary requirements for distributed training.
+
+> [!TIP]
+> 💡 Feel free to change the model and dataset inside the `main` function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function.
+
+The script also creates a configuration for the 🤗 PEFT method you're using, which in this case, is LoRA. The [`LoraConfig`] specifies the task type and important parameters such as the dimension of the low-rank matrices, the matrices scaling factor, and the dropout probability of the LoRA layers. If you want to use a different 🤗 PEFT method, make sure you replace `LoraConfig` with the appropriate [class](../package_reference/tuners).
+
+```diff
+ def main():
++    accelerator = Accelerator()
+     model_name_or_path = "facebook/bart-large"
+     dataset_name = "twitter_complaints"
++    peft_config = LoraConfig(
+         task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
+     )
+```
+
+Throughout the script, you'll see the [`~accelerate.Accelerator.main_process_first`] and [`~accelerate.Accelerator.wait_for_everyone`] functions which help control and synchronize when processes are executed.
+
+The [`get_peft_model`] function takes a base model and the `peft_config` you prepared earlier to create a [`PeftModel`]:
+
+```diff
+  model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
+ model = get_peft_model(model, peft_config)
+```
+
+Pass all the relevant training objects to 🤗 Accelerate's [`~accelerate.Accelerator.prepare`] which makes sure everything is ready for training:
+
+```py
+model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare(
+    model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler
+)
+```
+
+The next bit of code checks whether the DeepSpeed plugin is used in the `Accelerator` and, if the plugin exists, whether ZeRO-3 is enabled. This flag is used when calling the `generate` function during inference to sync GPUs when the model parameters are sharded:
+
+```py
+is_ds_zero_3 = False
+if getattr(accelerator.state, "deepspeed_plugin", None):
+    is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3
+```
+
+Inside the training loop, the usual `loss.backward()` is replaced by 🤗 Accelerate's [`~accelerate.Accelerator.backward`] which uses the correct `backward()` method based on your configuration:
+
+```diff
+  for epoch in range(num_epochs):
+      with TorchTracemalloc() as tracemalloc:
+          model.train()
+          total_loss = 0
+          for step, batch in enumerate(tqdm(train_dataloader)):
+              outputs = model(**batch)
+              loss = outputs.loss
+              total_loss += loss.detach().float()
+             accelerator.backward(loss)
+              optimizer.step()
+              lr_scheduler.step()
+              optimizer.zero_grad()
+```
+
+That is all! The rest of the script handles the training loop, evaluation, and even pushes the model to the Hub for you.
+
+## Train
+
+Run the following command to launch the training script. Earlier, you saved the configuration file to `ds_zero3_cpu.yaml`, so you'll need to pass the path to the launcher with the `--config_file` argument like this:
+
+```bash
+accelerate launch --config_file ds_zero3_cpu.yaml examples/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
+```
+
+You'll see some output logs that track memory usage during training, and once it's completed, the script returns the accuracy and compares the predictions to the labels:
+
+```bash
+GPU Memory before entering the train : 1916
+GPU Memory consumed at the end of the train (end-begin): 66
+GPU Peak Memory consumed during the train (max-begin): 7488
+GPU Total Peak Memory consumed during the train (max): 9404
+CPU Memory before entering the train : 19411
+CPU Memory consumed at the end of the train (end-begin): 0
+CPU Peak Memory consumed during the train (max-begin): 0
+CPU Total Peak Memory consumed during the train (max): 19411
+epoch=4: train_ppl=tensor(1.0705, device='cuda:0') train_epoch_loss=tensor(0.0681, device='cuda:0')
+100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:27<00:00,  3.92s/it]
+GPU Memory before entering the eval : 1982
+GPU Memory consumed at the end of the eval (end-begin): -66
+GPU Peak Memory consumed during the eval (max-begin): 672
+GPU Total Peak Memory consumed during the eval (max): 2654
+CPU Memory before entering the eval : 19411
+CPU Memory consumed at the end of the eval (end-begin): 0
+CPU Peak Memory consumed during the eval (max-begin): 0
+CPU Total Peak Memory consumed during the eval (max): 19411
+accuracy=100.0
+eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
+dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
+```
+
+# Caveats
+1. Merging when using PEFT and DeepSpeed is currently unsupported and will raise an error.
+2. When using CPU offloading, the major gains from using PEFT to shrink the optimizer states and gradients to the size of the adapter weights are realized in CPU RAM; there are no savings in GPU memory.
+3. DeepSpeed Stage 3 and QLoRA, when used with CPU offloading, lead to more GPU memory usage than when CPU offloading is disabled.
+
+> [!TIP]
+> 💡 When you have code that requires merging (and unmerging) of weights, try to manually collect the parameters with DeepSpeed Zero-3 beforehand:
+>
+> ```python
+> import deepspeed
+>
+> is_ds_zero_3 = ... # check if Zero-3
+>
+> with deepspeed.zero.GatheredParameters(list(model.parameters()), enabled=is_ds_zero_3):
+>     model.merge_adapter()
+>     # do whatever is needed, then unmerge in the same context if unmerging is required
+>     ...
+>     model.unmerge_adapter()
+> ```
--- a/docs/source/accelerate/fsdp.md
+++ b/docs/source/accelerate/fsdp.md
@ -0,0 +1,285 @@
+<!--⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Fully Sharded Data Parallel
+
+[Fully sharded data parallel](https://pytorch.org/docs/stable/fsdp.html) (FSDP) is developed for distributed training of large pretrained models up to 1T parameters. FSDP achieves this by sharding the model parameters, gradients, and optimizer states across data parallel processes and it can also offload sharded model parameters to a CPU. The memory efficiency afforded by FSDP allows you to scale training to larger batch or model sizes.
+
+Both sharding and CPU offloading are supported in 🤗 Accelerate, and you can use them with 🤗 PEFT.
+
+# Use PEFT and FSDP
+This section of the guide will help you learn how to use our [training script](https://github.com/huggingface/peft/blob/main/examples/sft/train.py) for performing SFT. You'll configure the script to do SFT (supervised fine-tuning) of the Llama-70B model with LoRA and FSDP on 8xH100 80GB GPUs on a single machine. You can configure it to scale to multiple machines by changing the accelerate config.
+
+## Configuration
+
+Start by running the following command to [create an FSDP configuration file](https://huggingface.co/docs/accelerate/quicktour#launching-your-distributed-script) with 🤗 Accelerate. The `--config_file` flag allows you to save the configuration file to a specific location; otherwise it is saved as a `default_config.yaml` file in the 🤗 Accelerate cache.
+
+The configuration file is used to set the default options when you launch the training script.
+
+```bash
+accelerate config --config_file fsdp_config.yaml
+```
+
+You'll be asked a few questions about your setup, and configure the following arguments. In this example, you'll answer the questionnaire as shown in the image below.
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/fsdp-peft-config.png"/>
+</div>
+<small>Creating Accelerate's config to use FSDP</small>
+
+Once this is done, the corresponding config should look like the one below. You can also find it in the config folder at [fsdp_config.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/fsdp_config.yaml):
+
+```yml
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch: BACKWARD_PRE
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_forward_prefetch: false
+  fsdp_offload_params: false
+  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sync_module_states: true
+  fsdp_use_orig_params: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 8
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+## Launch command
+
+The launch command is available at [run_peft_fsdp.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_fsdp.sh) and it is also shown below:
+```bash
+accelerate launch --config_file "configs/fsdp_config.yaml"  train.py \
+--seed 100 \
+--model_name_or_path "meta-llama/Llama-2-70b-hf" \
+--dataset_name "smangrul/ultrachat-10k-chatml" \
+--chat_template_format "chatml" \
+--add_special_tokens False \
+--append_concat_token False \
+--splits "train,test" \
+--max_seq_len 2048 \
+--num_train_epochs 1 \
+--logging_steps 5 \
+--log_level "info" \
+--logging_strategy "steps" \
+--eval_strategy "epoch" \
+--save_strategy "epoch" \
+--push_to_hub \
+--hub_private_repo True \
+--hub_strategy "every_save" \
+--bf16 True \
+--packing True \
+--learning_rate 1e-4 \
+--lr_scheduler_type "cosine" \
+--weight_decay 1e-4 \
+--warmup_ratio 0.0 \
+--max_grad_norm 1.0 \
+--output_dir "llama-sft-lora-fsdp" \
+--per_device_train_batch_size 8 \
+--per_device_eval_batch_size 8 \
+--gradient_accumulation_steps 4 \
+--gradient_checkpointing True \
+--use_reentrant False \
+--dataset_text_field "content" \
+--use_flash_attn True \
+--use_peft_lora True \
+--lora_r 8 \
+--lora_alpha 16 \
+--lora_dropout 0.1 \
+--lora_target_modules "all-linear" \
+--use_4bit_quantization False
+```
+
+Notice that we are using LoRA with rank=8, alpha=16 and targeting all linear layers. We are passing the FSDP config file and finetuning the 70B Llama model on a subset of the [ultrachat dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).
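+
+As a rough worked example of what these settings imply, the effective batch size per optimizer step follows from the values above (`num_processes: 8` in the config, `--per_device_train_batch_size 8`, `--gradient_accumulation_steps 4`). The token count is only an upper bound, under the assumption that packing fills every sequence up to `--max_seq_len 2048`:
+
+```latex
+\begin{aligned}
+\text{sequences per optimizer step} &= 8 \times 8 \times 4 = 256 \\
+\text{tokens per optimizer step} &\le 256 \times 2048 = 524{,}288
+\end{aligned}
+```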
+
+## The important parts
+
+Let's dive a little deeper into the script so you can see what's going on, and understand how it works.
+
+The first thing to know is that the script uses FSDP for distributed training because the FSDP config has been passed. The [`~trl.SFTTrainer`] class handles all the heavy lifting of creating the PEFT model using the PEFT config that is passed. After that, when you call `trainer.train()`, the Trainer internally uses 🤗 Accelerate to prepare the model, optimizer and trainer using the FSDP config, creating an FSDP-wrapped model which is then trained. The main code snippet is below:
+
+```python
+# trainer
+trainer = SFTTrainer(
+    model=model,
+    processing_class=tokenizer,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    peft_config=peft_config,
+)
+trainer.accelerator.print(f"{trainer.model}")
+if model_args.use_peft_lora:
+    # handle PEFT+FSDP case
+    trainer.model.print_trainable_parameters()
+    if getattr(trainer.accelerator.state, "fsdp_plugin", None):
+        from peft.utils.other import fsdp_auto_wrap_policy
+
+        fsdp_plugin = trainer.accelerator.state.fsdp_plugin
+        fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(trainer.model)
+
+# train
+checkpoint = None
+if training_args.resume_from_checkpoint is not None:
+    checkpoint = training_args.resume_from_checkpoint
+trainer.train(resume_from_checkpoint=checkpoint)
+
+# saving final model
+if trainer.is_fsdp_enabled:
+    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
+trainer.save_model()
+```
+
+
+One main thing to note when using FSDP with PEFT is that `use_orig_params` currently needs to be `False` to realize GPU memory savings. Because of `use_orig_params=False`, the auto wrap policy for FSDP needs to change so that trainable and non-trainable parameters are wrapped separately. This is done by the code snippet below, which uses the utility function `fsdp_auto_wrap_policy` from PEFT:
+
+```python
+if getattr(trainer.accelerator.state, "fsdp_plugin", None):
+    from peft.utils.other import fsdp_auto_wrap_policy
+
+    fsdp_plugin = trainer.accelerator.state.fsdp_plugin
+    fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(trainer.model)
+```
+
+## Memory usage
+
+In the above example, the memory consumed per GPU is 72-80 GB (90-98%) as seen in the screenshot below. The slight increase in GPU memory at the end occurs when saving the model using the `FULL_STATE_DICT` state dict type instead of `SHARDED_STATE_DICT`, so that the model has adapter weights that can be loaded normally with the `from_pretrained` method during inference:
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/peft_fsdp_mem_usage.png"/>
+</div>
+<small>GPU memory usage for the training run</small>
+
+# Use PEFT QLoRA and FSDP for finetuning large models on multiple GPUs
+
+In this section, we will look at how to use QLoRA and FSDP for finetuning a 70B Llama model on 2x24GB GPUs. [Answer.AI](https://www.answer.ai/), in collaboration with bitsandbytes and Hugging Face 🤗, open sourced code enabling the usage of FSDP+QLoRA and explained the whole process in their insightful blog post [You can now train a 70b language model at home](https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html). This is now integrated in the Hugging Face ecosystem.
+
+For this, we first need `bitsandbytes>=0.43.3`, `accelerate>=1.0.1`, `transformers>4.44.2`, `trl>0.11.4` and `peft>0.13.0`. When using the Accelerate config, we need to set `fsdp_cpu_ram_efficient_loading=true`, `fsdp_use_orig_params=false` and `fsdp_offload_params=true` (CPU offloading). When not using the accelerate launcher, you can alternatively set the environment variable `export FSDP_CPU_RAM_EFFICIENT_LOADING=true`. Here, we will be using the Accelerate config, which can be found at [fsdp_config_qlora.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/fsdp_config_qlora.yaml):
+
+```yml
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch: BACKWARD_PRE
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_forward_prefetch: false
+  fsdp_offload_params: true
+  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sync_module_states: true
+  fsdp_use_orig_params: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: 'no'
+num_machines: 1
+num_processes: 2
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+```
+
+The launch command is available at [run_peft_qlora_fsdp.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_qlora_fsdp.sh) and it is also shown below:
+```bash
+accelerate launch --config_file "configs/fsdp_config_qlora.yaml"  train.py \
+--seed 100 \
+--model_name_or_path "meta-llama/Llama-2-70b-hf" \
+--dataset_name "smangrul/ultrachat-10k-chatml" \
+--chat_template_format "chatml" \
+--add_special_tokens False \
+--append_concat_token False \
+--splits "train,test" \
+--max_seq_len 2048 \
+--num_train_epochs 1 \
+--logging_steps 5 \
+--log_level "info" \
+--logging_strategy "steps" \
+--eval_strategy "epoch" \
+--save_strategy "epoch" \
+--push_to_hub \
+--hub_private_repo True \
+--hub_strategy "every_save" \
+--bf16 True \
+--packing True \
+--learning_rate 1e-4 \
+--lr_scheduler_type "cosine" \
+--weight_decay 1e-4 \
+--warmup_ratio 0.0 \
+--max_grad_norm 1.0 \
+--output_dir "llama-sft-qlora-fsdp" \
+--per_device_train_batch_size 2 \
+--per_device_eval_batch_size 2 \
+--gradient_accumulation_steps 2 \
+--gradient_checkpointing True \
+--use_reentrant True \
+--dataset_text_field "content" \
+--use_flash_attn True \
+--use_peft_lora True \
+--lora_r 8 \
+--lora_alpha 16 \
+--lora_dropout 0.1 \
+--lora_target_modules "all-linear" \
+--use_4bit_quantization True \
+--use_nested_quant True \
+--bnb_4bit_compute_dtype "bfloat16" \
+--bnb_4bit_quant_storage_dtype "bfloat16"
+```
+
+Notice the new argument being passed, `bnb_4bit_quant_storage_dtype`, which denotes the data type for packing the 4-bit parameters. For example, when it is set to `bfloat16`, **16/4 = 4** 4-bit params are packed together post quantization. When using mixed precision training with `bfloat16`, `bnb_4bit_quant_storage_dtype` can be either `bfloat16` for pure `bfloat16` finetuning, or `float32` for automatic mixed precision (this consumes more GPU memory). When using mixed precision training with `float16`, `bnb_4bit_quant_storage_dtype` should be set to `float32` for stable automatic mixed precision training.
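+
+The packing arithmetic can be written out as follows. This is only an illustrative sketch: the symbols $b$ (bit width of the storage data type) and $N$ (number of quantized parameters in a weight) are notation introduced here, and the quantization constants are ignored:
+
+```latex
+\begin{aligned}
+\text{params per storage element} &= \frac{b}{4}
+  \quad\Rightarrow\quad \texttt{bfloat16}: \frac{16}{4} = 4, \qquad \texttt{float32}: \frac{32}{4} = 8 \\
+\text{storage elements for } N \text{ params} &= \frac{N}{b/4} = \frac{4N}{b} \\
+\text{bytes used} &= \frac{4N}{b} \cdot \frac{b}{8} = \frac{N}{2}
+\end{aligned}
+```
+
+Under these assumptions, the storage data type changes how the 4-bit values are grouped into elements, not the total number of bytes they occupy.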
+
+In terms of training code, the important code changes are: 
+
+```diff
+...
+
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=args.use_4bit_quantization,
+    bnb_4bit_quant_type=args.bnb_4bit_quant_type,
+    bnb_4bit_compute_dtype=compute_dtype,
+    bnb_4bit_use_double_quant=args.use_nested_quant,
+   bnb_4bit_quant_storage=quant_storage_dtype,
+)
+
+...
+
+model = AutoModelForCausalLM.from_pretrained(
+    args.model_name_or_path,
+    quantization_config=bnb_config,
+    trust_remote_code=True,
+    attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
+   dtype=quant_storage_dtype or torch.float32,
+)
+```
+
+Notice that `dtype` for `AutoModelForCausalLM` is the same as the `bnb_4bit_quant_storage` data type. That's it. Everything else is handled by Trainer and TRL.
+
+## Memory usage
+
+In the above example, the memory consumed per GPU is **19.6 GB** while CPU RAM usage is around **107 GB**. When disabling CPU offloading, the GPU memory usage is **35.6 GB per GPU**. Therefore, what took 16X80GB GPUs for full finetuning, 8X80GB GPUs with FSDP+LoRA, and a couple of 80GB GPUs with DDP+QLoRA, now requires 2X24GB GPUs. This makes finetuning of large models more accessible.
+
+## More resources
+You can also refer to the [llama-recipes](https://github.com/facebookresearch/llama-recipes/?tab=readme-ov-file#fine-tuning) repo and the [Getting started with Llama](https://llama.meta.com/get-started/#fine-tuning) guide on how to finetune using FSDP and PEFT.
+
+## Caveats
+1. Merging when using PEFT and FSDP is currently unsupported and will raise an error.
+2. Passing the `modules_to_save` config parameter is untested at present.
+3. GPU memory savings when using CPU offloading are untested at present.
+4. When using FSDP+QLoRA, `paged_adamw_8bit` currently results in an error when saving a checkpoint.
+5. DoRA training with FSDP should work (albeit at lower speed than LoRA). If combined with bitsandbytes (QDoRA), 4-bit quantization should also work, but 8-bit quantization has known issues and is not recommended.
--- a/docs/source/accelerate/fsdp.mdx
+++ b/docs/source/accelerate/fsdp.mdx
@ -1,124 +0,0 @@
-# Fully Sharded Data Parallel
-
-[Fully sharded data parallel](https://pytorch.org/docs/stable/fsdp.html) (FSDP) is developed for distributed training of large pretrained models up to 1T parameters. FSDP achieves this by sharding the model parameters, gradients, and optimizer states across data parallel processes and it can also offload sharded model parameters to a CPU. The memory efficiency afforded by FSDP allows you to scale training to larger batch or model sizes.
-
-<Tip warning={true}>
-
-Currently, FSDP does not confer any reduction in GPU memory usage and FSDP with CPU offload actually consumes 1.65x more GPU memory during training. You can track this PyTorch [issue](https://github.com/pytorch/pytorch/issues/91165) for any updates.
-
-</Tip>
-
-FSDP is supported in 🤗 Accelerate, and you can use it with 🤗 PEFT. This guide will help you learn how to use our FSDP [training script](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_fsdp.py). You'll configure the script to train a large model for conditional generation.
-
-## Configuration
-
-Begin by running the following command to [create a FSDP configuration file](https://huggingface.co/docs/accelerate/main/en/usage_guides/fsdp) with 🤗 Accelerate. Use the `--config_file` flag to save the configuration file to a specific location, otherwise it is saved as a `default_config.yaml` file in the 🤗 Accelerate cache.
-
-The configuration file is used to set the default options when you launch the training script.
-
-```bash
-accelerate config --config_file fsdp_config.yaml
-```
-
-You'll be asked a few questions about your setup, and configure the following arguments. For this example, make sure you fully shard the model parameters, gradients, optimizer states, leverage the CPU for offloading, and wrap model layers based on the Transformer layer class name.
-
-```bash
-`Sharding Strategy`: [1] FULL_SHARD (shards optimizer states, gradients and parameters), [2] SHARD_GRAD_OP (shards optimizer states and gradients), [3] NO_SHARD
-`Offload Params`: Decides Whether to offload parameters and gradients to CPU
-`Auto Wrap Policy`: [1] TRANSFORMER_BASED_WRAP, [2] SIZE_BASED_WRAP, [3] NO_WRAP 
-`Transformer Layer Class to Wrap`: When using `TRANSFORMER_BASED_WRAP`, user specifies comma-separated string of transformer layer class names (case-sensitive) to wrap ,e.g, 
-`BertLayer`, `GPTJBlock`, `T5Block`, `BertLayer,BertEmbeddings,BertSelfOutput`...
-`Min Num Params`: minimum number of parameters when using `SIZE_BASED_WRAP`
-`Backward Prefetch`: [1] BACKWARD_PRE, [2] BACKWARD_POST, [3] NO_PREFETCH
-`State Dict Type`: [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT  
-```
-
-For example, your FSDP configuration file may look like the following:
-
-```yaml
-command_file: null
-commands: null
-compute_environment: LOCAL_MACHINE
-deepspeed_config: {}
-distributed_type: FSDP
-downcast_bf16: 'no'
-dynamo_backend: 'NO'
-fsdp_config:
-  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  fsdp_backward_prefetch_policy: BACKWARD_PRE
-  fsdp_offload_params: true
-  fsdp_sharding_strategy: 1
-  fsdp_state_dict_type: FULL_STATE_DICT
-  fsdp_transformer_layer_cls_to_wrap: T5Block
-gpu_ids: null
-machine_rank: 0
-main_process_ip: null
-main_process_port: null
-main_training_function: main
-megatron_lm_config: {}
-mixed_precision: 'no'
-num_machines: 1
-num_processes: 2
-rdzv_backend: static
-same_network: true
-tpu_name: null
-tpu_zone: null
-use_cpu: false
-```
-
-## The important parts
-
-Let's dig a bit deeper into the training script to understand how it works.
-
-The [`main()`](https://github.com/huggingface/peft/blob/2822398fbe896f25d4dac5e468624dc5fd65a51b/examples/conditional_generation/peft_lora_seq2seq_accelerate_fsdp.py#L14) function begins with initializing an [`~accelerate.Accelerator`] class which handles everything for distributed training, such as automatically detecting your training environment.
-
-<Tip>
-
-💡 Feel free to change the model and dataset inside the `main` function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function. 
-
-</Tip>
-
-The script also creates a configuration corresponding to the 🤗 PEFT method you're using. For LoRA, you'll use [`LoraConfig`] to specify the task type, and several other important parameters such as the dimension of the low-rank matrices, the matrices scaling factor, and the dropout probability of the LoRA layers. If you want to use a different 🤗 PEFT method, replace `LoraConfig` with the appropriate [class](../package_reference/tuners).
-
-Next, the script wraps the base model and `peft_config` with the [`get_peft_model`] function to create a [`PeftModel`]. 
-
-```diff
- def main():
-+    accelerator = Accelerator()
-     model_name_or_path = "t5-base"
-     base_path = "temp/data/FinancialPhraseBank-v1.0"
-+    peft_config = LoraConfig(
-         task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
-     )
-    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
-+   model = get_peft_model(model, peft_config)
-```
-
-Throughout the script, you'll see the [`~accelerate.Accelerator.main_process_first`] and [`~accelerate.Accelerator.wait_for_everyone`] functions which help control and synchronize when processes are executed.
-
-After your dataset is prepared, and all the necessary training components are loaded, the script checks if you're using the `fsdp_plugin`. PyTorch offers two ways for wrapping model layers in FSDP, automatically or manually. The simplest method is to allow FSDP to automatically recursively wrap model layers without changing any other code. You can choose to wrap the model layers based on the layer name or on the size (number of parameters). In the FSDP configuration file, it uses the `TRANSFORMER_BASED_WRAP` option to wrap the [`T5Block`] layer.
-
-```py
-if getattr(accelerator.state, "fsdp_plugin", None) is not None:
-    accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)
-```
-
-Next, use 🤗 Accelerate's [`~accelerate.Accelerator.prepare`] function to prepare the model, datasets, optimizer, and scheduler for training.
-
-```py
-model, train_dataloader, eval_dataloader, optimizer, lr_scheduler = accelerator.prepare(
-    model, train_dataloader, eval_dataloader, optimizer, lr_scheduler
-)
-```
-
-From here, the remainder of the script handles the training loop, evaluation, and sharing your model to the Hub.
-
-## Train
-
-Run the following command to launch the training script. Earlier, you saved the configuration file to `fsdp_config.yaml`, so you'll need to pass the path to the launcher with the `--config_file` argument like this:
-
-```bash
-accelerate launch --config_file fsdp_config.yaml examples/peft_lora_seq2seq_accelerate_fsdp.py
-```
-
-Once training is complete, the script returns the accuracy and compares the predictions to the labels.
--- a/docs/source/conceptual_guides/adapter.md
+++ b/docs/source/conceptual_guides/adapter.md
@ -0,0 +1,136 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Adapters
+
+Adapter-based methods add extra trainable parameters after the attention and fully-connected layers of a frozen pretrained model to reduce memory usage and speed up training. The method varies depending on the adapter: it could simply be an extra added layer, or it could express the weight updates ∆W as a low-rank decomposition of the weight matrix. Either way, the adapters are typically small but demonstrate comparable performance to a fully finetuned model and enable training larger models with fewer resources.
+
+This guide will give you a brief overview of the adapter methods supported by PEFT (if you're interested in learning more details about a specific method, take a look at the linked paper).
+
+## Low-Rank Adaptation (LoRA)
+
+> [!TIP]
+> LoRA is one of the most popular PEFT methods and a good starting point if you're just getting started with PEFT. It was originally developed for large language models, but it is also a tremendously popular training method for diffusion models because of its efficiency and effectiveness.
+
+As mentioned briefly earlier, [LoRA](https://hf.co/papers/2106.09685) is a technique that accelerates finetuning large models while consuming less memory.
+
+LoRA represents the weight updates ∆W with two smaller matrices (called *update matrices*) through low-rank decomposition. These new matrices can be trained to adapt to the new data while keeping the overall number of parameters low. The original weight matrix remains frozen and doesn't receive any further updates. To produce the final results, the original and extra adapted weights are combined. You could also merge the adapter weights with the base model to eliminate inference latency.
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_animated.gif"/>
+</div>
+
+This approach has a number of advantages:
+
+* LoRA makes finetuning more efficient by drastically reducing the number of trainable parameters.
+* The original pretrained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
+* LoRA is orthogonal to other parameter-efficient methods and can be combined with many of them.
+* Performance of models finetuned using LoRA is comparable to the performance of fully finetuned models.
+
+In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. However, for simplicity and further parameter efficiency, LoRA is typically only applied to the attention blocks in Transformer models. The resulting number of trainable parameters in a LoRA model depends on the size of the update matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.
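+
+To make the parameter count concrete, here is the usual LoRA formulation as a short worked example. The notation ($W_0$, $B$, $A$, $d$, $k$, $\alpha$) follows the LoRA paper rather than this guide, and the 4096 x 4096 layer with `r=8` is an assumed size chosen only for illustration:
+
+```latex
+\begin{aligned}
+W &= W_0 + \Delta W, \qquad \Delta W = \frac{\alpha}{r}\, B A,
+  \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k} \\
+\#\text{trainable params} &= r\,(d + k) \quad \text{instead of} \quad d\,k \\
+d = k = 4096,\; r = 8:\quad & 8 \times (4096 + 4096) = 65{,}536
+  \quad \text{vs.} \quad 4096^2 = 16{,}777{,}216 \;\; (\approx 0.39\%)
+\end{aligned}
+```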
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora.png"/>
+</div>
+<small><a href="https://hf.co/papers/2309.14859">Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation</a></small>
+
+## Mixture of LoRA Experts (X-LoRA)
+
+[X-LoRA](https://huggingface.co/papers/2402.07148) is a mixture of experts method for LoRA which works by using dense or sparse gating to dynamically activate LoRA experts. The LoRA experts as well as the base model are frozen during training, resulting in a low parameter count as only the gating layers must be trained. In particular, the gating layers output scalings which (depending on config) are granular on the layer and token level. Additionally, during inference, X-LoRA dynamically activates LoRA adapters to recall knowledge and effectively mix them.
+
+The below graphic demonstrates how the scalings change for different prompts for each token. This highlights the activation of different adapters as the generation progresses and the sequence creates new context.
+
+![Token-by-token scalings](https://github.com/EricLBuehler/xlora/raw/master/res/token_by_token_scalings.gif)
+
+For each step, X-LoRA requires the base model to be run twice. The first run computes hidden states without any LoRA adapters. These hidden states are then used to calculate the scalings, which are applied to the LoRA adapters, and the model is run a second time. The output of the second run is the result of the model step.
+
+Ultimately, X-LoRA allows the model to reflect upon its knowledge because of the dual forward pass scheme, and dynamically reconfigure the architecture.
+
+## Low-Rank Hadamard Product (LoHa)
+
+Low-rank decomposition can impact performance because the weight updates are limited to the low-rank space, which can constrain a model's expressiveness. However, you don't necessarily want to use a larger rank because it increases the number of trainable parameters. To address this, [LoHa](https://huggingface.co/papers/2108.06098) (a method originally developed for computer vision) was applied to diffusion models where the ability to generate diverse images is an important consideration. LoHa should also work with general model types, but the embedding layers aren't currently implemented in PEFT.
+
+LoHa uses the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) (element-wise product) instead of the matrix product. ∆W is represented by four smaller matrices instead of two, as in LoRA. Each pair of these low-rank matrices is multiplied together, and the two resulting products are combined with the Hadamard product. As a result, ∆W can have the same number of trainable parameters but a higher rank and expressivity.
+
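A small NumPy check can make the rank argument concrete. The sketch below compares a LoRA update of rank `2r` with a LoHa-style update built from four rank-`r` factors, so both use the same number of parameters. The sizes and variable names are illustrative assumptions, and random matrices stand in for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 32, 32, 4

# LoRA with rank 2r: two matrices, 2r * (m + n) parameters in total.
B, A = rng.normal(size=(m, 2 * r)), rng.normal(size=(2 * r, n))
delta_lora = B @ A

# LoHa with rank r: four matrices, also 2r * (m + n) parameters in total.
B1, A1 = rng.normal(size=(m, r)), rng.normal(size=(r, n))
B2, A2 = rng.normal(size=(m, r)), rng.normal(size=(r, n))
delta_loha = (B1 @ A1) * (B2 @ A2)   # `*` is the element-wise (Hadamard) product

params_lora = B.size + A.size
params_loha = B1.size + A1.size + B2.size + A2.size
rank_lora = np.linalg.matrix_rank(delta_lora, tol=1e-8)
rank_loha = np.linalg.matrix_rank(delta_loha, tol=1e-8)
print(params_lora, params_loha, rank_lora, rank_loha)
```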
+## Low-Rank Kronecker Product (LoKr)
+
+[LoKr](https://hf.co/papers/2309.14859) is very similar to LoRA and LoHa, and it is also mainly applied to diffusion models, though you could also use it with other model types. LoKr replaces the matrix product with the [Kronecker product](https://en.wikipedia.org/wiki/Kronecker_product). The Kronecker product decomposition creates a block matrix which preserves the rank of the original weight matrix. Another benefit of the Kronecker product is that it can be vectorized by stacking the matrix columns. This can speed up the process because you're avoiding fully reconstructing ∆W.
+
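The column-stacking remark refers to the standard identity `(A ⊗ B) vec(X) = vec(B X Aᵀ)`. The NumPy sketch below verifies it on small random factors. The shapes are arbitrary, and it makes no claim about whether PEFT's LoKr layers use this shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # small left factor
B = rng.normal(size=(8, 5))   # small right factor

delta_w = np.kron(A, B)       # block matrix [[a_ij * B]], shape (4*8, 3*5)

# Multiply by the Kronecker product without materializing it:
# (A ⊗ B) vec(X) = vec(B X A^T), where vec() stacks the columns.
x = rng.normal(size=3 * 5)
X = x.reshape((5, 3), order="F")             # undo the column stacking
y_fast = (B @ X @ A.T).flatten(order="F")    # never builds delta_w
y_full = delta_w @ x                         # reference result
```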
+## Orthogonal Finetuning (OFT)
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/oft.png"/>
+</div>
+<small><a href="https://hf.co/papers/2306.07280">Controlling Text-to-Image Diffusion by Orthogonal Finetuning</a></small>
+
+[OFT](https://hf.co/papers/2306.07280) is a method that primarily focuses on preserving a pretrained model's generative performance in the finetuned model. It tries to maintain the same cosine similarity (hyperspherical energy) between all pairwise neurons in a layer because this better captures the semantic information among neurons. This means OFT is more capable at preserving the subject and it is better for controllable generation (similar to [ControlNet](https://huggingface.co/docs/diffusers/using-diffusers/controlnet)).
+
+OFT preserves the hyperspherical energy by learning an orthogonal transformation for neurons to keep the cosine similarity between them unchanged. In practice, this means taking the matrix product of an orthogonal matrix with the pretrained weight matrix. However, to be parameter-efficient, the orthogonal matrix is represented as a block-diagonal matrix with `r` blocks. Whereas LoRA reduces the number of trainable parameters with low-rank structures, OFT reduces the number of trainable parameters with a sparse block-diagonal matrix structure.
+
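As a minimal sketch of this idea, the NumPy snippet below builds a block-diagonal orthogonal matrix from `r` Cayley-parameterized blocks and checks that the pairwise cosine similarities are unchanged. Treating the columns of `W` as the neurons, and the helper names `cayley` and `cosine_gram`, are assumptions made for illustration. PEFT's OFT layers are organized differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2            # layer dimension and number of blocks
b = d // r             # block size, so that r * b == d


def cayley(block_size):
    # A skew-symmetric matrix S (S^T = -S) maps to an orthogonal matrix
    # through the Cayley transform (I + S)(I - S)^{-1}.
    M = rng.normal(size=(block_size, block_size))
    S = M - M.T
    I = np.eye(block_size)
    return (I + S) @ np.linalg.inv(I - S)


# Assemble the block-diagonal orthogonal matrix R from r small blocks.
R = np.zeros((d, d))
for k in range(r):
    R[k * b:(k + 1) * b, k * b:(k + 1) * b] = cayley(b)

W = rng.normal(size=(d, d))   # frozen pretrained weight
W_new = R @ W                 # multiplicative update


def cosine_gram(M):
    # Pairwise cosine similarity between the columns of M.
    unit = M / np.linalg.norm(M, axis=0, keepdims=True)
    return unit.T @ unit
```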
+## Orthogonal Butterfly (BOFT)
+
+[BOFT](https://hf.co/papers/2311.06243) is an improved orthogonal finetuning method that focuses on preserving a pretrained model's generative capabilities while being significantly more parameter-efficient than standard OFT. Like OFT, BOFT maintains the same cosine similarity (hyperspherical energy) between all pairwise neurons in a layer by applying an orthogonal transformation to the pretrained weight matrix, ensuring the semantic relationships among neurons are preserved.
+
+Instead of using a block-diagonal orthogonal matrix, BOFT factorizes the orthogonal transformation into a product of **sparse butterfly matrices** (originally introduced in the [Cooley–Tukey FFT](https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm)). Unlike OFT's block-diagonal rotations, which only mix inputs within each block, the butterfly structure guarantees that every input can influence every output, producing a **dense connectivity** with just `O(d log d)` parameters. This factorization preserves expressivity while drastically reducing the parameter count compared to OFT (at the expense of computation time).
+
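The dense-connectivity claim can be illustrated with the simplest butterfly structure: `log2(d)` sparse factors, each made of 2x2 rotations. This is a hedged toy sketch. The rotation-angle parameterization and the name `butterfly_factor` are assumptions, and PEFT's BOFT uses larger blocks and a different parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # must be a power of two in this sketch
n_factors = d.bit_length() - 1     # log2(d) butterfly factors


def butterfly_factor(level):
    # Sparse orthogonal factor: a 2x2 rotation on every pair (i, i XOR 2**level).
    F = np.zeros((d, d))
    stride = 1 << level
    for i in range(d):
        j = i ^ stride
        if i < j:
            theta = rng.uniform(0.2, 1.3)   # away from 0 and pi/2 so no entry vanishes
            c, s = np.cos(theta), np.sin(theta)
            F[i, i], F[i, j] = c, -s
            F[j, i], F[j, j] = s, c
    return F


factors = [butterfly_factor(level) for level in range(n_factors)]
R = np.linalg.multi_dot(factors)   # orthogonal, and every entry is nonzero
```

Each factor holds only `2 * d` nonzero entries (`d / 2` angles), so the whole product is described by `O(d log d)` numbers even though `R` itself is dense.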
+In practice, BOFT multiplies each pretrained weight matrix by a sequence of butterfly-structured orthogonal factors, enabling efficient and expressive neuron rotations. This makes BOFT well-suited for controllable generation and tasks where maintaining the pretrained model's subject representation is critical, while also scaling to larger models with lower memory and compute overhead.
+
+## Adaptive Low-Rank Adaptation (AdaLoRA)
+
+[AdaLoRA](https://hf.co/papers/2303.10512) manages the parameter budget introduced by LoRA by allocating more parameters - in other words, a higher rank `r` - for important weight matrices that are better adapted for a task and pruning less important ones. The rank is controlled by a method similar to singular value decomposition (SVD). The ∆W is parameterized with two orthogonal matrices and a diagonal matrix which contains singular values. This parametrization method avoids iteratively applying SVD which is computationally expensive. Based on this method, the rank of ∆W is adjusted according to an importance score. ∆W is divided into triplets and each triplet is scored according to its contribution to model performance. Triplets with low importance scores are pruned and triplets with high importance scores are kept for finetuning.
+
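The triplet pruning can be sketched as follows. Note the simplifying assumption: the importance score here is just the magnitude of each diagonal entry, whereas AdaLoRA scores triplets by their contribution to model performance. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r_init, budget = 16, 12, 6, 3

P = rng.normal(size=(m, r_init))       # left singular-vector-like matrix
lam = rng.normal(size=r_init)          # diagonal of Lambda ("singular values")
Q = rng.normal(size=(r_init, n))       # right singular-vector-like matrix

# Triplet i is (P[:, i], lam[i], Q[i, :]). As a stand-in for a real
# importance score, this sketch simply uses |lam[i]|.
importance = np.abs(lam)
keep = np.argsort(importance)[-budget:]    # indices of the top `budget` triplets

mask = np.zeros(r_init)
mask[keep] = 1.0
delta_w = P @ np.diag(lam * mask) @ Q      # pruned triplets contribute nothing
```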
+Training with AdaLoRA has three phases: the initial phase, the budgeting phase and the final phase. In the initial phase, no budgeting is applied, so the ranks are not touched. During the budgeting phase, the process described above is applied and the rank is redistributed according to a budget, aiming to give more important adapters more rank and less important adapters less. In the final phase, budgeting has ended and the ranks have been redistributed, but training may continue for a while with the redistributed ranks to further improve performance.
+
+## Llama-Adapter
+
+[Llama-Adapter](https://hf.co/papers/2303.16199) is a method for adapting Llama into an instruction-following model. To help adapt the model for instruction-following, the adapter is trained with a 52K instruction-output dataset.
+
+A set of learnable adaption prompts are prefixed to the input instruction tokens. These are inserted into the upper layers of the model because it is better to learn with the higher-level semantics of the pretrained model. The instruction-output tokens prefixed to the input guide the adaption prompt to generate a contextual response.
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/llama-adapter.png"/>
+</div>
+<small><a href="https://hf.co/papers/2303.16199">LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention</a></small>
+
+To avoid adding noise to the tokens, the adapter uses zero-initialized attention. On top of this, the adapter adds a learnable gating factor (initialized with zeros) to progressively add information to the model during training. This prevents overwhelming the model's pretrained knowledge with the newly learned instructions.
+
+## Householder Reflection Adaptation (HRA)
+
+[HRA](https://huggingface.co/papers/2405.17484) provides a new perspective connecting LoRA to OFT, which means it can harness the advantages of both strategies: it reduces parameters and computation costs while penalizing the loss of pre-training knowledge.
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/hra.png"/>
+</div>
+<small><a href="https://huggingface.co/papers/2405.17484">Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation</a></small>
+
+HRA constructs a chain of `r` trainable Householder reflections (HRs). Because the Householder reflection matrix is an orthogonal matrix and the product of orthogonal matrices is also an orthogonal matrix, HRA satisfies the theoretical guarantee of Orthogonal Finetuning (OFT). Meanwhile, HRA can also be viewed as a low-rank fine-tuning adapter by rewriting the formula.
+
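Both views can be checked numerically. The sketch below chains `r` Householder reflections and inspects the result. The sizes, the random vectors, and the choice to apply the chain on the right of `W` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4


def householder(u):
    # Reflection across the hyperplane orthogonal to u: H = I - 2 u u^T / (u^T u).
    u = u / np.linalg.norm(u)
    return np.eye(len(u)) - 2.0 * np.outer(u, u)


U = rng.normal(size=(d, r))            # one trainable vector per reflection
H = np.eye(d)
for i in range(r):
    H = H @ householder(U[:, i])       # chain of r reflections

W = rng.normal(size=(d, d))            # frozen pretrained weight
W_new = W @ H                          # orthogonal (OFT-style) update

# Low-rank view: the same update written additively, W_new = W + delta_w.
delta_w = W_new - W
rank_delta = np.linalg.matrix_rank(delta_w, tol=1e-8)
```

Here `H` is orthogonal, while `delta_w` has rank at most `r`, which is the sense in which the same adapter can be read as a low-rank update.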
+The higher `r`, the more trainable parameters, resulting in a larger model capacity and better performance. Besides, due to the chain structure, the orthogonality of HR planes impacts the capacity and regularity of HRA. To achieve a trade-off between the model capacity and regularity, an orthogonality regularizer of the HR planes is added to the loss function. The weight \\(\lambda\\) can control the strength of the regularizer. 
+
+## Bone
+The new version of the Bone paper is [MiSS](https://huggingface.co/papers/2409.15371) (MiSS: Balancing LoRA Performance and Efficiency with Simple Shard Sharing).
+If you already have a Bone checkpoint, you can use `/scripts/convert-bone-to-miss.py` to convert it into a MiSS checkpoint and proceed with training using MiSS.
+
+## MiSS
+[MiSS](https://huggingface.co/papers/2409.15371) (Matrix Shard Sharing) is a novel Parameter-Efficient Fine-Tuning (PEFT) method designed to address the trade-off between adaptability and efficiency in Large Language Models. The core approach of MiSS involves a simple shard-sharing mechanism. It achieves low-rank adaptation by decomposing a weight matrix into multiple fragments and then utilizing a shared, trainable "common fragment." The final low-rank update matrix is constructed by replicating these shared, partitioned shards. In short, MiSS adopts a low-rank structure, requires only a single trainable matrix, and introduces a new update mechanism distinct from LoRA, achieving an excellent balance between performance and efficiency.
+
+<small><a href="https://huggingface.co/papers/2409.15371">MiSS: Balancing LoRA Performance and Efficiency with Simple Shard Sharing</a></small>
+
+Intuitively, the shape of the single trainable matrix in MiSS is consistent with `lora_B`, so for the same `r`, MiSS has `in_features * r` fewer trainable parameters than LoRA.
+
+Note: Bat's r (b) is special and requires that weight W satisfies the conditions `in_features % r == 0` and `out_features % r == 0`. Additionally, when `in_features == out_features` and MiSS-r equals LoRA-r, MiSS's number of trainable parameters is only half that of LoRA.
+
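The parameter arithmetic can be spelled out in a few lines. The helper names are hypothetical, and the MiSS count relies only on the statement above that its single trainable matrix has the same shape as `lora_B`.

```python
def lora_params(in_features, out_features, r):
    # lora_A has shape (r, in_features), lora_B has shape (out_features, r).
    return r * in_features + out_features * r


def miss_params(in_features, out_features, r):
    # Assumption taken from the text: a single trainable matrix with the
    # same shape as lora_B, i.e. (out_features, r).
    return out_features * r


in_features = out_features = 4096
r = 64
n_lora = lora_params(in_features, out_features, r)
n_miss = miss_params(in_features, out_features, r)
saved = n_lora - n_miss     # equals in_features * r
```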
+Although the nonlinear updates of Bat bring some performance improvements, they also increase computational overhead. Its main purpose is to provide researchers with a direction for improvement. Therefore, we recommend fine-tuning the comprehensive MiSS model instead.
--- a/docs/source/conceptual_guides/ia3.mdx
+++ b/docs/source/conceptual_guides/ia3.mdx
@ -8,11 +8,15 @@ http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
 -->

 # IA3 

-This conceptual guide gives a brief overview of [IA3](https://arxiv.org/abs/2205.05638), a parameter-efficient fine tuning technique that is 
+This conceptual guide gives a brief overview of [IA3](https://huggingface.co/papers/2205.05638), a parameter-efficient fine tuning technique that is 
 intended to improve over [LoRA](./lora).

 To make fine-tuning more efficient, IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) 
@ -28,10 +32,13 @@ Being similar to LoRA, IA3 carries many of the same advantages:
 * Performance of models fine-tuned using IA3 is comparable to the performance of fully fine-tuned models.
 * IA3 does not add any inference latency because adapter weights can be merged with the base model.

-In principle, IA3 can be applied to any subset of weight matrices in a neural network to reduce the number of trainable 
-parameters. Following the authors' implementation, IA3 weights are added to the key, value and feedforward layers 
-of a Transformer model. Given the target layers for injecting IA3 parameters, the number of trainable parameters 
-can be determined based on the size of the weight matrices. 
+In principle, IA3 can be applied to any subset of weight matrices in a neural network to reduce the number of trainable
+parameters. Following the authors' implementation, IA3 weights are added to the key, value and feedforward layers
+of a Transformer model. To be specific, for transformer models, IA3 weights are added to the outputs of key and value layers, and to the input of the second feedforward layer
+in each transformer block.
+
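A minimal NumPy sketch of the two placements, assuming a single linear layer with weight `W` and hypothetical learned vectors `l_out` and `l_in`. It also shows why the vectors can be folded into the weights, which is what makes merging possible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 4, 6, 5
x = rng.normal(size=(n_tokens, d_in))
W = rng.normal(size=(d_out, d_in))     # frozen weight of a linear layer

# Key/value layers: the learned vector rescales the layer *output*.
l_out = rng.normal(size=d_out)
y_kv = (x @ W.T) * l_out
W_kv_merged = W * l_out[:, None]       # fold the vector into the rows of W

# Second feedforward layer: the learned vector rescales the layer *input*.
l_in = rng.normal(size=d_in)
y_ff = (x * l_in) @ W.T
W_ff_merged = W * l_in[None, :]        # fold the vector into the columns of W
```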
+Given the target layers for injecting IA3 parameters, the number of trainable parameters
+can be determined based on the size of the weight matrices.


 ## Common IA3 parameters in PEFT
@ -43,10 +50,19 @@ As with other methods supported by PEFT, to fine-tune a model using IA3, you nee
 3. Wrap the base model with `get_peft_model()` to get a trainable `PeftModel`.
 4. Train the `PeftModel` as you normally would train the base model.

-`IA3Config` allows you to control how IA3 is applied to the base model through the following parameters: 
+`IA3Config` allows you to control how IA3 is applied to the base model through the following parameters:

 - `target_modules`: The modules (for example, attention blocks) to apply the IA3 vectors.
- `feedforward_modules`: The list of modules to be treated as feedforward layers in `target_modules`. While learned vectors are multiplied with 
-the output activation for attention blocks, the vectors are multiplied with the input for classic feedforward layers.
+- `feedforward_modules`: The list of modules to be treated as feedforward layers in `target_modules`. While learned vectors are multiplied with
+the output activation for attention blocks, the vectors are multiplied with the input for classic feedforward layers. Note that `feedforward_modules` must be a subset of `target_modules`.
 - `modules_to_save`: List of modules apart from IA3 layers to be set as trainable and saved in the final checkpoint. These typically include model's custom head that is randomly initialized for the fine-tuning task.

+## Example Usage
+
+For the task of sequence classification, one can initialize the IA3 config for a Llama model as follows:
+
+```py
+from peft import IA3Config, TaskType
+
+peft_config = IA3Config(
+    task_type=TaskType.SEQ_CLS, target_modules=["k_proj", "v_proj", "down_proj"], feedforward_modules=["down_proj"]
+)
+```
--- a/docs/source/conceptual_guides/lora.mdx
+++ b/docs/source/conceptual_guides/lora.mdx
@ -1,91 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# LoRA 
-
-This conceptual guide gives a brief overview of [LoRA](https://arxiv.org/abs/2106.09685), a technique that accelerates 
-the fine-tuning of large models while consuming less memory. 
-
-To make fine-tuning more efficient, LoRA's approach is to represent the weight updates with two smaller 
-matrices (called **update matrices**) through low-rank decomposition. These new matrices can be trained to adapt to the 
-new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive 
-any further adjustments. To produce the final results, both the original and the adapted weights are combined.
-
-This approach has a number of advantages: 
-
-* LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
-* The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
-* LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.
-* Performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.
-* LoRA does not add any inference latency because adapter weights can be merged with the base model.
-
-In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable 
-parameters. However, for simplicity and further parameter efficiency, in Transformer models LoRA is typically applied to 
-attention blocks only. The resulting number of trainable parameters in a LoRA model depends on the size of the low-rank 
-update matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.
-
-## Merge LoRA weights into the base model
-
-While LoRA is significantly smaller and faster to train, you may encounter latency issues during inference due to separately loading the base model and the LoRA model. To eliminate latency, use the [`~LoraModel.merge_and_unload`] function to merge the adapter weights with the base model which allows you to effectively use the newly merged model as a standalone model.
-
-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png"/>
-</div>
-
-This works because during training, the smaller weight matrices (*A* and *B* in the diagram above) are separate. But once training is complete, the weights can actually be merged into a new weight matrix that is identical.
-
-## Utils for LoRA
-
-Use [`~LoraModel.merge_adapter`] to merge the LoRa layers into the base model while retaining the PeftModel.
-This will help in later unmerging, deleting, loading different adapters and so on.
-
-Use [`~LoraModel.unmerge_adapter`] to unmerge the LoRa layers from the base model while retaining the PeftModel.
-This will help in later merging, deleting, loading different adapters and so on.
-
-Use [`~LoraModel.unload`] to get back the base model without the merging of the active lora modules. 
-This will help when you want to get back the pretrained base model in some applications when you want to reset the model to its original state.
-For example, in Stable Diffusion WebUi, when the user wants to infer with base model post trying out LoRAs.
-
-Use [`~LoraModel.delete_adapter`] to delete an existing adapter.
-
-Use [`~LoraModel.add_weighted_adapter`] to combine multiple LoRAs into a new adapter based on the user provided weighing scheme.
-
-## Common LoRA parameters in PEFT
-
-As with other methods supported by PEFT, to fine-tune a model using LoRA, you need to:
-
-1. Instantiate a base model.
-2. Create a configuration (`LoraConfig`) where you define LoRA-specific parameters.
-3. Wrap the base model with `get_peft_model()` to get a trainable `PeftModel`.
-4. Train the `PeftModel` as you normally would train the base model.
-
-`LoraConfig` allows you to control how LoRA is applied to the base model through the following parameters: 
-
- `r`: the rank of the update matrices, expressed in `int`. Lower rank results in smaller update matrices with fewer trainable parameters.
- `target_modules`: The modules (for example, attention blocks) to apply the LoRA update matrices.
- `alpha`: LoRA scaling factor.
- `bias`: Specifies if the `bias` parameters should be trained. Can be `'none'`, `'all'` or `'lora_only'`.
- `modules_to_save`: List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. These typically include model's custom head that is randomly initialized for the fine-tuning task.
- `layers_to_transform`: List of layers to be transformed by LoRA. If not specified, all layers in `target_modules` are transformed.
- `layers_pattern`: Pattern to match layer names in `target_modules`, if `layers_to_transform` is specified. By default `PeftModel` will look at common layer pattern (`layers`, `h`, `blocks`, etc.), use it for exotic and custom models.
- `rank_pattern`: The mapping from layer names or regexp expression to ranks which are different from the default rank specified by `r`.
- `alpha_pattern`: The mapping from layer names or regexp expression to alphas which are different from the default alpha specified by `lora_alpha`.
-
-## LoRA examples
-
-For an example of LoRA method application to various downstream tasks, please refer to the following guides:
-
-* [Image classification using LoRA](../task_guides/image_classification_lora)
-* [Semantic segmentation](../task_guides/semantic_segmentation_lora)
-
-While the original paper focuses on language models, the technique can be applied to any dense layers in deep learning 
-models. As such, you can leverage this technique with diffusion models. See [Dreambooth fine-tuning with LoRA](../task_guides/task_guides/dreambooth_lora) task guide for an example.
--- a/docs/source/conceptual_guides/oft.md
+++ b/docs/source/conceptual_guides/oft.md
@ -0,0 +1,165 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Orthogonal Finetuning (OFT and BOFT) 
+
+This conceptual guide gives a brief overview of [OFT](https://huggingface.co/papers/2306.07280), [OFTv2](https://www.arxiv.org/abs/2506.19847) and [BOFT](https://huggingface.co/papers/2311.06243), parameter-efficient fine-tuning techniques that use an orthogonal matrix to multiplicatively transform the pretrained weight matrices.
+
+To achieve efficient fine-tuning, OFT represents the weight updates with an orthogonal transformation. The orthogonal transformation is parameterized by an orthogonal matrix that is multiplied with the pretrained weight matrix. These new matrices can be trained to adapt to the new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive any further adjustments. To produce the final results, the original and the adapted weights are multiplied together.
+
+Orthogonal Butterfly (BOFT) generalizes OFT with Butterfly factorization and further improves its parameter efficiency and finetuning flexibility. In short, OFT can be viewed as a special case of BOFT. Unlike LoRA, which uses additive low-rank weight updates, BOFT uses multiplicative orthogonal weight updates. The comparison is shown below.
+
+<div class="flex justify-center">
+    <img src="https://raw.githubusercontent.com/wy1iu/butterfly-oft/main/assets/BOFT_comparison.png"/>
+</div>
+
+
+BOFT has some advantages compared to LoRA: 
+
+* BOFT proposes a simple yet generic way to finetune pretrained models to downstream tasks, yielding a better preservation of pretraining knowledge and a better parameter efficiency.
+* Through the orthogonality, BOFT introduces a structural constraint, i.e., keeping the [hyperspherical energy](https://huggingface.co/papers/1805.09298) unchanged during finetuning. This can effectively reduce the forgetting of pretraining knowledge.
+* BOFT uses the butterfly factorization to efficiently parameterize the orthogonal matrix, which yields a compact yet expressive learning space (i.e., hypothesis class).
+* The sparse matrix decomposition in BOFT brings in additional inductive biases that are beneficial to generalization.
+
+In principle, BOFT can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. Given the target layers for injecting BOFT parameters, the number of trainable parameters can be determined based on the size of the weight matrices.
+
+## Merge OFT/BOFT weights into the base model
+
+Similar to LoRA, the weights learned by OFT/BOFT can be integrated into the pretrained weight matrices using the `merge_and_unload()` function. This function merges the adapter weights with the base model which allows you to effectively use the newly merged model as a standalone model.
+
+<div class="flex justify-center">
+    <img src="https://raw.githubusercontent.com/wy1iu/butterfly-oft/main/assets/boft_merge.png"/>
+</div>
+
+This works because during training, the orthogonal weight matrix (R in the diagram above) and the pretrained weight matrices are separate. But once training is complete, these weights can actually be merged (multiplied) into a new weight matrix that is equivalent.
+
+## Utils for OFT / BOFT
+
+### Common OFT / BOFT parameters in PEFT
+
+As with other methods supported by PEFT, to fine-tune a model using OFT or BOFT, you need to:
+
+1. Instantiate a base model.
+2. Create a configuration (`OFTConfig` or `BOFTConfig`) where you define OFT/BOFT-specific parameters.
+3. Wrap the base model with `get_peft_model()` to get a trainable `PeftModel`.
+4. Train the `PeftModel` as you normally would train the base model.
+
+
+### OFT-specific parameters
+
+`OFTConfig` allows you to control how OFT is applied to the base model through the following parameters:
+
+- `r`: OFT rank, number of OFT blocks per injected layer. **Bigger** `r` results in more sparse update matrices with **fewer** trainable parameters. **Note**: You can only specify either `r` or `oft_block_size`, but not both simultaneously, because `r` × `oft_block_size` = layer dimension. For simplicity, we let the user specify either `r` or `oft_block_size` and infer the other one. The default is `r = 0`; the user is advised to set `oft_block_size` instead for better clarity.
+- `oft_block_size`: OFT block size across different layers. **Bigger** `oft_block_size` results in more dense update matrices with **more** trainable parameters. **Note**: Please choose `oft_block_size` so that it evenly divides the layer's input dimension (`in_features`), e.g., 4, 8, 16. You can only specify either `r` or `oft_block_size`, but not both simultaneously, because `r` × `oft_block_size` = layer dimension. For simplicity, we let the user specify either `r` or `oft_block_size` and infer the other one. The default is `oft_block_size = 32`.
+- `use_cayley_neumann`: Specifies whether to use the Cayley-Neumann parameterization (efficient but approximate) or the vanilla Cayley parameterization (exact but computationally expensive because of the matrix inverse). We recommend setting it to `True` for better efficiency, but performance may be slightly worse because of the approximation error. Please test both settings (`True` and `False`) depending on your needs. Default is `False`.
+- `module_dropout`: The multiplicative dropout probability. Dropout sets OFT blocks to identity during training, similar to the dropout layer in LoRA.
+- `bias`: specify if the `bias` parameters should be trained. Can be `"none"`, `"all"` or `"oft_only"`.
+- `target_modules`: The modules (for example, attention blocks) to inject the OFT matrices.
+- `modules_to_save`: List of modules apart from OFT matrices to be set as trainable and saved in the final checkpoint. These typically include model's custom head that is randomly initialized for the fine-tuning task.
+
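The relation `r` × `oft_block_size` = layer dimension, and its effect on the parameter count, can be sketched as below. Both helpers are hypothetical and not part of PEFT. The count assumes each block is parameterized by a skew-symmetric matrix, as in the Cayley parameterization.

```python
def resolve_oft_blocks(in_features, r=0, oft_block_size=0):
    # Hypothetical helper (not part of PEFT): infer the missing value from
    # r * oft_block_size == in_features.
    if (r == 0) == (oft_block_size == 0):
        raise ValueError("specify exactly one of r or oft_block_size")
    if r == 0:
        r, rem = divmod(in_features, oft_block_size)
    else:
        oft_block_size, rem = divmod(in_features, r)
    if rem:
        raise ValueError("in_features must be divisible by the given value")
    return r, oft_block_size


def skew_params(r, oft_block_size):
    # Assumption: a skew-symmetric block of size b has b * (b - 1) / 2 free entries.
    return r * oft_block_size * (oft_block_size - 1) // 2


few_blocks = resolve_oft_blocks(1024, oft_block_size=64)    # fewer, larger blocks
many_blocks = resolve_oft_blocks(1024, r=64)                # more, smaller blocks
```

Under these assumptions, the configuration with more blocks yields the smaller parameter count, which matches the **bigger** `r`, **fewer** parameters rule above.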
+### BOFT-specific parameters
+
+`BOFTConfig` allows you to control how BOFT is applied to the base model through the following parameters:
+
+- `boft_block_size`: the BOFT matrix block size across different layers, expressed in `int`. **Bigger** `boft_block_size` results in more dense update matrices with **more** trainable parameters. **Note**, please choose `boft_block_size` so that it evenly divides most layers' input dimension (`in_features`), e.g., 4, 8, 16. Also, please only
+specify either `boft_block_size` or `boft_block_num`, but not both simultaneously, and do not leave both at 0, because `boft_block_size` × `boft_block_num` must equal the layer's input dimension.
+- `boft_block_num`: the number of BOFT matrix blocks across different layers, expressed in `int`. **Bigger** `boft_block_num` results in sparser update matrices with **fewer** trainable parameters. **Note**, please choose `boft_block_num` so that it evenly divides most layers' input dimension (`in_features`), e.g., 4, 8, 16. Also, please only
+specify either `boft_block_size` or `boft_block_num`, but not both simultaneously, and do not leave both at 0, because `boft_block_size` × `boft_block_num` must equal the layer's input dimension.
+- `boft_n_butterfly_factor`: the number of butterfly factors. **Note**, for `boft_n_butterfly_factor=1`, BOFT is the same as vanilla OFT; for `boft_n_butterfly_factor=2`, the effective block size of OFT becomes twice as big and the number of blocks is halved.
+- `bias`: specify if the `bias` parameters should be trained. Can be `"none"`, `"all"` or `"boft_only"`.
+- `boft_dropout`: specify the probability of multiplicative dropout.
+- `target_modules`: The modules (for example, attention blocks) to inject the OFT/BOFT matrices.
+- `modules_to_save`: List of modules apart from OFT/BOFT matrices to be set as trainable and saved in the final checkpoint. These typically include model's custom head that is randomly initialized for the fine-tuning task.
+
+
+
+## OFT Example Usage
+
+To use OFT for quantized finetuning with [TRL](https://github.com/huggingface/trl) for `SFT`, `PPO`, or `DPO` fine-tuning, follow this outline:
+
+```py
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+from trl import SFTTrainer
+from peft import OFTConfig
+
+# `use_quantization`, `ds`, `training_arguments` and `collator` are placeholders
+# that you need to define for your own setup.
+bnb_config = None  # no quantization unless enabled below
+if use_quantization:
+    bnb_config = BitsAndBytesConfig(
+        load_in_4bit=True,
+        bnb_4bit_quant_type="nf4",
+        bnb_4bit_compute_dtype=torch.bfloat16,
+        bnb_4bit_use_double_quant=True,
+        bnb_4bit_quant_storage=torch.bfloat16,
+    )
+
+model = AutoModelForCausalLM.from_pretrained(
+    "model_name",
+    quantization_config=bnb_config
+)
+tokenizer = AutoTokenizer.from_pretrained("model_name")
+
+# Configure OFT
+peft_config = OFTConfig(
+    oft_block_size=32,
+    use_cayley_neumann=True,
+    target_modules="all-linear",
+    bias="none",
+    task_type="CAUSAL_LM"
+)
+
+trainer = SFTTrainer(
+    model=model,
+    train_dataset=ds['train'],
+    peft_config=peft_config,
+    processing_class=tokenizer,
+    args=training_arguments,
+    data_collator=collator,
+)
+
+trainer.train()
+```
+
+
+## BOFT Example Usage
+
+Take a look at the following step-by-step guides on how to finetune a model with BOFT on various downstream tasks:
+- [Dreambooth finetuning with BOFT](https://github.com/huggingface/peft/blob/main/examples/boft_dreambooth/boft_dreambooth.md)
+- [Controllable generation finetuning with BOFT (ControlNet)](https://github.com/huggingface/peft/blob/main/examples/boft_controlnet/boft_controlnet.md)
+
+For the task of image classification, one can initialize the BOFT config for a DinoV2 model as follows:
+
+```py
+import transformers
+from peft import BOFTConfig, get_peft_model
+
+config = BOFTConfig(
+    boft_block_size=4,
+    boft_n_butterfly_factor=2,
+    target_modules=["query", "value", "key", "output.dense", "mlp.fc1", "mlp.fc2"],
+    boft_dropout=0.1,
+    bias="boft_only",
+    modules_to_save=["classifier"],
+)
+
+model = transformers.Dinov2ForImageClassification.from_pretrained(
+    "facebook/dinov2-large",
+    num_labels=100,
+)
+
+boft_model = get_peft_model(model, config)
+```
--- a/docs/source/conceptual_guides/prompting.mdx
+++ b/docs/source/conceptual_guides/prompting.mdx
@ -1,4 +1,8 @@
-# Prompting
+<!--⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Soft prompts

 Training large pretrained language models is very time-consuming and compute-intensive. As they continue to grow in size, there is increasing interest in more efficient training methods such as *prompting*. Prompting primes a frozen pretrained model for a specific downstream task by including a text prompt that describes the task or even demonstrates an example of the task. With prompting, you can avoid fully training a separate model for each downstream task, and use the same frozen pretrained model instead. This is a lot easier because you can use the same model for several different tasks, and it is significantly more efficient to train and store a smaller set of prompt parameters than to train all the model's parameters.

@ -7,16 +11,16 @@ There are two categories of prompting methods:
 - hard prompts are manually handcrafted text prompts with discrete input tokens; the downside is that it requires a lot of effort to create a good prompt
 - soft prompts are learnable tensors concatenated with the input embeddings that can be optimized to a dataset; the downside is that they aren't human readable because you aren't matching these "virtual tokens" to the embeddings of a real word

-This conceptual guide provides a brief overview of the soft prompt methods included in 🤗 PEFT: prompt tuning, prefix tuning, and P-tuning.
+This conceptual guide provides a brief overview of the soft prompt methods included in 🤗 PEFT: prompt tuning, prefix tuning, P-tuning, and multitask prompt tuning.

 ## Prompt tuning

 <div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/prompt-tuning.png"/>
 </div>
-<small>Only train and store a significantly smaller set of task-specific prompt parameters <a href="https://arxiv.org/abs/2104.08691">(image source)</a>.</small>
+<small>Only train and store a significantly smaller set of task-specific prompt parameters <a href="https://hf.co/papers/2104.08691">(image source)</a>.</small>

-Prompt tuning was developed for text classification tasks on T5 models, and all downstream tasks are cast as a text generation task. For example, sequence classification usually assigns a single class label to a sequence of text. By casting it as a text generation task, the tokens that make up the class label are *generated*. Prompts are added to the input as a series of tokens. Typically, the model parameters are fixed which means the prompt tokens are also fixed by the model parameters.
+[Prompt tuning](https://hf.co/papers/2104.08691) was developed for text classification tasks on T5 models, and all downstream tasks are cast as a text generation task. For example, sequence classification usually assigns a single class label to a sequence of text. By casting it as a text generation task, the tokens that make up the class label are *generated*. Prompts are added to the input as a series of tokens. Typically, the model parameters are fixed which means the prompt tokens are also fixed by the model parameters.

 The key idea behind prompt tuning is that prompt tokens have their own parameters that are updated independently. This means you can keep the pretrained model's parameters frozen, and only update the gradients of the prompt token embeddings. The results are comparable to the traditional method of training the entire model, and prompt tuning performance scales as model size increases.
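
The idea above can be illustrated with a small, library-free sketch. Plain Python lists stand in for tensors, and `prepend_soft_prompt` is a hypothetical helper name, not a PEFT function; the only point is that trainable "virtual token" embeddings are concatenated in front of the embeddings produced by the frozen model.

```python
# Conceptual sketch only: lists of floats stand in for embedding tensors.
def prepend_soft_prompt(soft_prompt, input_embeddings):
    """Concatenate trainable virtual-token embeddings in front of the input embeddings."""
    hidden_size = len(input_embeddings[0])
    if any(len(vec) != hidden_size for vec in soft_prompt):
        raise ValueError("soft prompt and input embeddings must share the hidden size")
    return [list(vec) for vec in soft_prompt] + [list(vec) for vec in input_embeddings]


# 2 virtual tokens (trainable) followed by 3 real tokens (from the frozen embedding layer)
soft_prompt = [[0.1] * 4, [0.2] * 4]
input_embeddings = [[1.0] * 4, [2.0] * 4, [3.0] * 4]

model_input = prepend_soft_prompt(soft_prompt, input_embeddings)
print(len(model_input), len(model_input[0]))  # sequence length, hidden size
```

During training, only the values in `soft_prompt` would receive updates; everything else stays frozen.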

@ -27,9 +31,9 @@ Take a look at [Prompt tuning for causal language modeling](../task_guides/clm-p
 <div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/prefix-tuning.png"/>
 </div>
-<small>Optimize the prefix parameters for each task <a href="https://arxiv.org/abs/2101.00190">(image source)</a>.</small>
+<small>Optimize the prefix parameters for each task <a href="https://hf.co/papers/2101.00190">(image source)</a>.</small>

-Prefix tuning was designed for natural language generation (NLG) tasks on GPT models. It is very similar to prompt tuning; prefix tuning also prepends a sequence of task-specific vectors to the input that can be trained and updated while keeping the rest of the pretrained model's parameters frozen. 
+[Prefix tuning](https://hf.co/papers/2101.00190) was designed for natural language generation (NLG) tasks on GPT models. It is very similar to prompt tuning; prefix tuning also prepends a sequence of task-specific vectors to the input that can be trained and updated while keeping the rest of the pretrained model's parameters frozen. 

 The main difference is that the prefix parameters are inserted in **all** of the model layers, whereas prompt tuning only adds the prompt parameters to the model input embeddings. The prefix parameters are also optimized by a separate feed-forward network (FFN) instead of training directly on the soft prompts because it causes instability and hurts performance. The FFN is discarded after updating the soft prompts.

@ -42,9 +46,9 @@ Take a look at [Prefix tuning for conditional generation](../task_guides/seq2seq
 <div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/p-tuning.png"/>
 </div>
-<small>Prompt tokens can be inserted anywhere in the input sequence, and they are optimized by a prompt encoder <a href="https://arxiv.org/abs/2103.10385">(image source)</a>.</small>
+<small>Prompt tokens can be inserted anywhere in the input sequence, and they are optimized by a prompt encoder <a href="https://hf.co/papers/2103.10385">(image source)</a>.</small>

-P-tuning is designed for natural language understanding (NLU) tasks and all language models. 
+[P-tuning](https://hf.co/papers/2103.10385) is designed for natural language understanding (NLU) tasks and all language models. 
 It is another variation of a soft prompt method; P-tuning also adds a trainable embedding tensor that can be optimized to find better prompts, and it uses a prompt encoder (a bidirectional long short-term memory network or LSTM) to optimize the prompt parameters. Unlike prefix tuning though:

 - the prompt tokens can be inserted anywhere in the input sequence, and it isn't restricted to only the beginning
@ -53,4 +57,37 @@ It is another variation of a soft prompt method; P-tuning also adds a trainable

 The results suggest that P-tuning is more efficient than manually crafting prompts, and it enables GPT-like models to compete with BERT-like models on NLU tasks.

-Take a look at [P-tuning for sequence classification](../task_guides/ptuning-seq-classification) for a step-by-step guide on how to train a model with P-tuning.
+Take a look at [P-tuning for sequence classification](../task_guides/ptuning-seq-classification) for a step-by-step guide on how to train a model with P-tuning.
+
+## Multitask prompt tuning
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/mpt.png"/>
+</div>
+<small><a href="https://hf.co/papers/2303.02861">Multitask prompt tuning enables parameter-efficient transfer learning</a>.</small>
+
+[Multitask prompt tuning (MPT)](https://hf.co/papers/2303.02861) learns a single prompt from data for multiple task types that can be shared for different target tasks. Other existing approaches learn a separate soft prompt for each task, and these prompts need to be retrieved or aggregated for adaptation to target tasks. MPT consists of two stages:
+
+1. source training - for each task, its soft prompt is decomposed into task-specific vectors. The task-specific vectors are multiplied together to form another matrix W, and the Hadamard product is used between W and a shared prompt matrix P to generate a task-specific prompt matrix. The task-specific prompts are distilled into a single prompt matrix that is shared across all tasks. This prompt is trained with multitask training.
+2. target adaptation - to adapt the single prompt for a target task, a target prompt is initialized and expressed as the Hadamard product of the shared prompt matrix and the task-specific low-rank prompt matrix.
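
The decomposition in the two stages above can be sketched without any library. This is a minimal illustration that assumes the task-specific matrix W is the outer product of two task-specific vectors `u` and `v`; the helper names and the toy numbers are hypothetical and not taken from PEFT.

```python
def outer(u, v):
    """Outer product of two vectors, giving a rank-one matrix."""
    return [[ui * vj for vj in v] for ui in u]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def task_prompt(shared_prompt, u, v):
    """Task-specific prompt = shared prompt P (Hadamard) task-specific matrix W."""
    w = outer(u, v)
    return hadamard(shared_prompt, w)


shared = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # P: 2 virtual tokens x hidden size 3
u = [1.0, 0.5]       # task-specific vector over virtual tokens
v = [2.0, 1.0, 0.0]  # task-specific vector over hidden dimensions

prompt = task_prompt(shared, u, v)
print(prompt)
```

Only `u` and `v` are specific to a task here, which is why adapting to a new target task needs far fewer parameters than learning a full prompt matrix.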
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/mpt-decomposition.png"/>
+</div>
+<small><a href="https://hf.co/papers/2303.02861">Prompt decomposition</a>.</small>
+
+
+## Context-Aware Prompt Tuning (CPT)
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/cpt.png"/>
+</div>
+<small>CPT optimizes only specific token embeddings while keeping the rest of the model frozen <a href="https://huggingface.co/papers/2410.17222">(image source)</a>.</small>
+
+[Context-Aware Prompt Tuning (CPT)](https://huggingface.co/papers/2410.17222) is designed to enhance few-shot classification by refining only context embeddings. 
+This approach combines ideas from In-Context Learning (ICL), Prompt Tuning (PT), and adversarial optimization, focusing on making model adaptation both parameter-efficient and effective.
+In CPT, only specific context token embeddings are optimized, while the rest of the model remains frozen. 
+To prevent overfitting and maintain stability, CPT uses controlled perturbations to limit the allowed changes to context embeddings within a defined range. 
+Additionally, to address the phenomenon of recency bias—where examples near the end of the context tend to be prioritized over earlier ones—CPT applies a decay loss factor.
+
+Take a look at the [CPT finetuning example](https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
--- a/docs/source/developer_guides/checkpoint.md
+++ b/docs/source/developer_guides/checkpoint.md
@ -0,0 +1,244 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# PEFT checkpoint format
+
+This document describes how PEFT's checkpoint files are structured and how to convert between the PEFT format and other formats.
+
+## PEFT files
+
+PEFT (parameter-efficient fine-tuning) methods only update a small subset of a model's parameters rather than all of them. This is nice because checkpoint files can generally be much smaller than the original model files and are easier to store and share. However, this also means that to load a PEFT model, you need to have the original model available as well.
+
+When you call [`~PeftModel.save_pretrained`] on a PEFT model, the PEFT model saves three files, described below:
+
+1. `adapter_model.safetensors` or `adapter_model.bin`
+
+By default, the model is saved in the `safetensors` format, a secure alternative to the `bin` format, which is known to be susceptible to [security vulnerabilities](https://huggingface.co/docs/hub/security-pickle) because it uses the pickle utility under the hood. Both formats store the same `state_dict` though, and are interchangeable.
+
+The `state_dict` only contains the parameters of the adapter module, not the base model. To illustrate the difference in size, a normal BERT model requires ~420MB of disk space, whereas an IA³ adapter on top of this BERT model only requires ~260KB.
+
+2. `adapter_config.json`
+
+The `adapter_config.json` file contains the configuration of the adapter module, which is necessary to load the model. Below is an example of an `adapter_config.json` for an IA³ adapter with standard settings applied to a BERT model:
+
+```json
+{
+  "auto_mapping": {
+    "base_model_class": "BertModel",
+    "parent_library": "transformers.models.bert.modeling_bert"
+  },
+  "base_model_name_or_path": "bert-base-uncased",
+  "fan_in_fan_out": false,
+  "feedforward_modules": [
+    "output.dense"
+  ],
+  "inference_mode": true,
+  "init_ia3_weights": true,
+  "modules_to_save": null,
+  "peft_type": "IA3",
+  "revision": null,
+  "target_modules": [
+    "key",
+    "value",
+    "output.dense"
+  ],
+  "task_type": null
+}
+```
+
+The configuration file contains:
+
+- the adapter module type stored, `"peft_type": "IA3"`
+- information about the base model like `"base_model_name_or_path": "bert-base-uncased"`
+- the revision of the model (if any), `"revision": null`
+
+If the base model is not a pretrained Transformers model, the latter two entries will be `null`. Other than that, the settings are all related to the specific IA³ adapter that was used to fine-tune the model.
+
+3. `README.md`
+
+The generated `README.md` is the model card of a PEFT model and contains a few pre-filled entries. The intent of this is to make it easier to share the model with others and to provide some basic information about the model. This file is not needed to load the model.
+
+## Convert to PEFT format
+
+When converting from another format to the PEFT format, we require both the `adapter_model.safetensors` (or `adapter_model.bin`) file and the `adapter_config.json` file.
+
+### adapter_model
+
+For the model weights, it is important to use the correct mapping from parameter name to value for PEFT to load the file. Getting this mapping right is an exercise in checking the implementation details, as there is no generally agreed upon format for PEFT adapters.
+
+Fortunately, figuring out this mapping is not overly complicated for common base cases. Let's look at a concrete example, the [`LoraLayer`](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py):
+
+```python
+# showing only part of the code
+
+class LoraLayer(BaseTunerLayer):
+    # All names of layers that may contain (trainable) adapter weights
+    adapter_layer_names = ("lora_A", "lora_B", "lora_embedding_A", "lora_embedding_B")
+    # All names of other parameters that may contain adapter-related parameters
+    other_param_names = ("r", "lora_alpha", "scaling", "lora_dropout")
+
+    def __init__(self, base_layer: nn.Module, **kwargs) -> None:
+        self.base_layer = base_layer
+        self.r = {}
+        self.lora_alpha = {}
+        self.scaling = {}
+        self.lora_dropout = nn.ModuleDict({})
+        self.lora_A = nn.ModuleDict({})
+        self.lora_B = nn.ModuleDict({})
+        # For Embedding layer
+        self.lora_embedding_A = nn.ParameterDict({})
+        self.lora_embedding_B = nn.ParameterDict({})
+        # Mark the weight as unmerged
+        self._disable_adapters = False
+        self.merged_adapters = []
+        self.use_dora: dict[str, bool] = {}
+        self.lora_magnitude_vector: Optional[torch.nn.ParameterDict] = None  # for DoRA
+        self._caches: dict[str, Any] = {}
+        self.kwargs = kwargs
+```
+
+In the `__init__` code used by all `LoraLayer` classes in PEFT, there are a bunch of parameters used to initialize the model, but only a few are relevant for the checkpoint file: `lora_A`, `lora_B`, `lora_embedding_A`, and `lora_embedding_B`. These parameters are listed in the class attribute `adapter_layer_names` and contain the learnable parameters, so they must be included in the checkpoint file. All the other parameters, like the rank `r`, are derived from the `adapter_config.json` and must be included there (unless the default value is used).
+
+Let's check the `state_dict` of a PEFT LoRA model applied to BERT. When printing the first five keys using the default LoRA settings (the remaining keys are the same, just with different layer numbers), we get:
+
+- `base_model.model.encoder.layer.0.attention.self.query.lora_A.weight` 
+- `base_model.model.encoder.layer.0.attention.self.query.lora_B.weight` 
+- `base_model.model.encoder.layer.0.attention.self.value.lora_A.weight` 
+- `base_model.model.encoder.layer.0.attention.self.value.lora_B.weight` 
+- `base_model.model.encoder.layer.1.attention.self.query.lora_A.weight`
+- etc.
+
+Let's break this down:
+
+- By default, for BERT models, LoRA is applied to the `query` and `value` layers of the attention module. This is why you see `attention.self.query` and `attention.self.value` in the key names for each layer.
+- LoRA decomposes the weights into two low-rank matrices, `lora_A` and `lora_B`. This is where `lora_A` and `lora_B` come from in the key names.
+- These LoRA matrices are implemented as `nn.Linear` layers, so the parameters are stored in the `.weight` attribute (`lora_A.weight`, `lora_B.weight`).
+- By default, LoRA isn't applied to BERT's embedding layer, so there are _no entries_ for `lora_embedding_A` and `lora_embedding_B`.
+- The keys of the `state_dict` always start with `"base_model.model."`. The reason is that, in PEFT, we wrap the base model inside a tuner-specific model (`LoraModel` in this case), which itself is wrapped in a general PEFT model (`PeftModel`). For this reason, these two prefixes are added to the keys. When converting to the PEFT format, it is required to add these prefixes.
+
+> [!TIP]
+> This last point is not true for prompt learning techniques like prompt tuning. There, the extra embeddings are directly stored in the `state_dict` without any prefixes added to the keys.
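
As a rough sketch of the prefix requirement described above, the helper below adds `base_model.model.` to every key of a `state_dict`-like mapping. The function name `add_peft_prefix` and the layout of the "foreign" checkpoint are assumptions for illustration; a real conversion may also need to rename the adapter-specific parts of each key.

```python
PEFT_PREFIX = "base_model.model."


def add_peft_prefix(state_dict):
    """Return a copy of the mapping whose keys all start with the PEFT prefix."""
    return {
        (key if key.startswith(PEFT_PREFIX) else PEFT_PREFIX + key): value
        for key, value in state_dict.items()
    }


# Hypothetical checkpoint from another format; strings stand in for tensors.
foreign = {
    "encoder.layer.0.attention.self.query.lora_A.weight": "tensor-A",
    "encoder.layer.0.attention.self.query.lora_B.weight": "tensor-B",
}

converted = add_peft_prefix(foreign)
for key in converted:
    print(key)
```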
+
+When inspecting the parameter names in the loaded model, you might be surprised to find that they look a bit different, e.g. `base_model.model.encoder.layer.0.attention.self.query.lora_A.default.weight`. The difference is the *`.default`* part in the second to last segment. This part exists because PEFT generally allows the addition of multiple adapters at once (using an `nn.ModuleDict` or `nn.ParameterDict` to store them). For example, if you add another adapter called "other", the key for that adapter would be `base_model.model.encoder.layer.0.attention.self.query.lora_A.other.weight`.
+
+When you call [`~PeftModel.save_pretrained`], the adapter name is stripped from the keys. The reason is that the adapter name is not an important part of the model architecture; it is just an arbitrary name. When loading the adapter, you could choose a totally different name, and the model would still work the same way. This is why the adapter name is not stored in the checkpoint file.
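
The relationship between the in-memory parameter names and the checkpoint keys can be sketched as a small string transformation. This is an illustration of the naming convention described above, not PEFT's actual implementation; `strip_adapter_name` is a hypothetical helper.

```python
# Layer names taken from `LoraLayer.adapter_layer_names` shown earlier.
ADAPTER_LAYER_NAMES = ("lora_A", "lora_B", "lora_embedding_A", "lora_embedding_B")


def strip_adapter_name(key, adapter_name="default"):
    """Drop the adapter-name segment that directly follows an adapter layer name."""
    parts = key.split(".")
    kept = []
    for i, part in enumerate(parts):
        follows_adapter_layer = i > 0 and parts[i - 1] in ADAPTER_LAYER_NAMES
        if part == adapter_name and follows_adapter_layer:
            continue  # skip ".default" (or ".other") but keep everything else
        kept.append(part)
    return ".".join(kept)


in_memory = "base_model.model.encoder.layer.0.attention.self.query.lora_A.default.weight"
print(strip_adapter_name(in_memory))
```

Both `...lora_A.default.weight` and `...lora_A.other.weight` map to the same checkpoint key, which is why the adapter can be loaded under any name.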
+
+> [!TIP]
+> If you call `save_pretrained("some/path")` and the adapter name is not `"default"`, the adapter is stored in a sub-directory with the same name as the adapter. So if the name is "other", it would be stored inside of `some/path/other`.
+
+In some circumstances, deciding which values to add to the checkpoint file can become a bit more complicated. For example, in PEFT, DoRA is implemented as a special case of LoRA. If you want to convert a DoRA model to PEFT, you should create a LoRA checkpoint with extra entries for DoRA. You can see this in the `__init__` of the previous `LoraLayer` code:
+
+```python
+self.lora_magnitude_vector: Optional[torch.nn.ParameterDict] = None  # for DoRA
+```
+
+This indicates that there is an optional extra parameter per layer for DoRA.
+
+### adapter_config
+
+All the other information needed to load a PEFT model is contained in the `adapter_config.json` file. Let's check this file for a LoRA model applied to BERT:
+
+```json
+{
+  "alpha_pattern": {},
+  "auto_mapping": {
+    "base_model_class": "BertModel",
+    "parent_library": "transformers.models.bert.modeling_bert"
+  },
+  "base_model_name_or_path": "bert-base-uncased",
+  "bias": "none",
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 8,
+  "lora_dropout": 0.0,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "query",
+    "value"
+  ],
+  "task_type": null,
+  "use_dora": false,
+  "use_rslora": false
+}
+```
+
+This contains a lot of entries, and at first glance, it could feel overwhelming to figure out all the right values to put in there. However, most of the entries are not necessary to load the model. This is either because they use the default values and don't need to be added or because they only affect the initialization of the LoRA weights, which is irrelevant when it comes to loading the model. If you find that you don't know what a specific parameter does, e.g., `"use_rslora"`, don't add it, and you should be fine. Also note that as more options are added, this file will get more entries in the future, but it should be backward compatible.
+
+At the minimum, you should include the following entries:
+
+```json
+{
+  "target_modules": ["query", "value"],
+  "peft_type": "LORA"
+}
+```
+
+However, adding as many entries as possible, like the rank `r` or the `base_model_name_or_path` (if it's a Transformers model) is recommended. This information can help others understand the model better and share it more easily. To check which keys and values are expected, check out the [config.py](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/config.py) file (as an example, this is the config file for LoRA) in the PEFT source code.
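
A minimal sketch of producing such a file with only the standard library is shown below. `write_adapter_config` is a hypothetical helper, and the extra entries (`r`, `base_model_name_or_path`) are the optional ones mentioned above; nothing here calls into PEFT itself.

```python
import json
import tempfile
from pathlib import Path


def write_adapter_config(directory, target_modules, peft_type="LORA", **extra):
    """Write an adapter_config.json with the required entries plus any optional ones."""
    config = {"target_modules": list(target_modules), "peft_type": peft_type}
    config.update(extra)  # optional entries such as r or base_model_name_or_path
    path = Path(directory) / "adapter_config.json"
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_adapter_config(
        tmp,
        ["query", "value"],
        r=8,
        base_model_name_or_path="bert-base-uncased",
    )
    loaded = json.loads(config_path.read_text())

print(loaded)
```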
+
+## Model storage
+
+In some circumstances, you might want to store the whole PEFT model, including the base weights. This can be necessary if, for instance, the base model is not available to the users trying to load the PEFT model. You can either merge the weights first or convert the model into a Transformers model.
+
+### Merge the weights
+
+The most straightforward way to store the whole PEFT model is to merge the adapter weights into the base weights:
+
+```python
+merged_model = model.merge_and_unload()
+merged_model.save_pretrained(...)
+```
+
+There are some disadvantages to this approach, though:
+
+- Once [`~LoraModel.merge_and_unload`] is called, you get a basic model without any PEFT-specific functionality. This means you can't use any of the PEFT-specific methods anymore.
+- You cannot unmerge the weights, load multiple adapters at once, disable the adapter, etc.
+- Not all PEFT methods support merging weights.
+- Some PEFT methods may generally allow merging, but not with specific settings (e.g. when using certain quantization techniques).
+- The whole model will be much larger than the PEFT model, as it will contain all the base weights as well.
+
+But inference with a merged model should be a bit faster.
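
To see why a merged model no longer needs the adapter at inference time, here is a conceptual sketch of merging for plain LoRA. It assumes the standard update `W + (lora_alpha / r) * B @ A` and ignores options such as `use_rslora`, `fan_in_fan_out`, and DoRA; nested lists stand in for tensors, and this is not the code PEFT runs.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_A, lora_B, lora_alpha, r):
    """Fold the low-rank update into the base weight: W + (lora_alpha / r) * B @ A."""
    scaling = lora_alpha / r
    delta = matmul(lora_B, lora_A)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out=2, in=2)
lora_A = [[1.0, 2.0]]              # shape (r=1, in=2)
lora_B = [[0.5], [0.0]]            # shape (out=2, r=1)

merged = merge_lora(weight, lora_A, lora_B, lora_alpha=2, r=1)
print(merged)
```

After merging, a single matrix multiplication with `merged` replaces the base layer plus the two adapter layers, which is where the small speedup comes from.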
+
+### Convert to a Transformers model
+
+Another way to save the whole model, assuming the base model is a Transformers model, is a hacky approach: directly insert the PEFT weights into the base model and save it. This only works if you "trick" Transformers into believing the PEFT model is not a PEFT model, and only with LoRA, because other adapters are not implemented in Transformers.
+
+```python
+from transformers import AutoModel
+
+model = ...  # the PEFT model
+...
+temp_location = "path/to/temp_dir"  # placeholder: temporary save directory
+final_location = "path/to/final_dir"  # placeholder: final save directory
+repo_id = "your-username/your-model"  # placeholder: Hub repository name
+# after you finish training the model, save it in a temporary location
+model.save_pretrained(temp_location)
+# now load this model directly into a transformers model, without the PEFT wrapper
+# the PEFT weights are directly injected into the base model
+model_loaded = AutoModel.from_pretrained(temp_location)
+# now make the loaded model believe that it is _not_ a PEFT model
+model_loaded._hf_peft_config_loaded = False
+# now when we save it, it will save the whole model
+model_loaded.save_pretrained(final_location)
+# or upload to Hugging Face Hub
+model_loaded.push_to_hub(repo_id)
+```
+
--- a/docs/source/developer_guides/contributing.md
+++ b/docs/source/developer_guides/contributing.md
@ -0,0 +1,96 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Contribute to PEFT
+
+We are happy to accept contributions to PEFT. If you plan to contribute, please read this to make the process as smooth as possible.
+
+## Installation
+
+For code contributions to PEFT, you should choose the ["source"](../install#source) installation method.
+
+If you are new to creating a pull request, follow the [Creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) guide by GitHub.
+
+## Tests and code quality checks
+
+Regardless of the contribution type (unless it’s only about the docs), you should run tests and code quality checks before creating a PR to ensure your contribution doesn’t break anything and follows the project standards.
+
+We provide a Makefile to execute the necessary tests. Run the command below to execute the unit tests:
+
+```sh
+make test
+```
+
+Run one of the following to either only check or check and fix code quality and style:
+
+```sh
+make quality  # just check
+make style  # check and fix
+```
+
+You can also set up [`pre-commit`](https://pre-commit.com/) to run these fixes
+automatically as Git commit hooks.
+
+```bash
+$ pip install pre-commit
+$ pre-commit install
+```
+
+Running all the tests can take a while, so during development it can be more efficient to only [run tests specific to your change](https://docs.pytest.org/en/6.2.x/usage.html#specifying-tests-selecting-tests), e.g. via:
+
+```sh
+pytest tests/<test-file-name> -k <name-of-test>
+```
+
+This should finish much quicker and allow for faster iteration.
+
+If your change is specific to a hardware setting (e.g., it requires CUDA), take a look at [tests/test_gpu_examples.py](https://github.com/huggingface/peft/blob/1c1c7fdaa6e6abaa53939b865dee1eded82ad032/tests/test_gpu_examples.py) and [tests/test_common_gpu.py](https://github.com/huggingface/peft/blob/1c1c7fdaa6e6abaa53939b865dee1eded82ad032/tests/test_common_gpu.py) to see if it makes sense to add tests there. If your change could have an effect on saving and loading models, please run the tests with the `--regression` flag to trigger regression tests.
+
+It can happen that while you’re working on your PR, the underlying code base changes due to other changes being merged. If that happens – especially when there is a merge conflict – please update your branch with the latest changes. This can be a merge or a rebase, and we'll squash and merge the PR once it’s ready. If possible, avoid force pushes to make reviews easier.
+
+## PR description
+
+When opening a PR, please provide a nice description of the change you're proposing. If it relates to other issues or PRs, please reference them. Providing a good description not only helps the reviewers review your code better and faster, it can also be used later (as a basis) for the commit message which helps with long term maintenance of the project.
+
+If your code makes some non-trivial changes, it may also be a good idea to add comments to the code to explain those changes. For example, if you had to iterate on your implementation multiple times because the most obvious way didn’t work, it’s a good indication that a code comment is needed.
+
+## Bugfixes
+
+Please give a description of the circumstances that led to the bug. If there is an existing issue, please link to it (e.g., “Resolves #12345”).
+
+Ideally when a bugfix is provided, it should be accompanied by a test for the bug. The test should fail with the current code and pass with the bugfix. Add a comment to the test that references the issue or PR. Without a test, it is more difficult to prevent regressions in the future.
+
+## Add a new fine-tuning method
+
+New parameter-efficient fine-tuning methods are developed all the time. If you would like to add a new and promising method to PEFT, please follow these steps.
+
+1. Before you start to implement the new method, please open a [GitHub issue](https://github.com/huggingface/peft/issues) with your proposal. This way, the maintainers can give you some early feedback.
+2. Please add a link to the source (usually a paper) of the method. The paper should be in a final state to avoid changing requirements during development (e.g. due to reviewer feedback).
+3. When implementing the method, it makes sense to look for existing implementations as a guide. Moreover, when you structure your code, please take inspiration from the other PEFT methods. For example, if your method is similar to LoRA, it makes sense to structure your code similarly or even reuse some functions or classes where it makes sense (some code duplication is okay, but don’t overdo it).
+4. Ideally, in addition to the implementation of the new method, there should also be
+   - [examples](https://github.com/huggingface/peft/tree/main/examples) (notebooks, scripts)
+   - [documentation](https://github.com/huggingface/peft/tree/main/docs/source)
+   - [extensive test suite](https://github.com/huggingface/peft/tree/main/tests) that proves the method correctly integrates with PEFT
+   - [experimental setup](https://github.com/huggingface/peft/tree/main/method_comparison#creating-new-experiments) to run benchmarks
+5. Once you have something that seems to be working, don’t hesitate to create a draft PR even if it’s not in a mergeable state yet. The maintainers are happy to give you feedback and guidance along the way.
+
+## Add other features
+
+It is best if you first open an issue on GitHub with a proposal to add the new feature. This way, you can discuss with the maintainers if it makes sense to add the feature before spending too much time on implementing it.
+
+New features should generally be accompanied by tests and documentation or examples. Without the latter, users will have a hard time discovering your cool new feature.
+
+Changes to the code should be implemented in a backward-compatible way. For example, existing code should continue to work the same way after the feature is merged.
--- a/docs/source/developer_guides/contributing.mdx
+++ b/docs/source/developer_guides/contributing.mdx
@ -1,89 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Contributing to PEFT
-
-We are happy to accept contributions to PEFT. If you plan to contribute, please read this document to make the process as smooth as possible.
-
-## Installation
-
-The installation instructions can be found [here](https://huggingface.co/docs/peft/install). If you want to provide code contributions to PEFT, you should choose the "source" installation method.
-
-If you are new to creating a pull request, follow [these instructions from GitHub](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request).
-
-## Running tests and code quality checks
-
-Regardless of the type of contribution (unless it’s only about the docs), you should run tests and code quality checks before creating a PR to ensure that your contribution doesn’t break anything and follows the standards of the project.
-
-We provide a Makefile to facilitate those steps. Run the code below for the unit test:
-
-```sh
-make test
-```
-
-Run one of the following to either check or check and fix code quality and style:
-
-```sh
-make quality  # just check
-make style  # check and fix
-```
-
-
-Running all the tests can take a couple of minutes. Therefore, during development, it can be useful to run only those tests specific to your change:
-
-```sh
-pytest tests/ -k <name-of-test>
-```
-
-This should finish much quicker and allow faster iteration. Before creating the PR, however, please still run the whole test suite, as some changes can inadvertently break tests that at first glance are unrelated.
-
-If your change is specific to a hardware setting (e.g. it requires CUDA), take a look at `tests/test_gpu_examples.py` and `tests/test_common_gpu.py` – maybe it makes sense to add a test there.
-
-It can happen that while you’re working on your PR, the underlying code base changes due to other changes being merged. If that happens – especially when there is a merge conflict – please update your branch to be on the latest changes. This can be a merge or a rebase, whatever you prefer. We will squash and merge the PR once it’s ready.
-
-## PR description
-
-When opening the PR, please provide a nice description of the change you provide. If it relates to other issues or PRs, please reference them. Providing a good description will not only help the reviewers review your code better and faster, it can also later be used (as a basis) for the commit message, which helps with long term maintenance of the project.
-
-If your code makes some non-trivial changes, it can also be a good idea to add comments to the code to explain those changes. For example, if you had to iterate on your implementation multiple times because the most obvious way didn’t work, it’s a good indication that a code comment is needed.
-
-## Providing a bugfix
-
-Please give a description of the circumstances that lead to the bug. If there is an existing issue, please link to it (e.g. “Resolves #12345”).
-
-Ideally, when a bugfix is provided, it should be accompanied by a test for this bug. The test should fail with the current code and pass with the bugfix. Add a comment to the test that references the issue or PR. Without such a test, it is difficult to prevent regressions in the future.
-
-## Adding a new fine-tuning method
-
-New parameter-efficient fine-tuning methods are developed all the time. If you would like to add a new, promising method to PEFT, please follow these steps.
-
-**Requirements**
-
-1. Please add a link to the source (usually a paper) of the method.
-2. Some evidence should be provided that there is general interest in using the method. We will not add new methods that are freshly published but without evidence that there is demand for it.
-3. Ideally, we want to not only add the implementation of the new method, but also examples (notebooks, scripts), documentation, and an extensive test suite that proves that the method works with a variety of tasks. However, this can be very daunting. Therefore, it is also acceptable to only provide the implementation and at least one working example. Documentation and tests can be added in follow up PRs.
-
-**Steps**
-
-Before you start to implement the new method, please open an issue on GitHub with your proposal. That way, the maintainers can give you some early feedback.
-
-When implementing the method, it makes sense to look for existing implementations that already exist as a guide. Moreover, when you structure your code, please take inspiration from the other PEFT methods. For example, if your method is similar to LoRA, it makes sense to structure your code similarly or even re-use some functions or classes where it makes sense (but don’t overdo it, some code duplication is okay).
-
-Once you have something that seems to be working, don’t hesitate to create a draft PR, even if it’s not in a mergeable state yet. The maintainers will be happy to give you feedback and guidance along the way.
-
-## Adding other features
-
-It is best if you first open an issue on GitHub with a proposal to add the new feature. That way, you can discuss with the maintainers if it makes sense to add the feature before spending too much time on implementing it.
-
-New features should generally be accompanied by tests and documentation or examples. Without the latter, users will have a hard time discovering your cool new feature.
-
-Changes to the code should be implemented in a backward-compatible way. For example, existing code should continue to work the same way after the feature is merged.
--- a/docs/source/developer_guides/custom_models.md
+++ b/docs/source/developer_guides/custom_models.md
@ -0,0 +1,304 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Custom models
+
+Some fine-tuning techniques, such as prompt tuning, are specific to language models. That means in 🤗 PEFT, it is
+assumed a 🤗 Transformers model is being used. However, other fine-tuning techniques - like
+[LoRA](../conceptual_guides/lora) - are not restricted to specific model types.
+
+In this guide, we will see how LoRA can be applied to a multilayer perceptron, a computer vision model from the [timm](https://huggingface.co/docs/timm/index) library, or a new 🤗 Transformers architecture.
+
+## Multilayer perceptron
+
+Let's assume that we want to fine-tune a multilayer perceptron with LoRA. Here is the definition:
+
+```python
+from torch import nn
+
+
+class MLP(nn.Module):
+    def __init__(self, num_units_hidden=2000):
+        super().__init__()
+        self.seq = nn.Sequential(
+            nn.Linear(20, num_units_hidden),
+            nn.ReLU(),
+            nn.Linear(num_units_hidden, num_units_hidden),
+            nn.ReLU(),
+            nn.Linear(num_units_hidden, 2),
+            nn.LogSoftmax(dim=-1),
+        )
+
+    def forward(self, X):
+        return self.seq(X)
+```
+
+This is a straightforward multilayer perceptron with an input layer, a hidden layer, and an output layer.
+
+> [!TIP]
+> For this toy example, we choose an exceedingly large number of hidden units to highlight the efficiency gains
+> from PEFT, but those gains are in line with more realistic examples.
+
+There are a few linear layers in this model that could be tuned with LoRA. When working with common 🤗 Transformers
+models, PEFT will know which layers to apply LoRA to, but in this case, it is up to us as a user to choose the layers.
+To determine the names of the layers to tune:
+
+```python
+print([(n, type(m)) for n, m in MLP().named_modules()])
+```
+
+This should print:
+
+```
+[('', __main__.MLP),
+ ('seq', torch.nn.modules.container.Sequential),
+ ('seq.0', torch.nn.modules.linear.Linear),
+ ('seq.1', torch.nn.modules.activation.ReLU),
+ ('seq.2', torch.nn.modules.linear.Linear),
+ ('seq.3', torch.nn.modules.activation.ReLU),
+ ('seq.4', torch.nn.modules.linear.Linear),
+ ('seq.5', torch.nn.modules.activation.LogSoftmax)]
+```
+
+Let's say we want to apply LoRA to the input layer and to the hidden layer, those are `'seq.0'` and `'seq.2'`. Moreover,
+let's assume we want to update the output layer without LoRA, that would be `'seq.4'`. The corresponding config would
+be:
+
+```python
+from peft import LoraConfig
+
+config = LoraConfig(
+    target_modules=["seq.0", "seq.2"],
+    modules_to_save=["seq.4"],
+)
+```
+
+With that, we can create our PEFT model and check the fraction of parameters trained:
+
+```python
+from peft import get_peft_model
+
+model = MLP()
+peft_model = get_peft_model(model, config)
+peft_model.print_trainable_parameters()
+# prints trainable params: 56,164 || all params: 4,100,164 || trainable%: 1.369798866581922
+```
+
+Finally, we can use any training framework we like, or write our own fit loop, to train the `peft_model`.
+
+For a complete example, check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/multilayer_perceptron/multilayer_perceptron_lora.ipynb).
+
+## timm models
+
+The [timm](https://huggingface.co/docs/timm/index) library contains a large number of pretrained computer vision models.
+Those can also be fine-tuned with PEFT. Let's check out how this works in practice.
+
+To start, ensure that timm is installed in the Python environment:
+
+```bash
+python -m pip install -U timm
+```
+
+Next we load a timm model for an image classification task:
+
+```python
+import timm
+
+num_classes = ...
+model_id = "timm/poolformer_m36.sail_in1k"
+model = timm.create_model(model_id, pretrained=True, num_classes=num_classes)
+```
+
+Again, we need to make a decision about what layers to apply LoRA to. Since LoRA supports 2D conv layers, and since
+those are a major building block of this model, we should apply LoRA to the 2D conv layers. To identify the names of
+those layers, let's look at all the layer names:
+
+```python
+print([(n, type(m)) for n, m in model.named_modules()])
+```
+
+This will print a very long list, so we only show the first and last few entries:
+
+```
+[('', timm.models.metaformer.MetaFormer),
+ ('stem', timm.models.metaformer.Stem),
+ ('stem.conv', torch.nn.modules.conv.Conv2d),
+ ('stem.norm', torch.nn.modules.linear.Identity),
+ ('stages', torch.nn.modules.container.Sequential),
+ ('stages.0', timm.models.metaformer.MetaFormerStage),
+ ('stages.0.downsample', torch.nn.modules.linear.Identity),
+ ('stages.0.blocks', torch.nn.modules.container.Sequential),
+ ('stages.0.blocks.0', timm.models.metaformer.MetaFormerBlock),
+ ('stages.0.blocks.0.norm1', timm.layers.norm.GroupNorm1),
+ ('stages.0.blocks.0.token_mixer', timm.models.metaformer.Pooling),
+ ('stages.0.blocks.0.token_mixer.pool', torch.nn.modules.pooling.AvgPool2d),
+ ('stages.0.blocks.0.drop_path1', torch.nn.modules.linear.Identity),
+ ('stages.0.blocks.0.layer_scale1', timm.models.metaformer.Scale),
+ ('stages.0.blocks.0.res_scale1', torch.nn.modules.linear.Identity),
+ ('stages.0.blocks.0.norm2', timm.layers.norm.GroupNorm1),
+ ('stages.0.blocks.0.mlp', timm.layers.mlp.Mlp),
+ ('stages.0.blocks.0.mlp.fc1', torch.nn.modules.conv.Conv2d),
+ ('stages.0.blocks.0.mlp.act', torch.nn.modules.activation.GELU),
+ ('stages.0.blocks.0.mlp.drop1', torch.nn.modules.dropout.Dropout),
+ ('stages.0.blocks.0.mlp.norm', torch.nn.modules.linear.Identity),
+ ('stages.0.blocks.0.mlp.fc2', torch.nn.modules.conv.Conv2d),
+ ('stages.0.blocks.0.mlp.drop2', torch.nn.modules.dropout.Dropout),
+ ('stages.0.blocks.0.drop_path2', torch.nn.modules.linear.Identity),
+ ('stages.0.blocks.0.layer_scale2', timm.models.metaformer.Scale),
+ ('stages.0.blocks.0.res_scale2', torch.nn.modules.linear.Identity),
+ ('stages.0.blocks.1', timm.models.metaformer.MetaFormerBlock),
+ ('stages.0.blocks.1.norm1', timm.layers.norm.GroupNorm1),
+ ('stages.0.blocks.1.token_mixer', timm.models.metaformer.Pooling),
+ ('stages.0.blocks.1.token_mixer.pool', torch.nn.modules.pooling.AvgPool2d),
+ ...
+ ('head.global_pool.flatten', torch.nn.modules.linear.Identity),
+ ('head.norm', timm.layers.norm.LayerNorm2d),
+ ('head.flatten', torch.nn.modules.flatten.Flatten),
+ ('head.drop', torch.nn.modules.linear.Identity),
+ ('head.fc', torch.nn.modules.linear.Linear)]
+```
+
+Upon closer inspection, we see that the 2D conv layers have names such as `"stages.0.blocks.0.mlp.fc1"` and
+`"stages.0.blocks.0.mlp.fc2"`. How can we match those layer names specifically? You can write a [regular
+expression](https://docs.python.org/3/library/re.html) to match the layer names. For our case, the regex
+`r".*\.mlp\.fc\d"` should do the job.
+
+Furthermore, as in the first example, we should ensure that the output layer, in this case the classification head, is
+also updated. Looking at the end of the list printed above, we can see that it's named `'head.fc'`. With that in mind,
+here is our LoRA config:
+
+```python
+config = LoraConfig(target_modules=r".*\.mlp\.fc\d", modules_to_save=["head.fc"])
+```
+
+Then we only need to create the PEFT model by passing our base model and the config to `get_peft_model`:
+
+```python
+peft_model = get_peft_model(model, config)
+peft_model.print_trainable_parameters()
+# prints trainable params: 1,064,454 || all params: 56,467,974 || trainable%: 1.88505789139876
+```
+
+This shows us that we only need to train less than 2% of all parameters, which is a huge efficiency gain.
+
+For a complete example, check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/image_classification/image_classification_timm_peft_lora.ipynb).
+
+## New transformers architectures
+
+When new popular transformers architectures are released, we do our best to quickly add them to PEFT. If you come across a transformers model that is not supported out of the box, don't worry, it will most likely still work if the config is set correctly. Specifically, you have to identify the layers that should be adapted and set them correctly when initializing the corresponding config class, e.g. `LoraConfig`. Here are some tips to help with this.
+
+As a first step, it is a good idea to check the existing models for inspiration. You can find them inside of [constants.py](https://github.com/huggingface/peft/blob/main/src/peft/utils/constants.py) in the PEFT repository. Often, you'll find a similar architecture that uses the same names. For example, if the new model architecture is a variation of the "mistral" model and you want to apply LoRA, you can see that the entry for "mistral" in `TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING` contains `["q_proj", "v_proj"]`. This tells you that for "mistral" models, the `target_modules` for LoRA should be `["q_proj", "v_proj"]`:
+
+```python
+from peft import LoraConfig, get_peft_model
+
+my_mistral_model = ...
+config = LoraConfig(
+    target_modules=["q_proj", "v_proj"],
+    ...,  # other LoRA arguments
+)
+peft_model = get_peft_model(my_mistral_model, config)
+```
+
+If that doesn't help, check the existing modules in your model architecture with the `named_modules` method and try to identify the attention layers, especially the key, query, and value layers. Those will often have names such as `c_attn`, `query`, `q_proj`, etc. The key layer is not always adapted, and ideally, you should check whether including it results in better performance.
+
+Additionally, linear layers are common targets to be adapted (e.g. in the [QLoRA paper](https://huggingface.co/papers/2305.14314), the authors suggest adapting them as well). Their names will often contain the strings `fc` or `dense`.
+
+If you want to add a new model to PEFT, please create an entry in [constants.py](https://github.com/huggingface/peft/blob/main/src/peft/utils/constants.py) and open a pull request on the [repository](https://github.com/huggingface/peft/pulls). Don't forget to update the [README](https://github.com/huggingface/peft#models-support-matrix) as well.
+
+## Verify parameters and layers
+
+You can verify whether you've correctly applied a PEFT method to your model in a few ways.
+
+* Check the fraction of parameters that are trainable with the [`~PeftModel.print_trainable_parameters`] method. If this number is lower or higher than expected, check the model `repr` by printing the model. This shows the names of all the layer types in the model. Ensure that only the intended target layers are replaced by the adapter layers. For example, if LoRA is applied to `nn.Linear` layers, then you should only see `lora.Linear` layers being used.
+
+```py
+peft_model.print_trainable_parameters()
+```
+
+* Another way you can view the adapted layers is to use the `targeted_module_names` attribute to list the name of each module that was adapted.
+
+```python
+print(peft_model.targeted_module_names)
+```
+
+## Unsupported module types
+
+Methods like LoRA only work if the target modules are supported by PEFT. For example, it's possible to apply LoRA to `nn.Linear` and `nn.Conv2d` layers, but not, for instance, to `nn.LSTM`. If you find a layer class you want to apply PEFT to is not supported, you can:
+
+ - define a custom mapping to dynamically dispatch custom modules in LoRA
+ - open an [issue](https://github.com/huggingface/peft/issues) and request the feature; if demand for this module type is sufficiently high, the maintainers will implement it or guide you on how to implement it yourself
+
+### Experimental support for dynamic dispatch of custom modules in LoRA
+
+> [!WARNING]
+> This feature is experimental and subject to change, depending on its reception by the community. We will introduce a public and stable API if there is significant demand for it.
+
+PEFT supports an experimental API for custom module types for LoRA. Let's assume you have a LoRA implementation for LSTMs. Normally, you would not be able to tell PEFT to use it, even if it would theoretically work with PEFT. However, this is possible with dynamic dispatch of custom layers.
+
+The experimental API currently looks like this:
+
+```python
+class MyLoraLSTMLayer:
+    ...
+
+base_model = ...  # load the base model that uses LSTMs
+
+# add the LSTM layer names to target_modules
+config = LoraConfig(..., target_modules=["lstm"])
+# define a mapping from base layer type to LoRA layer type
+custom_module_mapping = {nn.LSTM: MyLoraLSTMLayer}
+# register the new mapping
+config._register_custom_module(custom_module_mapping)
+# after registration, create the PEFT model
+peft_model = get_peft_model(base_model, config)
+# do training
+```
+
+> [!TIP]
+> When you call [`get_peft_model`], you will see a warning because PEFT does not recognize the targeted module type. In this case, you can ignore this warning.
+
+By supplying a custom mapping, PEFT first checks the base model's layers against the custom mapping and dispatches to the custom LoRA layer type if there is a match. If there is no match, PEFT checks the built-in LoRA layer types for a match.
+
+Therefore, this feature can also be used to override existing dispatch logic, e.g. if you want to use your own LoRA layer for `nn.Linear` instead of using the one provided by PEFT.
+
+When creating your custom LoRA module, please follow the same rules as the [existing LoRA modules](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py). Some important constraints to consider:
+
+- The custom module should inherit from `nn.Module` and `peft.tuners.lora.layer.LoraLayer`.
+- The `__init__` method of the custom module should have the positional arguments `base_layer` and `adapter_name`. After this, there are additional `**kwargs` that you are free to use or ignore.
+- The learnable parameters should be stored in an `nn.ModuleDict` or `nn.ParameterDict`, where the key corresponds to the name of the specific adapter (remember that a model can have more than one adapter at a time).
+- The name of these learnable parameter attributes should start with `"lora_"`, e.g. `self.lora_new_param = ...`.
+- Some methods are optional, e.g. you only need to implement `merge` and `unmerge` if you want to support weight merging.
+
+Currently, the information about the custom module does not persist when you save the model. When loading the model, you have to register the custom modules again.
+
+```python
+# saving works as always and includes the parameters of the custom modules
+peft_model.save_pretrained(<model-path>)
+
+# loading the model later:
+base_model = ...
+# load the LoRA config that you saved earlier
+config = LoraConfig.from_pretrained(<model-path>)
+# register the custom module again, the same way as the first time
+custom_module_mapping = {nn.LSTM: MyLoraLSTMLayer}
+config._register_custom_module(custom_module_mapping)
+# pass the config instance to from_pretrained:
+peft_model = PeftModel.from_pretrained(base_model, <model-path>, config=config)
+```
+
+If you use this feature and find it useful, or if you encounter problems, let us know by creating an issue or a discussion on GitHub. This allows us to estimate the demand for this feature and add a public API if it is sufficiently high.
--- a/docs/source/developer_guides/custom_models.mdx
+++ b/docs/source/developer_guides/custom_models.mdx
@ -1,197 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Working with custom models
-
-Some fine-tuning techniques, such as prompt tuning, are specific to language models. That means in 🤗 PEFT, it is
-assumed a 🤗 Transformers model is being used. However, other fine-tuning techniques - like
-[LoRA](./conceptual_guides/lora) - are not restricted to specific model types.
-
-In this guide, we will see how LoRA can be applied to a multilayer perceptron and a computer vision model from the [timm](https://huggingface.co/docs/timm/index) library.
-
-## Multilayer perceptron
-
-Let's assume that we want to fine-tune a multilayer perceptron with LoRA. Here is the definition:
-
-```python
-from torch import nn
-
-
-class MLP(nn.Module):
-    def __init__(self, num_units_hidden=2000):
-        super().__init__()
-        self.seq = nn.Sequential(
-            nn.Linear(20, num_units_hidden),
-            nn.ReLU(),
-            nn.Linear(num_units_hidden, num_units_hidden),
-            nn.ReLU(),
-            nn.Linear(num_units_hidden, 2),
-            nn.LogSoftmax(dim=-1),
-        )
-
-    def forward(self, X):
-        return self.seq(X)
-```
-
-This is a straightforward multilayer perceptron with an input layer, a hidden layer, and an output layer. 
-
-<Tip>
-
-For this toy example, we choose an exceedingly large number of hidden units to highlight the efficiency gains
-from PEFT, but those gains are in line with more realistic examples.
-
-</Tip>
-
-There are a few linear layers in this model that could be tuned with LoRA. When working with common 🤗 Transformers
-models, PEFT will know which layers to apply LoRA to, but in this case, it is up to us as a user to choose the layers.
-To determine the names of the layers to tune:
-
-```python
-print([(n, type(m)) for n, m in MLP().named_modules()])
-```
-
-This should print:
-
-```
-[('', __main__.MLP),
- ('seq', torch.nn.modules.container.Sequential),
- ('seq.0', torch.nn.modules.linear.Linear),
- ('seq.1', torch.nn.modules.activation.ReLU),
- ('seq.2', torch.nn.modules.linear.Linear),
- ('seq.3', torch.nn.modules.activation.ReLU),
- ('seq.4', torch.nn.modules.linear.Linear),
- ('seq.5', torch.nn.modules.activation.LogSoftmax)]
-```
-
-Let's say we want to apply LoRA to the input layer and to the hidden layer, those are `'seq.0'` and `'seq.2'`. Moreover,
-let's assume we want to update the output layer without LoRA, that would be `'seq.4'`. The corresponding config would
-be:
-
-```python
-from peft import LoraConfig
-
-config = LoraConfig(
-    target_modules=["seq.0", "seq.2"],
-    modules_to_save=["seq.4"],
-)
-```
-
-With that, we can create our PEFT model and check the fraction of parameters trained:
-
-```python
-from peft import get_peft_model
-
-model = MLP()
-peft_model = get_peft_model(model, config)
-peft_model.print_trainable_parameters()
-# prints trainable params: 56,164 || all params: 4,100,164 || trainable%: 1.369798866581922
-```
-
-Finally, we can use any training framework we like, or write our own fit loop, to train the `peft_model`.
-
-For a complete example, check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/multilayer_perceptron/multilayer_perceptron_lora.ipynb).
-
-## timm model
-
-The [timm](https://huggingface.co/docs/timm/index) library contains a large number of pretrained computer vision models.
-Those can also be fine-tuned with PEFT. Let's check out how this works in practice.
-
-To start, ensure that timm is installed in the Python environment:
-
-```bash
-python -m pip install -U timm
-```
-
-Next we load a timm model for an image classification task:
-
-```python
-import timm
-
-num_classes = ...
-model_id = "timm/poolformer_m36.sail_in1k"
-model = timm.create_model(model_id, pretrained=True, num_classes=num_classes)
-```
-
-Again, we need to make a decision about what layers to apply LoRA to. Since LoRA supports 2D conv layers, and since
-those are a major building block of this model, we should apply LoRA to the 2D conv layers. To identify the names of
-those layers, let's look at all the layer names:
-
-```python
-print([(n, type(m)) for n, m in MLP().named_modules()])
-```
-
-This will print a very long list, we'll only show the first few:
-
-```
-[('', timm.models.metaformer.MetaFormer),
- ('stem', timm.models.metaformer.Stem),
- ('stem.conv', torch.nn.modules.conv.Conv2d),
- ('stem.norm', torch.nn.modules.linear.Identity),
- ('stages', torch.nn.modules.container.Sequential),
- ('stages.0', timm.models.metaformer.MetaFormerStage),
- ('stages.0.downsample', torch.nn.modules.linear.Identity),
- ('stages.0.blocks', torch.nn.modules.container.Sequential),
- ('stages.0.blocks.0', timm.models.metaformer.MetaFormerBlock),
- ('stages.0.blocks.0.norm1', timm.layers.norm.GroupNorm1),
- ('stages.0.blocks.0.token_mixer', timm.models.metaformer.Pooling),
- ('stages.0.blocks.0.token_mixer.pool', torch.nn.modules.pooling.AvgPool2d),
- ('stages.0.blocks.0.drop_path1', torch.nn.modules.linear.Identity),
- ('stages.0.blocks.0.layer_scale1', timm.models.metaformer.Scale),
- ('stages.0.blocks.0.res_scale1', torch.nn.modules.linear.Identity),
- ('stages.0.blocks.0.norm2', timm.layers.norm.GroupNorm1),
- ('stages.0.blocks.0.mlp', timm.layers.mlp.Mlp),
- ('stages.0.blocks.0.mlp.fc1', torch.nn.modules.conv.Conv2d),
- ('stages.0.blocks.0.mlp.act', torch.nn.modules.activation.GELU),
- ('stages.0.blocks.0.mlp.drop1', torch.nn.modules.dropout.Dropout),
- ('stages.0.blocks.0.mlp.norm', torch.nn.modules.linear.Identity),
- ('stages.0.blocks.0.mlp.fc2', torch.nn.modules.conv.Conv2d),
- ('stages.0.blocks.0.mlp.drop2', torch.nn.modules.dropout.Dropout),
- ('stages.0.blocks.0.drop_path2', torch.nn.modules.linear.Identity),
- ('stages.0.blocks.0.layer_scale2', timm.models.metaformer.Scale),
- ('stages.0.blocks.0.res_scale2', torch.nn.modules.linear.Identity),
- ('stages.0.blocks.1', timm.models.metaformer.MetaFormerBlock),
- ('stages.0.blocks.1.norm1', timm.layers.norm.GroupNorm1),
- ('stages.0.blocks.1.token_mixer', timm.models.metaformer.Pooling),
- ('stages.0.blocks.1.token_mixer.pool', torch.nn.modules.pooling.AvgPool2d),
- ...
- ('head.global_pool.flatten', torch.nn.modules.linear.Identity),
- ('head.norm', timm.layers.norm.LayerNorm2d),
- ('head.flatten', torch.nn.modules.flatten.Flatten),
- ('head.drop', torch.nn.modules.linear.Identity),
- ('head.fc', torch.nn.modules.linear.Linear)]
- ]
-```
-
-Upon closer inspection, we see that the 2D conv layers have names such as `"stages.0.blocks.0.mlp.fc1"` and
-`"stages.0.blocks.0.mlp.fc2"`. How can we match those layer names specifically? You can write a [regular
-expressions](https://docs.python.org/3/library/re.html) to match the layer names. For our case, the regex
-`r".*\.mlp\.fc\d"` should do the job.
-
-Furthermore, as in the first example, we should ensure that the output layer, in this case the classification head, is
-also updated. Looking at the end of the list printed above, we can see that it's named `'head.fc'`. With that in mind,
-here is our LoRA config:
-
-```python
-config = LoraConfig(target_modules=r".*\.mlp\.fc\d", modules_to_save=["head.fc"])
-```
-
-Then we only need to create the PEFT model by passing our base model and the config to `get_peft_model`:
-
-```python
-peft_model = get_peft_model(model, config)
-peft_model.print_trainable_parameters()
-# prints trainable params: 1,064,454 || all params: 56,467,974 || trainable%: 1.88505789139876
-```
-
-This shows us that we only need to train less than 2% of all parameters, which is a huge efficiency gain.
-
-For a complete example, check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/image_classification/image_classification_timm_peft_lora.ipynb).
--- a/docs/source/developer_guides/lora.md
+++ b/docs/source/developer_guides/lora.md
@ -0,0 +1,822 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# LoRA
+
+LoRA is a low-rank decomposition method that reduces the number of trainable parameters, which speeds up fine-tuning of large models and uses less memory. In PEFT, using LoRA is as easy as setting up a [`LoraConfig`] and wrapping it with [`get_peft_model`] to create a trainable [`PeftModel`].
+
+This guide explores in more detail other options and features for using LoRA.
+
+## Initialization
+
+The initialization of LoRA weights is controlled by the parameter `init_lora_weights` in [`LoraConfig`]. By default, PEFT initializes LoRA weights with Kaiming-uniform for weight A and zeros for weight B resulting in an identity transform (same as the reference [implementation](https://github.com/microsoft/LoRA)).
+
+It is also possible to pass `init_lora_weights="gaussian"`. As the name suggests, this initializes weight A with a Gaussian distribution and zeros for weight B (this is how [Diffusers](https://huggingface.co/docs/diffusers/index) initializes LoRA weights).
+
+```py
+from peft import LoraConfig
+
+config = LoraConfig(init_lora_weights="gaussian", ...)
+```
+
+There is also an option to set `init_lora_weights=False` which is useful for debugging and testing. This should be the only time you use this option. When choosing this option, the LoRA weights are initialized such that they do *not* result in an identity transform.
+
+```py
+from peft import LoraConfig
+
+config = LoraConfig(init_lora_weights=False, ...)
+```
+
+### PiSSA
+[PiSSA](https://huggingface.co/papers/2404.02948) initializes the LoRA adapter using the principal singular values and singular vectors. This straightforward modification allows PiSSA to converge more rapidly than LoRA and ultimately attain superior performance. Moreover, PiSSA reduces the quantization error compared to QLoRA, leading to further enhancements.
+
+Configure the initialization method to "pissa", which may take several minutes to execute SVD on the pre-trained model:
+```python
+from peft import LoraConfig
+config = LoraConfig(init_lora_weights="pissa", ...)
+```
+Alternatively, execute fast SVD, which takes only a few seconds. The number of iterations determines the trade-off between the error and computation time:
+```python
+lora_config = LoraConfig(init_lora_weights="pissa_niter_[number of iters]", ...)
+```
+For detailed instruction on using PiSSA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/pissa_finetuning).
+
+### CorDA
+
+[CorDA](https://huggingface.co/papers/2406.05223) builds task-aware LoRA adapters from weight decomposition oriented by the context of downstream task to learn (instruction-previewed mode, IPM) or world knowledge to maintain (knowledge-preserved mode, KPM).
+The KPM not only achieves better performance than LoRA on fine-tuning tasks, but also mitigates the catastrophic forgetting of pre-trained world knowledge.
+When preserving pre-trained knowledge is not a concern,
+the IPM is favored because it can further accelerate convergence and enhance the fine-tuning performance.
+
+You need to configure the initialization method to "corda", and specify the mode of IPM or KPM and the dataset to collect covariance matrices.
+
+```py
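+import torch
+from peft import LoraConfig, get_peft_model
+from peft.tuners.lora.config import CordaConfig
+from peft.tuners.lora.corda import preprocess_corda
+
+
+@torch.no_grad()
+def run_model():
+    # Assume `model` and `dataset` are in context...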
+    model.eval()
+    for batch in dataset:
+        model(**batch)
+
+
+corda_config = CordaConfig(
+    corda_method="kpm",
+)
+lora_config = LoraConfig(
+    init_lora_weights="corda",
+    corda_config=corda_config,
+)
+preprocess_corda(model, lora_config, run_model=run_model)
+peft_model = get_peft_model(model, lora_config)
+```
+
+For detailed instruction on using CorDA, please follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/corda_finetuning).
+
+### OLoRA
+[OLoRA](https://huggingface.co/papers/2406.01775) utilizes QR decomposition to initialize the LoRA adapters. OLoRA translates the base weights of the model by a factor of their QR decompositions, i.e., it mutates the weights before performing any training on them. This approach significantly improves stability, accelerates convergence speed, and ultimately achieves superior performance.
+
+You just need to pass a single additional option to use OLoRA:
+```python
+from peft import LoraConfig
+config = LoraConfig(init_lora_weights="olora", ...)
+```
+For more advanced usage, please refer to our [documentation](https://github.com/huggingface/peft/tree/main/examples/olora_finetuning).
+
+### EVA
+[EVA](https://huggingface.co/papers/2410.07170) performs SVD on the input activations of each layer and uses the right-singular vectors to initialize LoRA weights. It is therefore a data-driven initialization scheme. Furthermore, EVA adaptively allocates ranks across layers based on their "explained variance ratio", a metric derived from the SVD analysis.
+
+You can use EVA by setting `init_lora_weights="eva"` and defining [`EvaConfig`] in [`LoraConfig`]:
+```python
+from peft import LoraConfig, EvaConfig
+peft_config = LoraConfig(
+    init_lora_weights = "eva",
+    eva_config = EvaConfig(rho = 2.0),
+    ...
+)
+```
+The parameter `rho` (≥ 1.0) determines how much redistribution is allowed. When `rho=1.0` and `r=16`, LoRA adapters are limited to exactly 16 ranks, preventing any redistribution from occurring. A recommended value for EVA with redistribution is 2.0, meaning the maximum rank allowed for a layer is 2r.
+
+It is recommended to perform EVA initialization on an accelerator (e.g. CUDA GPU, Intel XPU) as it is much faster. To optimize the amount of available memory for EVA, you can use the `low_cpu_mem_usage` flag in [`get_peft_model`]:
+```python
+peft_model = get_peft_model(model, peft_config, low_cpu_mem_usage=True)
+```
+Then, call [`initialize_lora_eva_weights`] to initialize the EVA weights (in most cases the dataloader used for eva initialization can be the same as the one used for finetuning):
+```python
+initialize_lora_eva_weights(peft_model, dataloader)
+```
+EVA works out of the box with bitsandbytes. Simply initialize the model with `quantization_config` and call [`initialize_lora_eva_weights`] as usual.
+
+> [!TIP]
+> For further instructions on using EVA, please refer to our [documentation](https://github.com/huggingface/peft/tree/main/examples/eva_finetuning).
+
+### LoftQ
+
+#### Standard approach
+
+When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://huggingface.co/papers/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).
+
+In general, for LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied. This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results. Also, you should use `nf4` as quant type in your quantization config when using 4bit quantization, i.e. `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`.
+
+#### A more convenient way
+
+An easier but more limited way to apply LoftQ initialization is to use the convenience function `replace_lora_weights_loftq`. This takes the quantized PEFT model as input and replaces the LoRA weights in-place with their LoftQ-initialized counterparts.
+
+```python
+from peft import replace_lora_weights_loftq
+from transformers import BitsAndBytesConfig
+
+bnb_config = BitsAndBytesConfig(load_in_4bit=True, ...)
+base_model = AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)
+# note: don't pass init_lora_weights="loftq" or loftq_config!
+lora_config = LoraConfig(task_type="CAUSAL_LM")
+peft_model = get_peft_model(base_model, lora_config)
+replace_lora_weights_loftq(peft_model)
+```
+
+`replace_lora_weights_loftq` also allows you to pass a `callback` argument to give you more control over which layers should be modified or not, which empirically can improve the results quite a lot. To see a more elaborate example of this, check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/loftq_finetuning/LoftQ_weight_replacement.ipynb).
+
+`replace_lora_weights_loftq` implements only one iteration step of LoftQ. This means that only the LoRA weights are updated, instead of iteratively updating LoRA weights and quantized base model weights. This may lead to lower performance but has the advantage that we can use the original quantized weights derived from the base model, instead of having to keep an extra copy of modified quantized weights. Whether this tradeoff is worthwhile depends on the use case.
+
+At the moment, `replace_lora_weights_loftq` has these additional limitations:
+
+- Model files must be stored as a `safetensors` file.
+- Only bitsandbytes 4bit quantization is supported.
+
+> [!TIP]
+> Learn more about how PEFT works with quantization in the [Quantization](quantization) guide.
+
+### Rank-stabilized LoRA
+
+Another way to initialize [`LoraConfig`] is with the [rank-stabilized LoRA (rsLoRA)](https://huggingface.co/papers/2312.03732) method. The LoRA architecture scales each adapter during every forward pass by a fixed scalar which is set at initialization and depends on the rank `r`. The scalar is given by `lora_alpha/r` in the original implementation, but rsLoRA uses `lora_alpha/math.sqrt(r)` which stabilizes the adapters and increases the performance potential from using a higher `r`.
+
+```py
+from peft import LoraConfig
+
+config = LoraConfig(use_rslora=True, ...)
+```
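+
+To make the difference concrete, here is a minimal sketch that evaluates both scaling factors for a few ranks. The helper `lora_scaling` is a hypothetical name used only for illustration; it simply evaluates the two formulas quoted above and is not PEFT's internal code.
+
+```python
+import math
+
+
+def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
+    """Scalar applied to the adapter output (illustrative helper, not a PEFT API)."""
+    if use_rslora:
+        return lora_alpha / math.sqrt(r)  # rsLoRA: shrinks with sqrt(r)
+    return lora_alpha / r  # original LoRA: shrinks linearly with r
+
+
+for r in (8, 64, 256):
+    print(r, lora_scaling(16, r), lora_scaling(16, r, use_rslora=True))
+```
+
+With `lora_alpha=16`, the original factor drops from 2.0 at `r=8` to 0.0625 at `r=256`, while the rsLoRA factor only drops from about 5.66 to 1.0, so the adapter's contribution is not scaled away as quickly at high ranks.
+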
+### Activated LoRA (aLoRA)
+
+Activated LoRA (aLoRA) is a low rank adapter architecture for Causal LMs that allows for reusing existing base model KV cache for more efficient inference. This approach is best suited for inference pipelines which rely on the base model for most tasks/generations, but use aLoRA adapter(s) to perform specialized task(s) within the chain. For example, checking or correcting generated outputs of the base model. In these settings, inference times can be sped up by an order of magnitude or more. For more information on aLoRA and many example use cases, see https://huggingface.co/papers/2504.12397.
+
+This technique scans for the last occurrence of an invocation sequence (`alora_invocation_tokens`) in each input (this can be as short as 1 token), and activates the adapter weights on tokens starting with the beginning of the invocation sequence (any inputs after the invocation sequence are also adapted, and all generated tokens will use the adapted weights). Weights on prior tokens are left un-adapted -- making the cache for those tokens interchangeable with base model cache due to the causal attention mask in Causal LMs. Usage is very similar to standard LoRA, with the key difference that this invocation sequence must be specified when the adapter is created:
+
+```py
+from peft import LoraConfig
+
+config = LoraConfig(alora_invocation_tokens=alora_invocation_tokens, task_type="CAUSAL_LM", ...)
+```
+
+where `alora_invocation_tokens` is a list of integer token ids. Given a desired invocation string, this can be obtained as follows:
+```py
+invocation_string = "placeholder"
+alora_invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
+```
+where `tokenizer` is the tokenizer for the base model. Note that we pass `add_special_tokens=False` to avoid adding SOS/EOS tokens to the search sequence (which would most likely cause the search to fail).
+
+**Notes**
+* aLoRA is only supported for `task_type=CAUSAL_LM` tasks due to its focus on cache reuse.
+* Since the weights are adapted on fewer tokens, often (not always) aLoRA requires higher rank (`r`) than LoRA. `r=32` can be a good starting point.
+* aLoRA weights cannot be merged into the base model by definition, since the adapter weights are selectively applied to a subset of tokens. Attempts to merge will throw errors.
+* Beam search is not yet supported.
+* It is generally not recommended to add new tokens to the tokenizer that are not present in the base model, as this can complicate the target use case of both the base model and adapter model operating on overlapping context. That said, there is a possible workaround by first efficiently adding [trainable tokens](https://huggingface.co/docs/peft/en/package_reference/trainable_tokens) to the base model prior to training the adapter.
+
+#### Choice of invocation sequence and SFT design 
+
+Each input must contain the `alora_invocation_tokens` sequence; it is not added automatically. To maximize model performance without compromising cache reuse, it is recommended to have the adapter weights activated early, i.e. at the start of any adapter-specific prompting, but after any long inputs such as prior generations or documents. As with any model,
+formatting should be consistent between train and test.
+
+Consider the following example, where the base model has a chat template,
+and the goal is to train the adapter to generate a desired output. 
+
+* Option 1: If there is no task-specific prompt, i.e. the input is a chat history with the `assistant` prompt, then the chat template's `assistant` prompt (e.g. `<|start_of_role|>assistant<|end_of_role|>`) is a natural choice for the invocation string. See the model's chat template to find the prompt for the model.
+* Option 2: If there is a task-specific prompt for the adapter that describes the task the adapter is learning, and that prompt is put as a `user` turn immediately prior to the generation, then the chat template's `user` prompt (e.g. `<|start_of_role|>user<|end_of_role|>`) is a natural choice for the invocation string.
+
+Once you have decided on an invocation string, get the model tokenizer and obtain `alora_invocation_tokens` as follows:
+```py
+alora_invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
+```
+
+An example inference setup is at [alora finetuning](https://github.com/huggingface/peft/blob/main/examples/alora_finetuning/alora_finetuning.py).
+
+**Note** If using custom strings for the invocation string, make sure that the start and end of the string are special tokens to avoid issues with tokenization at the boundaries. 
+
+To see why, imagine that 'a', 'b', 'c', and 'ab' are tokens in your tokenizer (numbers 1, 2, 3, 4 respectively). Suppose that `alora_invocation_tokens = [2, 3]`. Now imagine your input string is "abc". Because "ab" is a token, this will get tokenized as `[4, 3]`. So the `alora_invocation_tokens` will fail to be found, despite the string "bc" being present in the input. If the start and end of the invocation string are special tokens, however, this failure case will never happen since special tokens are never tokenized into the same token with other characters.
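+
+The failure case can be reproduced with a few lines of plain Python. This is only a sketch: `toy_tokenize` is a hypothetical greedy longest-match tokenizer over the four-token vocabulary from the paragraph above, and `find_last_occurrence` is an illustrative stand-in for the search, not the code PEFT actually runs.
+
+```python
+# Toy vocabulary from the example: 'a', 'b', 'c' and 'ab' are tokens 1 to 4.
+VOCAB = {"a": 1, "b": 2, "c": 3, "ab": 4}
+
+
+def toy_tokenize(text):
+    """Greedy longest-match tokenizer over VOCAB (hypothetical, for illustration)."""
+    ids, i = [], 0
+    while i < len(text):
+        for length in (2, 1):  # try the two-character token first
+            piece = text[i : i + length]
+            if piece in VOCAB:
+                ids.append(VOCAB[piece])
+                i += len(piece)
+                break
+        else:
+            raise ValueError(f"cannot tokenize {text[i:]!r}")
+    return ids
+
+
+def find_last_occurrence(tokens, pattern):
+    """Start index of the last occurrence of `pattern` in `tokens`, or -1."""
+    for start in range(len(tokens) - len(pattern), -1, -1):
+        if tokens[start : start + len(pattern)] == pattern:
+            return start
+    return -1
+
+
+alora_invocation_tokens = toy_tokenize("bc")  # [2, 3]
+input_ids = toy_tokenize("abc")  # [4, 3], because "ab" is merged into one token
+print(find_last_occurrence(input_ids, alora_invocation_tokens))  # -1: not found
+```
+
+Even though the text "bc" occurs in "abc", the token-level search comes back empty, which is why an invocation string delimited by special tokens is the safer choice.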
+
+#### Using (and reusing) cache for generation
+The main purpose of Activated LoRA is to make KV cache interchangeable between the base model and aLoRA adapter models **prior to the invocation sequence** since base and adapted KV values are not compatible. Specifically, keys and values stored during one model generation can be used in subsequent generations to avoid expensive prefill operations for context tokens. When sharing cache between the base model and aLoRA adapters, there are 2 main patterns:
+1. The base model has generated something, and an aLoRA adapter is then called to do a followup generation. Example: the base model answers a question, and an aLoRA trained to detect hallucinations checks the base model response.
+2. An aLoRA adapter has generated something, and the base model or a different aLoRA adapter is called to do a followup generation where there is partial context overlap with the original aLoRA. Example: The user provides a query, and an aLoRA rewrites the query to be more self-contained and improve retrieval in a RAG system. Then, documents are retrieved and loaded into context, an aLoRA checks if these documents are indeed relevant to the question, and then the base model generates an answer.
+
+
+To demonstrate the above behaviors when using caching, we're using [DynamicCache](https://huggingface.co/docs/transformers/en/kv_cache) from `transformers`. Care must be taken to ensure that adapted cache values are not mixed with base cache values. In particular, an extra step is required for sharing the cache when there is partial context overlap (pattern 2).
+
+**Pattern 1: Base model followed by aLoRA** Here, the entire input and generation from the base model is input into the aLoRA adapter, along with the invocation sequence:
+```py
+from transformers import DynamicCache
+...
+cache = DynamicCache()
+inputs_base = tokenizer(prompt_base, return_tensors="pt")
+# Generate from base model and save cache
+with model_alora.disable_adapter():
+    output = model_alora.generate(
+        inputs_base["input_ids"].to(device),
+        attention_mask=inputs_base["attention_mask"].to(device),
+        past_key_values=cache,
+        return_dict_in_generate=True,
+    )
+output_text_base = tokenizer.decode(output.sequences[0])
+cache = output.past_key_values
+
+# Generate with aLoRA adapter from cache
+# (the prompt is the base model's full output followed by the invocation string)
+prompt_alora = output_text_base + INVOCATION_STRING
+inputs_alora = tokenizer(prompt_alora, return_tensors="pt").to(device)
+output = model_alora.generate(**inputs_alora, past_key_values=cache)
+output_text_alora = tokenizer.decode(output[0])
+
+# Note: cache is now tainted with adapter values and cannot be used in base model from here on!
+```
+
+**Pattern 2: aLoRA generation followed by base model (or another aLoRA) with partial context overlap** Here, we prefill the shared context using the base model, and then generate.
+
+```py
+from transformers import DynamicCache
+import copy
+...
+cache = DynamicCache()
+inputs_shared = tokenizer(prompt_shared, return_tensors="pt").to(device)
+
+# Prefill from base model and save cache
+with model_alora.disable_adapter():
+    with torch.no_grad():
+        model_alora(**inputs_shared, past_key_values=cache)
+cache_copy = copy.deepcopy(cache)
+
+# Generate from aLoRA using prefilled cache
+prompt_alora = prompt_shared + INVOCATION_STRING
+inputs_alora = tokenizer(prompt_alora, return_tensors="pt").to(device)
+output = model_alora.generate(**inputs_alora, past_key_values=cache)
+output_text_alora = tokenizer.decode(output[0])
+
+# Generate from base model using saved cache not tainted by aLoRA KV values
+prompt_base = prompt_shared
+inputs_base = tokenizer(prompt_base, return_tensors="pt").to(device)
+with model_alora.disable_adapter(): 
+    output = model_alora.generate(**inputs_base, past_key_values=cache_copy)
+output_text_base = tokenizer.decode(output[0])
+```
+
+### Weight-Decomposed Low-Rank Adaptation (DoRA)
+
+This technique decomposes the updates of the weights into two parts, magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. For more information on DoRA, see  https://huggingface.co/papers/2402.09353.
+
+```py
+from peft import LoraConfig
+
+config = LoraConfig(use_dora=True, ...)
+```
+
+If parts of the model or the DoRA adapter are offloaded to CPU you can get a significant speedup at the cost of some temporary (ephemeral) VRAM overhead by using `ephemeral_gpu_offload=True` in `config.runtime_config`.
+
+```py
+from peft import LoraConfig, LoraRuntimeConfig
+
+config = LoraConfig(use_dora=True, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=True), ...)
+```
+
+A `PeftModel` with a DoRA adapter can also be loaded with `ephemeral_gpu_offload=True` flag using the `from_pretrained` method as well as the `load_adapter` method.
+
+```py
+from peft import PeftModel
+
+model = PeftModel.from_pretrained(base_model, peft_model_id, ephemeral_gpu_offload=True)
+```
+
+DoRA is optimized (computes faster and takes less memory) for models in the evaluation mode, or when dropout is set to 0. We reuse the
+base result at those times to get the speedup.
+Running [dora finetuning](https://github.com/huggingface/peft/blob/main/examples/dora_finetuning/dora_finetuning.py)
+with `CUDA_VISIBLE_DEVICES=0 ZE_AFFINITY_MASK=0 time python examples/dora_finetuning/dora_finetuning.py --quantize --lora_dropout 0 --batch_size 16 --eval_step 2 --use_dora`
+on a 4090 with gradient accumulation set to 2 and max step to 20 resulted in the following observations:
+
+| | Without Optimization | With Optimization |
+| :--: | :--: | :--: |
+| train_runtime | 359.7298 | **279.2676** |
+| train_samples_per_second | 1.779 | **2.292** |
+| train_steps_per_second | 0.056 | **0.072** |
+
+#### Caveats
+
+- DoRA only supports embedding, linear, and Conv2d layers at the moment.
+- DoRA introduces a bigger overhead than pure LoRA, so it is recommended to merge weights for inference, see [`LoraModel.merge_and_unload`].
+- DoRA should work with weights quantized with bitsandbytes ("QDoRA"). However, issues have been reported when using QDoRA with DeepSpeed Zero2.
+
+### QLoRA-style training
+
+The default LoRA settings in PEFT add trainable weights to the query and value layers of each attention block. But [QLoRA](https://hf.co/papers/2305.14314), which adds trainable weights to all the linear layers of a transformer model, can provide performance equal to a fully finetuned model. To apply LoRA to all the linear layers, like in QLoRA, set `target_modules="all-linear"` (easier than specifying individual modules by name which can vary depending on the architecture).
+
+```py
+config = LoraConfig(target_modules="all-linear", ...)
+```
+
+### Memory efficient Layer Replication with LoRA
+
+One way to improve model performance is to expand a pretrained model by duplicating some of its layers, building a larger model from a smaller one, for example increasing a 7B model to a 10B model as described in the [SOLAR](https://huggingface.co/papers/2312.15166) paper. PEFT LoRA supports this kind of expansion in a memory-efficient manner and allows further fine-tuning with LoRA adapters attached to the layers after replication. The replicated layers do not take additional memory because they share the underlying weights, so the only additional memory required is that of the adapter weights. To use this feature, create a config with the `layer_replication` argument.
+
+```py
+config = LoraConfig(layer_replication=[[0,4], [2,5]], ...)
+```
+
+Assuming the original model had 5 layers `[0, 1, 2, 3, 4]`, this would create a model with 7 layers arranged as `[0, 1, 2, 3, 2, 3, 4]`. This follows the [mergekit](https://github.com/arcee-ai/mergekit) passthrough merge convention, where sequences of layers specified as start-inclusive and end-exclusive tuples are stacked to build the final model. Each layer in the final model gets its own distinct set of LoRA adapters.
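+
+The resulting layer order can be checked with a short sketch. The function name `expand_layer_replication` is hypothetical and only mirrors the start-inclusive, end-exclusive convention described above; it does not touch any model weights.
+
+```python
+def expand_layer_replication(layer_replication):
+    """Return the original layer indices in the order they appear in the expanded model."""
+    layers = []
+    for start, end in layer_replication:
+        layers.extend(range(start, end))  # start inclusive, end exclusive
+    return layers
+
+
+print(expand_layer_replication([[0, 4], [2, 5]]))  # [0, 1, 2, 3, 2, 3, 4]
+```
+
+Layers 2 and 3 appear twice in the output; in the expanded model those copies share the base weights but receive separate LoRA adapters.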
+
+[Fewshot-Metamath-OrcaVicuna-Mistral-10B](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B) is an example of a model trained using this method on Mistral-7B expanded to 10B. The
+[adapter_config.json](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B/blob/main/adapter_config.json) shows a sample LoRA adapter config applying this method for fine-tuning.
+
+### Fine grained control over ranks and alpha (scaling)
+
+By default, all layers targeted with LoRA will have the same rank `r` and the same `lora_alpha` (which determines the LoRA scaling), depending on what was specified in the [`LoraConfig`]. In some cases, however, you may want to indicate different values for different layers. This is possible by passing the `rank_pattern` and `alpha_pattern` arguments to [`LoraConfig`]. These arguments should be dictionaries with the key being the layer name and the value being the rank/alpha value. The keys can be [regular expressions](https://docs.python.org/3/library/re.html) (regex). All LoRA layers that are not explicitly mentioned in `rank_pattern` and `alpha_pattern` will take the default `r` and `lora_alpha` values.
+
+To give an example, let's assume that we have a model with the following structure:
+
+```python
+>>> print(model)
+Outer(
+  (foo): Linear(...)
+  (module): Middle(
+    (foo): Linear(...)
+    (foobar): Linear(...)
+    (module): Inner(
+      (foo): Linear(...)
+      (barfoo): Linear(...)
+    )
+  )
+)
+```
+
+- `rank_pattern={"foo": 42}` will match all 3 `foo` layers. Neither `foobar` nor `barfoo` are matched.
+- `rank_pattern={"^foo": 42}` will only match the `foo` layer of the model, but neither `module.foo` nor `module.module.foo`. This is because the `^` means "start of string" when using regular expressions, and only `foo` starts with `"foo"`, the other layer names have prefixes.
+- `rank_pattern={"^module.foo": 42}` matches only `module.foo`, but not `module.module.foo`, for the same reason.
+- `rank_pattern={"module.foo": 42}` matches both `module.foo` and `module.module.foo`, but not `foo`.
+- `rank_pattern={"^foo": 42, "^module.module.foo": 55}` matches `foo` and `module.module.foo`, respectively, but not `module.foo`.
+- There is no need to indicate `$` to mark the end of the match, as this is added automatically by PEFT.
+
+The same logic applies to `alpha_pattern`. If you're in doubt, don't try to get fancy with regular expressions -- just pass the full name for each module with a different rank/alpha, preceded by the `^` prefix, and you should be good.
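+
+The matching rules listed above can be tried out with the standard `re` module. The sketch below assumes that a pattern may be preceded by any dotted prefix and that `$` is appended to it; `resolve_rank` is a hypothetical helper that reproduces the documented behavior and is not necessarily how PEFT implements it internally.
+
+```python
+import re
+
+
+def resolve_rank(layer_name, rank_pattern, default_r):
+    """Return the rank for `layer_name` (illustrative sketch, not PEFT's internal code)."""
+    for pattern, rank in rank_pattern.items():
+        # An optional dotted prefix may precede the pattern; `$` anchors the end.
+        if re.match(rf"(.*\.)?({pattern})$", layer_name):
+            return rank
+    return default_r  # layers not mentioned keep the default `r`
+
+
+layer_names = ["foo", "module.foo", "module.foobar", "module.module.foo", "module.module.barfoo"]
+rank_pattern = {"^foo": 42, "^module.module.foo": 55}
+for name in layer_names:
+    print(name, resolve_rank(name, rank_pattern, default_r=8))
+```
+
+With this `rank_pattern`, only `foo` gets rank 42 and only `module.module.foo` gets rank 55; every other layer falls back to the default of 8.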
+
+### Targeting `nn.Parameter` directly
+
+> [!WARNING]
+> This feature is experimental and subject to change.
+
+Generally, you should use `target_modules` to target the module (e.g. `nn.Linear`). However, in some circumstances, this is not possible. E.g., in many mixture-of-experts (MoE) layers in HF Transformers, instead of using `nn.Linear`, an `nn.Parameter` is used. PEFT normally overwrites the `forward` method for LoRA, but for `nn.Parameter`, there is none. Therefore, to apply LoRA to that parameter, it needs to be targeted with `target_parameters`. As an example, for [Llama4](https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164), you can pass: `target_parameters=['feed_forward.experts.gate_up_proj', 'feed_forward.experts.down_proj']`.
+
+#### Caveats
+
+- At the moment, this argument allows targeting 2-dim or 3-dim `nn.Parameter`s. It is assumed that in the case of a 3-dim parameter, the 0th dimension is the expert dimension.
+- It is currently not possible to add multiple LoRA adapters (via `model.add_adapter` or `model.load_adapter`) that use `target_parameters` at the same time.
+
+## Optimizers
+
+LoRA training can optionally include special purpose optimizers. Currently PEFT supports LoRA-FA and LoRA+.
+
+### LoRA-FA Optimizer
+
+LoRA training can be more effective and efficient using LoRA-FA, as described in [LoRA-FA](https://huggingface.co/papers/2308.03303). LoRA-FA reduces activation memory consumption by fixing the matrix A and only tuning the matrix B. During training, the gradient of B is optimized to approximate the full parameter fine-tuning gradient. Moreover, the memory consumption of LoRA-FA is not sensitive to the rank (since it erases the activation of $A$), therefore it can improve performance by enlarging lora rank without increasing memory consumption.
+
+```py
+from peft import LoraConfig, get_peft_model
+from peft.optimizers import create_lorafa_optimizer
+from transformers import AutoModelForCausalLM, Trainer, get_cosine_schedule_with_warmup
+
+base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
+
+config = LoraConfig(...)
+model = get_peft_model(base_model, config)
+
+optimizer = create_lorafa_optimizer(
+    model=model,
+    r=128,
+    lora_alpha=32,
+    lr=7e-5,
+)
+
+scheduler = get_cosine_schedule_with_warmup(
+    optimizer,
+    num_warmup_steps=100,
+    num_training_steps=1000,
+)
+
+trainer = Trainer(
+    ...,
+    optimizers=(optimizer, scheduler),
+)
+```
+
+### LoRA+ optimized LoRA
+
+LoRA training can be optimized using [LoRA+](https://huggingface.co/papers/2402.12354), which uses different learning rates for the adapter matrices A and B, shown to increase finetuning speed by up to 2x and performance by 1-2%.
+
+```py
+from peft import LoraConfig, get_peft_model
+from peft.optimizers import create_loraplus_optimizer
+from transformers import Trainer
+import bitsandbytes as bnb
+
+base_model = ...
+config = LoraConfig(...)
+model = get_peft_model(base_model, config)
+
+optimizer = create_loraplus_optimizer(
+    model=model,
+    optimizer_cls=bnb.optim.Adam8bit,
+    lr=5e-5,
+    loraplus_lr_ratio=16,
+)
+scheduler = None
+
+...
+trainer = Trainer(
+    ...,
+    optimizers=(optimizer, scheduler),
+)
+```
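+
+To illustrate the idea of separate learning rates, the sketch below assigns a rate to each parameter by name. It assumes that `loraplus_lr_ratio` is the ratio of the learning rate of B to that of A, as in the LoRA+ paper; the helper `loraplus_learning_rates` and the parameter names are hypothetical, and the real optimizer factory builds more fine-grained parameter groups than shown here.
+
+```python
+def loraplus_learning_rates(param_names, lr, loraplus_lr_ratio):
+    """Map each parameter name to its learning rate (simplified illustration)."""
+    rates = {}
+    for name in param_names:
+        if "lora_B" in name:
+            rates[name] = lr * loraplus_lr_ratio  # B matrices learn faster
+        else:
+            rates[name] = lr  # A matrices keep the base learning rate
+    return rates
+
+
+# Hypothetical parameter names, in the style of a LoRA-wrapped attention layer
+names = ["q_proj.lora_A.default.weight", "q_proj.lora_B.default.weight"]
+for name, rate in loraplus_learning_rates(names, lr=5e-5, loraplus_lr_ratio=16).items():
+    print(name, rate)
+```
+
+With the values from the example above (`lr=5e-5`, `loraplus_lr_ratio=16`), the A matrix is updated with a learning rate of 5e-5 and the B matrix with 8e-4.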
+
+## Efficiently train tokens alongside LoRA
+
+Sometimes it is necessary to not only change some layer's weights but to add new tokens as well. With larger models this can be a memory-costly endeavour. PEFT LoRA adapters support the `trainable_token_indices` parameter which allows tuning of other tokens alongside fine-tuning of specific layers with LoRA. This method only trains the tokens you specify and leaves all other tokens untouched. This saves memory and doesn't throw away learned context of existing token embeddings in contrast to when training the whole embedding matrix. Under the hood this method uses the layer of [`TrainableTokensModel`].
+
+```py
+# for layer 'embed_tokens'
+config = LoraConfig(trainable_token_indices=[idx_1, idx_2, ...], ...)
+
+# specific embedding layer
+config = LoraConfig(trainable_token_indices={'emb_tokens': [idx_1, idx_2, ...]}, ...)
+```
+
+In the snippet below we show how to add new tokens to the model and how to train it alongside the other layers in the model.
+
+```py
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import get_peft_model, LoraConfig
+
+base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
+
+# we define our new tokens and add them to the tokenizer as special tokens
+special_tokens = ['<|start_think|>', '<|stop_think|>']
+tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
+
+# make room for new tokens in the embedding matrix if it isn't big enough already
+base_model.resize_token_embeddings(max(len(tokenizer), base_model.model.embed_tokens.num_embeddings))
+
+# typical LoRA config with `trainable_token_indices` targeting embedding layer `embed_tokens`
+# and specifically our new tokens we just added
+lora_config = LoraConfig(
+    target_modules='all-linear',
+    trainable_token_indices={'embed_tokens': tokenizer.convert_tokens_to_ids(special_tokens)},
+)
+peft_model = get_peft_model(base_model, lora_config)
+
+# proceed to train the model like normal
+[...]
+```
+
+The token weights are part of your adapter state dict and saved alongside the LoRA weights.
+Had we used full fine-tuning with `modules_to_save=['embed_tokens']`, we would have stored the full embedding matrix in the checkpoint, leading to a much bigger file.
+
+To give a bit of an indication how much VRAM can be saved, a rudimentary comparison of the above example was made between training the embedding matrix fully (`modules_to_save=["embed_tokens"]`), using a LoRA for the embedding matrix (`target_modules=[..., "embed_tokens"]`, rank 32) and trainable tokens (`trainable_token_indices=[...]`, 6 tokens). Trainable tokens used about as much VRAM (15,562MB vs. 15,581MB) as LoRA while being specific to the tokens and saved ~1GB of VRAM over fully training the embedding matrix.
+
+
+## Merge LoRA weights into the base model
+
+While LoRA is significantly smaller and faster to train, you may encounter latency issues during inference due to separately loading the base model and the LoRA adapter. To eliminate latency, use the [`~LoraModel.merge_and_unload`] function to merge the adapter weights with the base model. This allows you to use the newly merged model as a standalone model. The [`~LoraModel.merge_and_unload`] function doesn't keep the adapter weights in memory.
+
+Below is a diagram that explains the intuition of LoRA adapter merging:
+
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png"/>
+</div>
+
+We show in the snippets below how to run that using PEFT.
+
+```py
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+
+base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
+model = PeftModel.from_pretrained(base_model, peft_model_id)
+# merge_and_unload returns the base model with the adapter weights merged in
+merged_model = model.merge_and_unload()
+```
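+
+To see why merging removes the extra inference cost without changing the output, here is a tiny numeric sketch in plain Python. It assumes the usual LoRA formulation, where the adapted layer computes `W x + (lora_alpha / r) * B A x`; the matrices and the helpers `matmul` and `add` are made up for illustration.
+
+```python
+def matmul(X, Y):
+    """Multiply two matrices given as nested lists."""
+    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]
+
+
+def add(X, Y, scale=1.0):
+    """Return X + scale * Y elementwise."""
+    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
+
+
+W = [[1.0, 2.0], [3.0, 4.0]]  # base weight, shape (out=2, in=2)
+B = [[1.0], [0.5]]  # LoRA B, shape (out=2, r=1)
+A = [[2.0, -1.0]]  # LoRA A, shape (r=1, in=2)
+scaling = 2 / 1  # lora_alpha / r
+x = [[1.0], [1.0]]  # input as a column vector
+
+# Unmerged: base output plus the scaled adapter output (two extra matmuls per call)
+unmerged = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)
+
+# Merged: fold the adapter into the weight once, then run a single matmul
+W_merged = add(W, matmul(B, A), scale=scaling)
+merged = matmul(W_merged, x)
+
+print(unmerged, merged)  # both are [[5.0], [8.0]]
+```
+
+After the merge, the adapter matrices are no longer needed for the forward pass, which is why the merged model can be used as a standalone model.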
+
+If you need to keep a copy of the weights so you can unmerge the adapter later or delete and load different ones, you should use the [`~LoraModel.merge_adapter`] function instead. Now you have the option to use [`~LoraModel.unmerge_adapter`] to return the base model.
+
+```py
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+
+base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
+model = PeftModel.from_pretrained(base_model, peft_model_id)
+model.merge_adapter()
+
+# unmerge the LoRA layers from the base model
+model.unmerge_adapter()
+```
+
+The [`~LoraModel.add_weighted_adapter`] function is useful for merging multiple LoRAs into a new adapter based on a user provided weighting scheme in the `weights` parameter. Below is an end-to-end example.
+
+First load the base model:
+
+```python
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+import torch
+
+base_model = AutoModelForCausalLM.from_pretrained(
+    "mistralai/Mistral-7B-v0.1", dtype=torch.float16, device_map="auto"
+)
+```
+
+Then we load the first adapter:
+
+```python
+peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
+model = PeftModel.from_pretrained(base_model, peft_model_id, adapter_name="sft")
+```
+
+Then load a different adapter and merge it with the first one:
+
+```python
+weighted_adapter_name = "sft-dpo"
+model.load_adapter("alignment-handbook/zephyr-7b-dpo-lora", adapter_name="dpo")
+model.add_weighted_adapter(
+    adapters=["sft", "dpo"],
+    weights=[0.7, 0.3],
+    adapter_name=weighted_adapter_name,
+    combination_type="linear"
+)
+model.set_adapter(weighted_adapter_name)
+```
+
+> [!TIP]
+> There are several supported methods for `combination_type`. Refer to the [documentation](../package_reference/lora#peft.LoraModel.add_weighted_adapter) for more details. Note that "svd" as the `combination_type` is not supported when using `torch.float16` or `torch.bfloat16` as the datatype.
+
+Now, perform inference:
+
+```python
+from transformers import AutoTokenizer
+
+device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
+
+tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
+
+prompt = "Hey, are you conscious? Can you talk to me?"
+inputs = tokenizer(prompt, return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+
+with torch.no_grad():
+    generate_ids = model.generate(**inputs, max_length=30)
+outputs = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+print(outputs)
+```
+
+## Load adapters
+
+Adapters can be loaded onto a pretrained model with [`~PeftModel.load_adapter`], which is useful for trying out different adapters whose weights aren't merged. Set the active adapter weights with the [`~LoraModel.set_adapter`] function.
+
+```py
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+
+base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+peft_model_id = "alignment-handbook/zephyr-7b-sft-lora"
+model = PeftModel.from_pretrained(base_model, peft_model_id)
+
+# load different adapter
+model.load_adapter("alignment-handbook/zephyr-7b-dpo-lora", adapter_name="dpo")
+
+# set adapter as active
+model.set_adapter("dpo")
+```
+
+To return the base model, you could use [`~LoraModel.unload`] to unload all of the LoRA modules or [`~LoraModel.delete_adapter`] to delete the adapter entirely.
+
+```py
+# unload adapter
+model.unload()
+
+# delete adapter
+model.delete_adapter("dpo")
+```
+
+## Inference with different LoRA adapters in the same batch
+
+Normally, each inference batch has to use the same adapter(s) in PEFT. This can be inconvenient, because we may have batches that contain samples intended for different LoRA adapters. For example, we could have a base model that works well in English and two more LoRA adapters, one for French and one for German. Usually, we would have to split our batches so that each batch only contains samples of one language; we cannot combine different languages in the same batch.
+
+Thankfully, it is possible to mix different LoRA adapters in the same batch using the `adapter_names` argument. Below, we show an example of how this works in practice. First, let's load the base model, which handles English, and the two adapters for French and German:
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel
+
+model_id = ...
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+model = AutoModelForCausalLM.from_pretrained(model_id)
+# load the LoRA adapter for French
+peft_model = PeftModel.from_pretrained(model, <path>, adapter_name="adapter_fr")
+# next, load the LoRA adapter for German
+peft_model.load_adapter(<path>, adapter_name="adapter_de")
+```
+
+Now, we want to generate text on a batch that contains all three languages: the first three samples are in English, the next three are in French, and the last three are in German. We can use the `adapter_names` argument to specify which adapter to use for each sample. Since our base model is used for English, we use the special string `"__base__"` for these samples. For the next three samples, we indicate the adapter name of the French LoRA fine-tune, in this case `"adapter_fr"`. For the last three samples, we indicate the adapter name of the German LoRA fine-tune, in this case `"adapter_de"`. This way, we can use the base model and the two adapters in a single batch.
+
+```python
+inputs = tokenizer(
+    [
+        "Hello, my dog is cute",
+        "Hello, my cat is awesome",
+        "Hello, my fish is great",
+        "Salut, mon chien est mignon",
+        "Salut, mon chat est génial",
+        "Salut, mon poisson est super",
+        "Hallo, mein Hund ist süß",
+        "Hallo, meine Katze ist toll",
+        "Hallo, mein Fisch ist großartig",
+    ],
+    return_tensors="pt",
+    padding=True,
+)
+
+adapter_names = [
+    "__base__", "__base__", "__base__",
+    "adapter_fr", "adapter_fr", "adapter_fr",
+    "adapter_de", "adapter_de", "adapter_de",
+]
+output = peft_model.generate(**inputs, adapter_names=adapter_names, max_new_tokens=20)
+```
+
+Note that the order does not matter here, i.e. the samples in the batch don't need to be grouped by adapter as in the example above. We just need to ensure that the `adapter_names` argument is aligned correctly with the samples.
+
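To see why only the alignment between samples and `adapter_names` matters, here is a conceptual sketch of how a mixed batch could be processed: split it into one sub-batch per adapter, run each sub-batch, and put the results back in their original positions. This is not PEFT's implementation. `run_mixed_batch` and `forward_fns` are hypothetical names, and the lambdas merely stand in for real model calls.

```python
from collections import defaultdict

def run_mixed_batch(samples, adapter_names, forward_fns):
    """Run each sample through the callable registered for its adapter name.

    `forward_fns` is a hypothetical mapping from adapter name to a callable
    that takes a list of samples and returns one output per sample.
    """
    if len(samples) != len(adapter_names):
        raise ValueError("adapter_names must have one entry per sample")

    # Collect the batch positions that belong to each adapter.
    positions = defaultdict(list)
    for index, name in enumerate(adapter_names):
        positions[name].append(index)

    outputs = [None] * len(samples)
    for name, indices in positions.items():
        sub_batch = [samples[i] for i in indices]
        sub_outputs = forward_fns[name](sub_batch)
        # Scatter the results back to their original positions.
        for i, out in zip(indices, sub_outputs):
            outputs[i] = out
    return outputs

# Stand-ins for the base model and the two adapters.
forward_fns = {
    "__base__": lambda batch: [f"en:{s}" for s in batch],
    "adapter_fr": lambda batch: [f"fr:{s}" for s in batch],
    "adapter_de": lambda batch: [f"de:{s}" for s in batch],
}
samples = ["Hallo", "Hello", "Salut", "Hi"]
names = ["adapter_de", "__base__", "adapter_fr", "__base__"]
result = run_mixed_batch(samples, names, forward_fns)
print(result)
```

Each adapter only sees its own sub-batch, so the more distinct adapters a batch contains, the smaller each sub-batch becomes.
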
+Additionally, the same approach also works with the `modules_to_save` feature, which allows for saving and reusing specific neural network layers, such as custom heads for classification tasks, across different LoRA adapters.
+
+### Caveats
+
+Using this feature has some drawbacks, namely:
+
+- It only works for inference, not for training.
+- Disabling adapters using the `with model.disable_adapter()` context takes precedence over `adapter_names`.
+- You cannot pass `adapter_names` when some adapter weights were merged with the base weights using the `merge_adapter` method. Please unmerge all adapters first by calling `model.unmerge_adapter()`.
+- For obvious reasons, this cannot be used after calling `merge_and_unload()`, since all the LoRA adapters will be merged into the base weights in this case.
+- This feature does not currently work with DoRA, so set `use_dora=False` in your `LoraConfig` if you want to use it.
+- The `modules_to_save` feature is currently only supported for the layers of types `Linear`, `Embedding`, `Conv2d` and `Conv1d`.
+- There is an expected overhead for inference with `adapter_names`, especially if the number of different adapters in the batch is high. This is because the batch size is effectively reduced to the number of samples per adapter. If runtime performance is your top priority, try the following:
+  - Increase the batch size.
+  - Try to avoid having a large number of different adapters in the same batch and prefer homogeneous batches. This can be achieved by buffering samples with the same adapter and only performing inference with a small handful of different adapters.
+  - Take a look at alternative implementations such as [LoRAX](https://github.com/predibase/lorax), [punica](https://github.com/punica-ai/punica), or [S-LoRA](https://github.com/S-LoRA/S-LoRA), which are specialized to work with a large number of different adapters.
+
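The buffering idea from the list above could look roughly like the sketch below. `AdapterBuffer` is a hypothetical helper, not part of PEFT: it queues incoming samples per adapter name and hands back a batch only once enough samples for the same adapter have accumulated.

```python
from collections import defaultdict

class AdapterBuffer:
    """Collect samples per adapter and release homogeneous batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._pending = defaultdict(list)

    def add(self, adapter_name, sample):
        """Queue a sample; return (adapter_name, batch) once a batch is full."""
        queue = self._pending[adapter_name]
        queue.append(sample)
        if len(queue) >= self.batch_size:
            self._pending[adapter_name] = []
            return adapter_name, queue
        return None

    def flush(self):
        """Return all partially filled batches, e.g. on shutdown or a timeout."""
        leftovers = [(name, batch) for name, batch in self._pending.items() if batch]
        self._pending.clear()
        return leftovers

buffer = AdapterBuffer(batch_size=2)
ready = []
requests = [("adapter_fr", "Salut"), ("adapter_de", "Hallo"), ("adapter_fr", "Bonjour")]
for name, text in requests:
    batch = buffer.add(name, text)
    if batch is not None:
        ready.append(batch)

leftovers = buffer.flush()
print(ready, leftovers)
```

The trade-off is latency: a sample for a rarely used adapter waits until its batch fills up, so a real serving loop would likely call `flush()` on a timer as well.
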
+## Composing and Reusing LoRA Adapters
+### Arrow
+[Arrow](https://huggingface.co/papers/2405.11157) is a modular routing algorithm designed to combine multiple pre-trained task-specific LoRA adapters to solve a given task. Rather than merging all adapters naively, Arrow introduces a **gradient-free, token-wise mixture-of-experts (MoE) routing mechanism**. At inference time, it first computes a _prototype_ for each LoRA by extracting the top right singular vector from its SVD decomposition. Each token representation is then compared to these prototypes via cosine similarity to obtain routing coefficients. Tokens are assigned to the top-k most relevant LoRA adapters, with the coefficients normalized through softmax, and their outputs linearly combined. This allows effective reuse of existing LoRA modules for new tasks and leads to stronger zero-shot generalization.
+
+In PEFT, Arrow is enabled through ```ArrowConfig``` and ```create_arrow_model```. You can also configure parameters such as ```top_k``` (the number of LoRA adapters combined per token), ```router_temperature``` (the softmax temperature applied to the routing coefficients), and ```rng_seed``` (for reproducibility). 
+
+```py
+from peft import create_arrow_model, ArrowConfig
+from transformers import AutoModelForCausalLM
+
+# Loading the model
+base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
+
+# Creating the Arrow config
+arrow_config = ArrowConfig(
+    top_k=3,
+    router_temperature=1.0,
+    rng_seed=42,
+)
+
+# The LoRA adapters below were trained on a clustered FLAN dataset.
+# Task clustering was performed using the Model-Based Clustering (MBC) method,
+# as described in the Arrow paper.
+# While one could train a separate LoRA for each task and let Arrow route tokens among them,
+# training LoRAs on clusters of tasks instead provides an indirect optimization for
+# transfer across the multi-task dataset.
+task_specific_adapter_paths = [
+        f"TahaBa/phi3-mini-clustered-flan/ts_expert_{i}" for i in range(10)
+    ]
+
+# Creating the Arrow model
+model = create_arrow_model(
+        base_model=base_model,
+        task_specific_adapter_paths=task_specific_adapter_paths,
+        arrow_config=arrow_config,
+    )
+
+# Now the forward pass can be called on this model, as with a normal PeftModel.
+```
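The token-wise routing step described at the start of this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the PEFT implementation: the prototypes are taken as given instead of being computed via SVD, plain cosine similarity is used as in the description above, and `route_token` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def route_token(token, prototypes, top_k, temperature=1.0):
    """Return {adapter_index: weight} for the top_k best-matching prototypes."""
    scores = [cosine(token, p) for p in prototypes]
    # Indices of the top_k highest scores.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores only, with a temperature.
    exps = [math.exp(scores[i] / temperature) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

# One prototype per adapter (made-up 2-dimensional values).
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = route_token([2.0, 1.0], prototypes, top_k=2)
print(weights)
```

The token's final adapter contribution would then be the sum of each selected adapter's output multiplied by its weight. A lower `temperature` concentrates the weight on the best match, a higher one spreads it more evenly.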
+
+Furthermore, you can add or remove adapters after calling ```create_arrow_model```—for example, to fine-tune a new adapter or discard an unnecessary one. Once the adapters are in place, you can activate the ```"arrow_router"``` for inference to use Arrow. Note that if you add a new LoRA adapter after ```create_arrow_model``` and want to fine-tune it, you must explicitly set the new adapter as active, since ```"arrow_router"``` is activated by default in ```create_arrow_model```.
+
+```py
+from trl import SFTTrainer, SFTConfig
+
+# Adding a new adapter and activating it
+model.add_adapter(adapter_name='new_adapter')
+model.set_adapter('new_adapter')
+
+# Now the model can be trained with `new_adapter` as the active adapter.
+trainer = SFTTrainer(
+        model=model,
+        args=SFTConfig(...),
+        ...
+    )
+
+# Once the training is done, you can activate `arrow_router` and use it in inference
+model.set_adapter('arrow_router')    # Model is ready to be used at inference time now
+```
+
+### GenKnowSub
+[GenKnowSub](https://aclanthology.org/2025.acl-short.54/) augments Arrow by purifying task-specific LoRA adapters before routing. The key idea is to subtract general knowledge encoded in LoRA space—based on the [forgetting-via-negation principle](https://huggingface.co/papers/2212.04089)—so that task adapters become more isolated and focused on task-relevant signals. Concretely, GenKnowSub estimates a low-dimensional “general” subspace from a set of general (non task-specific) LoRA adapters and removes this component from each task adapter’s LoRA update prior to Arrow’s token-wise routing. This typically improves compositionality and reduces interference when combining many task adapters.
+
+In PEFT, enable GenKnowSub by setting ```use_gks=True``` in ArrowConfig, and providing ```general_adapter_paths``` in ```create_arrow_model```:
+
+```py
+from peft import create_arrow_model, ArrowConfig
+from transformers import AutoModelForCausalLM
+
+# Loading the model
+base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
+
+# Creating the Arrow config
+arrow_config = ArrowConfig(
+    top_k=3,
+    router_temperature=1.0,
+    use_gks=True,
+    rng_seed=42,
+)
+
+# Paths to the task-specific adapters, trained on the clustered FLAN dataset (as explained above).
+task_specific_adapter_paths = [
+        f"TahaBa/phi3-mini-clustered-flan/ts_expert_{i}" for i in range(10)
+    ]
+# These general adapters are trained on English, German, and French Wikipedia data
+# with a causal language modelling objective. Each example is a pair
+# (507-token sentence, 5-token completion), and the loss is computed on the completion.
+general_adapter_paths = [
+        "TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langen/checkpoint-17",
+        "TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langfr/checkpoint-35",
+        "TahaBa/phi3-mini-general-adapters/cluster0_batch16_prop1.0_langger/checkpoint-17"
+    ]
+
+# Creating the Arrow model
+model = create_arrow_model(
+        base_model=base_model,
+        task_specific_adapter_paths=task_specific_adapter_paths,
+        general_adapter_paths=general_adapter_paths,
+        arrow_config=arrow_config,
+    )
+
+# Now the forward pass can be called on this model, as with a normal PeftModel.
+```
+To encode general knowledge, GenKnowSub subtracts the average of the provided general adapters from each task-specific adapter once, before routing begins. Furthermore, the ability to add or remove adapters after calling ```create_arrow_model``` (as described in the Arrow section) is still supported in this case.
+
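The subtraction step can be sketched on small matrices as shown below. This only illustrates "subtract the average general adapter" on generic update matrices; which tensors the real implementation averages (the LoRA factors or their product) is not specified here, and `mean_matrix` and `subtract` are hypothetical helpers.

```python
def mean_matrix(matrices):
    """Element-wise average of equally shaped matrices (lists of rows)."""
    count = len(matrices)
    return [
        [sum(values) / count for values in zip(*rows)]
        for rows in zip(*matrices)
    ]

def subtract(a, b):
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Made-up updates from two general adapters and two task adapters.
general_updates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 1.0]],
]
task_updates = [
    [[2.5, 1.0], [0.0, 1.0]],
    [[2.0, 0.0], [4.0, 1.0]],
]

# Estimate the shared "general" component once ...
general_mean = mean_matrix(general_updates)
# ... and remove it from every task adapter before routing.
purified = [subtract(task, general_mean) for task in task_updates]
print(purified)
```

After the subtraction, only the entries where a task adapter differs from the general average remain non-zero, which is the sense in which the adapters become more task-focused.
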
+> [!TIP]
+> **Things to keep in mind when using Arrow + GenKnowSub:**
+>
+> - All LoRA adapters (task-specific and general) must share the same ```rank``` and ```target_modules```.
+>
+> - Any inconsistency in these settings will raise an error in ```create_arrow_model```.
+>
+> - Having different scaling factors (```lora_alpha```) across task adapters is supported — Arrow handles them automatically.
+>
+> - Merging the ```"arrow_router"``` is not supported, due to its dynamic routing behavior.
+>
+> - In create_arrow_model, task adapters are loaded as ```task_i``` and general adapters as ```gks_j``` (where ```i``` and ```j``` are indices). The function ensures consistency of ```target_modules```, ```rank```, and whether adapters are applied to ```Linear``` or ```Linear4bit``` layers. It then adds the ```"arrow_router"``` module and activates it. Any customization of this process requires overriding ```create_arrow_model```.
+>
+> - This implementation is compatible with 4-bit quantization (via bitsandbytes):
+>
+>     ```py
+>     from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+>     import torch
+>
+>     # Quantisation config
+>     bnb_config = BitsAndBytesConfig(
+>             load_in_4bit=True,
+>             bnb_4bit_quant_type="nf4",
+>             bnb_4bit_compute_dtype=torch.bfloat16,
+>             bnb_4bit_use_double_quant=False,
+>         )
+>
+>     # Loading the model
+>     base_model = AutoModelForCausalLM.from_pretrained(
+>         "microsoft/Phi-3-mini-4k-instruct",
+>         dtype=torch.bfloat16,
+>         device_map="auto",
+>         quantization_config=bnb_config,
+>     )
+>
+>     # Now call create_arrow_model() as we explained before.
+>     ```
--- a/docs/source/developer_guides/low_level_api.md
+++ b/docs/source/developer_guides/low_level_api.md
@ -0,0 +1,148 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Adapter injection
+
+With PEFT, you can inject trainable adapters into any `torch` module which allows you to use adapter methods without relying on the modeling classes in PEFT. This works for all adapters except for those based on prompt learning (e.g. prefix tuning or p-tuning).
+
+Check the table below to see when you should inject adapters.
+
+| Pros | Cons |
+|---|---|
+| the model is modified inplace, keeping all the original attributes and methods | manually write the `from_pretrained` and `save_pretrained` utility functions from Hugging Face to save and load adapters |
+| works for any `torch` module and modality | doesn't work with any of the utility methods provided by `PeftModel` such as disabling and merging adapters |
+
+## Creating a new PEFT model
+
+To perform the adapter injection, use the [`inject_adapter_in_model`] method. This method takes three arguments: the PEFT config, the model, and an optional adapter name. You can also attach multiple adapters to the model if you call [`inject_adapter_in_model`] multiple times with different adapter names.
+
+For example, to inject LoRA adapters into the `linear` submodule of the `DummyModel` module:
+
+```python
+import torch
+from peft import inject_adapter_in_model, LoraConfig
+
+class DummyModel(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.embedding = torch.nn.Embedding(10, 10)
+        self.linear = torch.nn.Linear(10, 10)
+        self.lm_head = torch.nn.Linear(10, 10)
+
+    def forward(self, input_ids):
+        x = self.embedding(input_ids)
+        x = self.linear(x)
+        x = self.lm_head(x)
+        return x
+
+
+lora_config = LoraConfig(
+    lora_alpha=16,
+    lora_dropout=0.1,
+    r=64,
+    bias="none",
+    target_modules=["linear"],
+)
+
+model = DummyModel()
+model = inject_adapter_in_model(lora_config, model)
+
+dummy_inputs = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7]])
+dummy_outputs = model(dummy_inputs)
+```
+
+Print the model to see that the adapters have been correctly injected.
+
+```bash
+DummyModel(
+  (embedding): Embedding(10, 10)
+  (linear): Linear(
+    in_features=10, out_features=10, bias=True
+    (lora_dropout): ModuleDict(
+      (default): Dropout(p=0.1, inplace=False)
+    )
+    (lora_A): ModuleDict(
+      (default): Linear(in_features=10, out_features=64, bias=False)
+    )
+    (lora_B): ModuleDict(
+      (default): Linear(in_features=64, out_features=10, bias=False)
+    )
+    (lora_embedding_A): ParameterDict()
+    (lora_embedding_B): ParameterDict()
+  )
+  (lm_head): Linear(in_features=10, out_features=10, bias=True)
+)
+```
+
+### Injection based on a `state_dict`
+
+Sometimes you have a PEFT adapter checkpoint but the corresponding PEFT config is not known. To inject the PEFT layers for this checkpoint, you would usually have to reverse-engineer the corresponding PEFT config, most notably the `target_modules` argument, based on the `state_dict` from the checkpoint. This can be cumbersome and error-prone. To avoid this, it is also possible to call [`inject_adapter_in_model`] and pass the loaded `state_dict` as an argument:
+
+```python
+from safetensors.torch import load_file
+
+model = ...
+state_dict = load_file(<path-to-safetensors-file>)
+lora_config = LoraConfig(...)
+model = inject_adapter_in_model(lora_config, model, state_dict=state_dict)
+```
+
+In this case, PEFT will use the `state_dict` as reference for which layers to target instead of using the PEFT config. As a user, you don't have to set the exact `target_modules` of the PEFT config for this to work. However, you should still pass a PEFT config of the right type (`LoraConfig` in this example); you can leave `target_modules` as `None`.
+
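To give a feel for the manual reverse-engineering that passing the `state_dict` avoids, here is a rough sketch that recovers the targeted module paths from checkpoint keys. It assumes keys shaped like `<module path>.lora_A.weight`; real checkpoints may carry additional prefixes, and `infer_target_modules` is a hypothetical helper, not a PEFT function.

```python
def infer_target_modules(state_dict_keys, marker=".lora_"):
    """Return the sorted module paths that carry adapter weights.

    Assumes keys shaped like "<module path>.lora_A.weight"; the module path
    is everything before the first ".lora_" marker.
    """
    modules = set()
    for key in state_dict_keys:
        if marker in key:
            modules.add(key.split(marker, 1)[0])
    return sorted(modules)

# Made-up keys in the assumed layout.
keys = [
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.layers.0.self_attn.q_proj.lora_B.weight",
    "model.layers.0.self_attn.v_proj.lora_A.weight",
    "model.layers.0.self_attn.v_proj.lora_B.weight",
]
targets = infer_target_modules(keys)
print(targets)
```

Doing this by hand for every checkpoint layout is exactly the kind of fragile bookkeeping that the `state_dict` argument is meant to spare you.
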
+Be aware that this still only creates the uninitialized PEFT layers; the values from the `state_dict` are not used to populate the model weights. To populate the weights, proceed with calling [`set_peft_model_state_dict`] as described below.
+
+⚠️ Note that if there is a mismatch between what is configured in the PEFT config and what is found in the `state_dict`, PEFT will warn you about this. You can ignore the warning if you know that the PEFT config is not correctly specified.
+
+> [!WARNING]
+> If the original PEFT adapter was using `target_parameters` instead of `target_modules`, injecting from a `state_dict` will not work correctly. In this case, it is mandatory to use the correct PEFT config for injection.
+
+## Saving the model
+
+To only save the adapter, use the [`get_peft_model_state_dict`] function:
+
+```python
+from peft import get_peft_model_state_dict
+
+peft_state_dict = get_peft_model_state_dict(model)
+print(peft_state_dict)
+```
+
+Otherwise, `model.state_dict()` returns the full state dict of the model.
+
+## Loading the model
+
+After loading the saved `state_dict`, it can be applied using the [`set_peft_model_state_dict`] function:
+
+```python
+from peft import set_peft_model_state_dict
+
+model = DummyModel()
+model = inject_adapter_in_model(lora_config, model)
+outcome = set_peft_model_state_dict(model, peft_state_dict)
+# check that there were no wrong keys
+print(outcome.unexpected_keys)
+```
+
+If injecting the adapter is slow or you need to load a large number of adapters, you can use an optimization that creates an "empty" adapter on the meta device and only fills in the real weights when [`set_peft_model_state_dict`] is called. To do this, pass `low_cpu_mem_usage=True` to both [`inject_adapter_in_model`] and [`set_peft_model_state_dict`].
+
+```python
+model = DummyModel()
+model = inject_adapter_in_model(lora_config, model, low_cpu_mem_usage=True)
+
+print(model.linear.lora_A["default"].weight.device.type == "meta")  # should be True
+set_peft_model_state_dict(model, peft_state_dict, low_cpu_mem_usage=True)
+print(model.linear.lora_A["default"].weight.device.type == "cpu")  # should be True
+```
--- a/docs/source/developer_guides/low_level_api.mdx
+++ b/docs/source/developer_guides/low_level_api.mdx
@ -1,103 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# PEFT as a utility library
-
-Let's cover in this section how you can leverage PEFT's low level API to inject trainable adapters into any `torch` module. 
-The development of this API has been motivated by the need for super users to not rely on modeling classes that are exposed in PEFT library and still be able to use adapter methods such as LoRA, IA3 and AdaLoRA.
-
-## Supported tuner types
-
-Currently the supported adapter types are the 'injectable' adapters, meaning adapters where an inplace modification of the model is sufficient to correctly perform the fine tuning. As such, only [LoRA](./conceptual_guides/lora), AdaLoRA and [IA3](./conceptual_guides/ia3) are currently supported in this API.
-
-## `inject_adapter_in_model` method 
-
-To perform the adapter injection, simply use `inject_adapter_in_model` method that takes 3 arguments, the PEFT config and the model itself and an optional adapter name. You can also attach multiple adapters in the model if you call multiple times `inject_adapter_in_model` with different adapter names.
-
-Below is a basic example usage of how to inject LoRA adapters into the submodule `linear` of the module `DummyModel`.
-```python
-import torch
-from peft import inject_adapter_in_model, LoraConfig
-
-
-class DummyModel(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.embedding = torch.nn.Embedding(10, 10)
-        self.linear = torch.nn.Linear(10, 10)
-        self.lm_head = torch.nn.Linear(10, 10)
-
-    def forward(self, input_ids):
-        x = self.embedding(input_ids)
-        x = self.linear(x)
-        x = self.lm_head(x)
-        return x
-
-
-lora_config = LoraConfig(
-    lora_alpha=16,
-    lora_dropout=0.1,
-    r=64,
-    bias="none",
-    target_modules=["linear"],
-)
-
-model = DummyModel()
-model = inject_adapter_in_model(lora_config, model)
-
-dummy_inputs = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7]])
-dummy_outputs = model(dummy_inputs)
-```
-
-If you print the model, you will notice that the adapters have been correctly injected into the model
-
-```bash
-DummyModel(
-  (embedding): Embedding(10, 10)
-  (linear): Linear(
-    in_features=10, out_features=10, bias=True
-    (lora_dropout): ModuleDict(
-      (default): Dropout(p=0.1, inplace=False)
-    )
-    (lora_A): ModuleDict(
-      (default): Linear(in_features=10, out_features=64, bias=False)
-    )
-    (lora_B): ModuleDict(
-      (default): Linear(in_features=64, out_features=10, bias=False)
-    )
-    (lora_embedding_A): ParameterDict()
-    (lora_embedding_B): ParameterDict()
-  )
-  (lm_head): Linear(in_features=10, out_features=10, bias=True)
-)
-```
-Note that it should be up to users to properly take care of saving the adapters (in case they want to save adapters only), as `model.state_dict()` will return the full state dict of the model.
-In case you want to extract the adapters state dict you can use the `get_peft_model_state_dict` method:
-
-```python
-from peft import get_peft_model_state_dict
-
-peft_state_dict = get_peft_model_state_dict(model)
-print(peft_state_dict)
-```
-
-## Pros and cons 
-
-When to use this API and when to not use it? Let's discuss in this section the pros and cons 
-
-Pros:
- The model gets modified in-place, meaning the model will preserve all its original attributes and methods
- Works for any torch module, and any modality (vision, text, multi-modal)
-
-Cons:
- You need to manually writing Hugging Face `from_pretrained` and `save_pretrained` utility methods if you want to easily save / load adapters from the Hugging Face Hub.
- You cannot use any of the utility method provided by `PeftModel` such as disabling adapters, merging adapters, etc.
--- a/docs/source/developer_guides/mixed_models.md
+++ b/docs/source/developer_guides/mixed_models.md
@ -0,0 +1,37 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Mixed adapter types
+
+Normally, it isn't possible to mix different adapter types in 🤗 PEFT. You can create a PEFT model with two different LoRA adapters (which can have different config options), but it is not possible to combine a LoRA and LoHa adapter. With [`PeftMixedModel`] however, this works as long as the adapter types are compatible. The main purpose of allowing mixed adapter types is to combine trained adapters for inference. While it is possible to train a mixed adapter model, this has not been tested and is not recommended.
+
+To load different adapter types into a PEFT model, use [`PeftMixedModel`] instead of [`PeftModel`]:
+
+```py
+from peft import PeftMixedModel
+
+base_model = ...  # load the base model, e.g. from transformers
+# load first adapter, which will be called "default"
+peft_model = PeftMixedModel.from_pretrained(base_model, <path_to_adapter1>)
+peft_model.load_adapter(<path_to_adapter2>, adapter_name="other")
+peft_model.set_adapter(["default", "other"])
+```
+
+The [`~PeftMixedModel.set_adapter`] method is necessary to activate both adapters, otherwise only the first adapter would be active. You can keep adding more adapters by calling [`~PeftModel.add_adapter`] repeatedly.
+
+[`PeftMixedModel`] does not support saving and loading mixed adapters. The adapters should already be trained, and loading the model requires a script to be run each time.
+
+## Tips
+
+- Not all adapter types can be combined. See [`peft.tuners.mixed.COMPATIBLE_TUNER_TYPES`](https://github.com/huggingface/peft/blob/1c1c7fdaa6e6abaa53939b865dee1eded82ad032/src/peft/tuners/mixed/model.py#L35) for a list of compatible types. An error will be raised if you try to combine incompatible adapter types.
+- It is possible to mix multiple adapters of the same type which can be useful for combining adapters with very different configs.
+- If you want to combine a lot of different adapters, the most performant way to do it is to consecutively add the same adapter types. For example, add LoRA1, LoRA2, LoHa1, LoHa2 in this order, instead of LoRA1, LoHa1, LoRA2, and LoHa2. While the order can affect the output, there is no inherently *best* order, so it is best to choose the fastest one.
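
The ordering tip above can be applied with a small helper before loading the adapters. This is only a sketch: `group_by_type` is a hypothetical function, and the adapter names and type labels are placeholders.

```python
def group_by_type(adapters):
    """Reorder (name, adapter_type) pairs so equal types are consecutive.

    Types keep the order of their first appearance, and adapters of the same
    type keep their relative order.
    """
    first_seen = {}
    for _, adapter_type in adapters:
        first_seen.setdefault(adapter_type, len(first_seen))
    # sorted() is stable, so ties keep their original relative order.
    return sorted(adapters, key=lambda item: first_seen[item[1]])

adapters = [("LoRA1", "lora"), ("LoHa1", "loha"), ("LoRA2", "lora"), ("LoHa2", "loha")]
ordered = group_by_type(adapters)
print(ordered)
```

Because the order can affect the output, it is worth fixing the order once and loading the adapters the same way every time.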
--- a/docs/source/developer_guides/model_merging.md
+++ b/docs/source/developer_guides/model_merging.md
@ -0,0 +1,164 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Model merging
+
+Training a separate model for each task can be costly and take up storage space, and the resulting models aren't able to learn new information to improve their performance. Multitask learning can overcome some of these limitations by training a model to learn several tasks, but it is expensive to train and designing a dataset for it is challenging. *Model merging* offers a solution to these challenges by combining multiple pretrained models into one model, giving it the combined abilities of each individual model without any additional training.
+
+PEFT provides several methods for merging models like a linear or SVD combination. This guide focuses on two methods that are more efficient for merging LoRA adapters by eliminating redundant parameters:
+
+* [TIES](https://hf.co/papers/2306.01708) - TrIm, Elect, and Merge (TIES) is a three-step method for merging models. First, redundant parameters are trimmed, then conflicting signs are resolved into an aggregated vector, and finally the parameters whose signs are the same as the aggregate sign are averaged. This method takes into account that some values (redundant and sign disagreement) can degrade performance in the merged model.
+* [DARE](https://hf.co/papers/2311.03099) - Drop And REscale is a method that can be used to prepare for other model merging methods like TIES. It works by randomly dropping parameters according to a drop rate and rescaling the remaining parameters. This helps to reduce the number of redundant and potentially interfering parameters among multiple models.
+
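The two methods can be sketched on flat lists of numbers as follows. This is a simplified illustration of the steps described above, not the PEFT implementation: per-adapter `weights` are ignored, the sign is elected from the sum of the trimmed values, and `trim`, `ties_merge`, and `dare` are hypothetical helper names.

```python
import random

def trim(vector, density):
    """Keep the `density` fraction of entries with the largest magnitude."""
    keep = max(1, round(density * len(vector)))
    threshold = sorted((abs(v) for v in vector), reverse=True)[keep - 1]
    # Entries tied with the threshold are kept as well.
    return [v if abs(v) >= threshold else 0.0 for v in vector]

def ties_merge(task_vectors, density):
    """Trim each vector, elect a sign per position, average agreeing entries."""
    trimmed = [trim(v, density) for v in task_vectors]
    merged = []
    for values in zip(*trimmed):
        sign = 1.0 if sum(values) >= 0 else -1.0   # elected sign
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

def dare(vector, density, rng):
    """Keep each entry with probability `density` and rescale the survivors."""
    return [v / density if rng.random() < density else 0.0 for v in vector]

task_vectors = [
    [4.0, -1.0, 0.5, 3.0],
    [2.0, 3.0, -0.2, -5.0],
]
merged = ties_merge(task_vectors, density=0.5)
print(merged)

rng = random.Random(0)
sparse = dare([1.0] * 8, density=0.5, rng=rng)
print(sparse)
```

In the last position the two vectors disagree in sign (3.0 versus -5.0), so only the entry that matches the elected sign contributes. DARE's rescaling by `1 / density` keeps the expected value of each entry unchanged despite the random drops.
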
+Models are merged with the [`~LoraModel.add_weighted_adapter`] method, and the specific model merging method is specified in the `combination_type` parameter.
+
+## Merge method
+
+With TIES and DARE, merging is enabled by setting `combination_type` to the merge method and `density` to the fraction of weights to keep from the individual models. For example, let's merge three finetuned [TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) models: [tinyllama_lora_norobots](https://huggingface.co/smangrul/tinyllama_lora_norobots), [tinyllama_lora_sql](https://huggingface.co/smangrul/tinyllama_lora_sql), and [tinyllama_lora_adcopy](https://huggingface.co/smangrul/tinyllama_lora_adcopy).
+
+<Tip warning={true}>
+
+When you're attempting to merge fully trained models with TIES, you should be aware of any special tokens each model may have added to the embedding layer which are not a part of the original checkpoint's vocabulary. This may cause an issue because each model may have added a special token to the same embedding position. If this is the case, you should use the [`~transformers.PreTrainedModel.resize_token_embeddings`] method to avoid merging the special tokens at the same embedding index.
+
+<br>
+
+This shouldn't be an issue if you're only merging LoRA adapters trained from the same base model.
+
+</Tip>
+
+Load a base model and use the [`~PeftModel.load_adapter`] method to load each adapter and assign it a name:
+
+```py
+from peft import PeftConfig, PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+config = PeftConfig.from_pretrained("smangrul/tinyllama_lora_norobots")
+model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, load_in_4bit=True, device_map="auto").eval()
+tokenizer = AutoTokenizer.from_pretrained("smangrul/tinyllama_lora_norobots")
+
+model.config.vocab_size = 32005
+model.resize_token_embeddings(32005)
+
+model = PeftModel.from_pretrained(model, "smangrul/tinyllama_lora_norobots", adapter_name="norobots")
+_ = model.load_adapter("smangrul/tinyllama_lora_sql", adapter_name="sql")
+_ = model.load_adapter("smangrul/tinyllama_lora_adcopy", adapter_name="adcopy")
+```
+
+Set the adapters, weights, `adapter_name`, `combination_type`, and `density` with the [`~LoraModel.add_weighted_adapter`] method.
+
+<hfoptions id="merge-method">
+<hfoption id="TIES">
+
+Weight values greater than `1.0` typically produce better results because they preserve the correct scale. A good default starting value for the weights is to set all values to `1.0`.
+
+```py
+adapters = ["norobots", "adcopy", "sql"]
+weights = [2.0, 1.0, 1.0]
+adapter_name = "merge"
+density = 0.2
+model.add_weighted_adapter(adapters, weights, adapter_name, combination_type="ties", density=density)
+```
+
+</hfoption>
+<hfoption id="DARE">
+
+```py
+adapters = ["norobots", "adcopy", "sql"]
+weights = [2.0, 0.3, 0.7]
+adapter_name = "merge"
+density = 0.2
+model.add_weighted_adapter(adapters, weights, adapter_name, combination_type="dare_ties", density=density)
+```
+
+</hfoption>
+</hfoptions>
+
+Set the newly merged model as the active model with the [`~LoraModel.set_adapter`] method.
+
+```py
+model.set_adapter("merge")
+```
+
+Now you can use the merged model as an instruction-tuned model to write ad copy or SQL queries!
+
+<hfoptions id="ties">
+<hfoption id="instruct">
+
+```py
+device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
+messages = [
+    {"role": "user", "content": "Write an essay about Generative AI."},
+]
+text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+inputs = tokenizer(text, return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=0.2, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
+print(tokenizer.decode(outputs[0]))
+```
+
+</hfoption>
+<hfoption id="ad copy">
+
+```py
+device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
+messages = [
+    {"role": "system", "content": "Create a text ad given the following product and description."},
+    {"role": "user", "content": "Product: Sony PS5 PlayStation Console\nDescription: The PS5 console unleashes new gaming possibilities that you never anticipated."},
+]
+text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+inputs = tokenizer(text, return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95, temperature=0.2, repetition_penalty=1.2, eos_token_id=tokenizer.eos_token_id)
+print(tokenizer.decode(outputs[0]))
+```
+
+</hfoption>
+<hfoption id="SQL">
+
+```py
+device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
+
+text = """Table: 2-11365528-2
+Columns: ['Team', 'Head Coach', 'President', 'Home Ground', 'Location']
+Natural Query: Who is the Head Coach of the team whose President is Mario Volarevic?
+SQL Query:"""
+
+inputs = tokenizer(text, return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1, eos_token_id=tokenizer("</s>").input_ids[-1])
+print(tokenizer.decode(outputs[0]))
+```
+
+</hfoption>
+</hfoptions>
+
+
+## Merging (IA)³ Models
+The (IA)³ models facilitate linear merging of adapters. To merge adapters in an (IA)³ model, utilize the `add_weighted_adapter` method from the `IA3Model` class. This method is analogous to the `add_weighted_adapter` method used in `LoraModel`, with the key difference being the absence of the `combination_type` parameter. For example, to merge three (IA)³ adapters into a PEFT model, you would proceed as follows:
+
+```py
+adapters = ["adapter1", "adapter2", "adapter3"]
+weights = [0.4, 0.3, 0.3]
+adapter_name = "merge"
+model.add_weighted_adapter(adapters, weights, adapter_name)
+```
+
+It is recommended that the weights sum to 1.0 to preserve the scale of the model. The merged model can then be set as the active model using the `set_adapter` method:
+
+```py
+model.set_adapter("merge")
+```
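If your raw weights don't already sum to 1.0, one option is to normalize them before passing them to `add_weighted_adapter`. The sketch below assumes non-negative weights; `normalize_weights` is a hypothetical helper and not part of PEFT.

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to 1.0 (assumes a positive total)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


# Relative importance 2:1:1 becomes 0.5, 0.25, 0.25.
weights = normalize_weights([2.0, 1.0, 1.0])
```

The normalized list can then be used as the `weights` argument in the merge call shown earlier.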
--- a/docs/source/developer_guides/quantization.md
+++ b/docs/source/developer_guides/quantization.md
@ -0,0 +1,294 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Quantization
+
+Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially for large language models (LLMs). There are several ways to quantize a model, including:
+
+* optimizing which model weights are quantized with the [AWQ](https://hf.co/papers/2306.00978) algorithm
+* independently quantizing each row of a weight matrix with the [GPTQ](https://hf.co/papers/2210.17323) algorithm
+* quantizing to 8-bit and 4-bit precision with the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library
+* quantizing to as low as 2-bit precision with the [AQLM](https://huggingface.co/papers/2401.06118) algorithm
+
+However, a quantized model typically isn't trained further for downstream tasks because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add *extra* trainable parameters, you can train a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, [QLoRA](https://hf.co/papers/2305.14314) is a method that quantizes a model to 4 bits and then trains it with LoRA. This method allows you to finetune a 65B parameter model on a single 48GB GPU!
+
+In this guide, you'll see how to quantize a model to 4 bits and train it with LoRA.
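A rough back-of-the-envelope calculation shows why a 65B parameter model can fit on a 48GB GPU once quantized. The sketch below counts only the memory for the base weights, in decimal gigabytes; it deliberately ignores quantization constants, activations, the adapter, and optimizer state, so real usage is higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(65e9, 16)  # half-precision weights
int4_gb = weight_memory_gb(65e9, 4)   # 4-bit quantized weights
print(f"fp16: {fp16_gb} GB, 4-bit: {int4_gb} GB")
```

Under these assumptions the half-precision weights alone need 130 GB, while the 4-bit weights need 32.5 GB, which leaves headroom on a 48GB device for the adapter and activations.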
+
+## Quantize a model
+
+[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the [`~transformers.BitsAndBytesConfig`] class. For example, you can:
+
+* set `load_in_4bit=True` to quantize the model to 4 bits when you load it
+* set `bnb_4bit_quant_type="nf4"` to use a special 4-bit data type for weights initialized from a normal distribution
+* set `bnb_4bit_use_double_quant=True` to use a nested quantization scheme to quantize the already quantized weights
+* set `bnb_4bit_compute_dtype=torch.bfloat16` to use bfloat16 for faster computation
+
+```py
+import torch
+from transformers import BitsAndBytesConfig
+
+config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+```
+
+Pass the `config` to the [`~transformers.AutoModelForCausalLM.from_pretrained`] method.
+
+```py
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", quantization_config=config)
+```
+
+Next, you should call the [`~peft.utils.prepare_model_for_kbit_training`] function to preprocess the quantized model for training.
+
+```py
+from peft import prepare_model_for_kbit_training
+
+model = prepare_model_for_kbit_training(model)
+```
+
+Now that the quantized model is ready, let's set up a configuration.
+
+## LoraConfig
+
+Create a [`LoraConfig`] with the following parameters (or choose your own):
+
+```py
+from peft import LoraConfig
+
+config = LoraConfig(
+    r=16,
+    lora_alpha=8,
+    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM"
+)
+```
+
+Then use the [`get_peft_model`] function to create a [`PeftModel`] from the quantized model and configuration.
+
+```py
+from peft import get_peft_model
+
+model = get_peft_model(model, config)
+```
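To get a feel for how small the adapter is, you can estimate the number of trainable LoRA parameters by hand: each targeted linear layer gains two matrices, `A` of shape `(r, in_features)` and `B` of shape `(out_features, r)`. The sketch below uses the `r` and `lora_alpha` values from the config above. The layer shapes are assumptions for a Mistral-7B-like model (hidden size 4096, key/value projection width 1024, 32 decoder layers) and are not stated in this guide; `lora_params` is a hypothetical helper.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by LoRA to one linear layer: A is (r, in), B is (out, r)."""
    return r * (in_features + out_features)


# Assumed shapes for a Mistral-7B-like model; check your model's config for real values.
hidden, kv_dim, num_layers = 4096, 1024, 32
r, lora_alpha = 16, 8

shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}
per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
scaling = lora_alpha / r  # default LoRA scaling applied to the adapter output

print(f"{total:,} trainable LoRA parameters, scaling = {scaling}")
```

On an actual [`PeftModel`], `model.print_trainable_parameters()` reports the real count, which is the number to trust.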
+
+You're all set for training with whichever training method you prefer!
+
+### LoftQ initialization
+
+[LoftQ](https://hf.co/papers/2310.08659) initializes LoRA weights such that the quantization error is minimized, and it can improve performance when training quantized models. To get started, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).
+
+In general, for LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied. This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results. Also, you should use `nf4` as the quant type in your quantization config when using 4-bit quantization, i.e. `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`.
+
+### QLoRA-style training
+
+QLoRA adds trainable weights to all the linear layers in the transformer architecture. Since the attribute names for these linear layers can vary across architectures, set `target_modules` to `"all-linear"` to add LoRA to all the linear layers:
+
+```py
+config = LoraConfig(target_modules="all-linear", ...)
+```
+
+## GPTQ quantization
+
+You can learn more about GPTQ-based `[2, 3, 4, 8]`-bit quantization at [GPTQModel](https://github.com/ModelCloud/GPTQModel) and in the Transformers [GPTQ](https://huggingface.co/docs/transformers/quantization/gptq) doc. For post-quantization training, PEFT can use either the [GPTQModel](https://github.com/ModelCloud/GPTQModel) or the [AutoGPTQ](https://github.com/autogptq/autogptq) library, but we recommend GPTQModel because AutoGPTQ will be deprecated in a future release.
+
+```bash
+# gptqmodel install
+pip install gptqmodel --no-build-isolation
+```
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
+
+model_id = "facebook/opt-125m"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
+
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
+
+# save quantized model
+quantized_model.save_pretrained("./opt-125m-gptq")
+tokenizer.save_pretrained("./opt-125m-gptq")
+```
+
+Once quantized, you can post-train GPTQ models with PEFT APIs.
+
+## AQLM quantization
+
+Additive Quantization of Language Models ([AQLM](https://huggingface.co/papers/2401.06118)) is a compression method for large language models. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. This allows it to compress models down to as low as 2-bit precision with comparatively small accuracy loss.
+
+Since the AQLM quantization process is computationally expensive, the use of prequantized models is recommended. A partial list of available models can be found in the official aqlm [repository](https://github.com/Vahe1994/AQLM).
+
+The models support LoRA adapter tuning. To tune the quantized model, you'll need to install the `aqlm` inference library: `pip install "aqlm>=1.0.2"` (the quotes prevent the shell from interpreting `>=` as a redirect). Finetuned LoRA adapters should be saved separately, as merging them with AQLM quantized weights is not possible.
+
+```py
+from peft import LoraConfig, get_peft_model
+from transformers import AutoModelForCausalLM
+
+quantized_model = AutoModelForCausalLM.from_pretrained(
+    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch",
+    dtype="auto", device_map="auto", low_cpu_mem_usage=True,
+)
+
+peft_config = LoraConfig(...)
+
+quantized_model = get_peft_model(quantized_model, peft_config)
+```
+
+You can refer to the [Google Colab](https://colab.research.google.com/drive/12GTp1FCj5_0SnnNQH18h_2XFh9vS_guX?usp=sharing) example for an overview of AQLM+LoRA finetuning.
+
+## EETQ quantization
+
+You can also perform LoRA fine-tuning on EETQ quantized models. The [EETQ](https://github.com/NetEase-FuXi/EETQ) package offers a simple and efficient way to perform 8-bit quantization, which is claimed to be faster than the `LLM.int8()` algorithm. First, make sure that you have a Transformers version that is compatible with EETQ (e.g. by installing the latest release from PyPI or from source).
+
+```py
+import torch
+from transformers import EetqConfig
+
+config = EetqConfig("int8")
+```
+
+Pass the `config` to the [`~transformers.AutoModelForCausalLM.from_pretrained`] method.
+
+```py
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", quantization_config=config)
+```
+
+Then create a `LoraConfig` and pass it to `get_peft_model`:
+
+```py
+from peft import LoraConfig, get_peft_model
+
+config = LoraConfig(
+    r=16,
+    lora_alpha=8,
+    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM"
+)
+
+model = get_peft_model(model, config)
+```
+
+## HQQ quantization
+
+The models that are quantized using Half-Quadratic Quantization of Large Machine Learning Models ([HQQ](https://mobiusml.github.io/hqq_blog/)) support LoRA adapter tuning. To tune the quantized model, you'll need to install the `hqq` library with: `pip install hqq`.
+
+```python
+import torch
+from hqq.engine.hf import HQQModelForCausalLM
+from peft import LoraConfig, get_peft_model
+
+device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
+
+quantized_model = HQQModelForCausalLM.from_quantized(save_dir_or_hfhub, device=device)
+peft_config = LoraConfig(...)
+quantized_model = get_peft_model(quantized_model, peft_config)
+```
+
+Alternatively, use a Transformers version that is compatible with HQQ (e.g. by installing the latest release from PyPI or from source):
+
+```python
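+from peft import LoraConfig, get_peft_model
+from transformers import HqqConfig, AutoModelForCausalLM
+
+quant_config = HqqConfig(nbits=4, group_size=64)
+# `save_dir_or_hfhub` and `device_map` are placeholders: set them to your model path or Hub ID and your device map
+quantized_model = AutoModelForCausalLM.from_pretrained(save_dir_or_hfhub, device_map=device_map, quantization_config=quant_config)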
+peft_config = LoraConfig(...)
+quantized_model = get_peft_model(quantized_model, peft_config)
+```
+
+## torchao (PyTorch Architecture Optimization)
+
+PEFT supports models quantized with [torchao](https://github.com/pytorch/ao) ("ao") for int8 quantization.
+
+```python
+from peft import LoraConfig, get_peft_model
+from transformers import AutoModelForCausalLM, TorchAoConfig
+
+model_id = ...
+quantization_config = TorchAoConfig(quant_type="int8_weight_only")
+base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
+peft_config = LoraConfig(...)
+model = get_peft_model(base_model, peft_config)
+```
+
+### Caveats:
+
+- Use the most recent versions of torchao (>= v0.4.0) and transformers (> 4.42).
+- Only linear layers are currently supported.
+- `quant_type = "int4_weight_only"` is currently not supported.
+- `NF4` is not implemented in transformers as of yet and is thus also not supported.
+- DoRA only works with `quant_type = "int8_weight_only"` at the moment.
+- There is explicit support for torchao when used with LoRA. However, when torchao quantizes a layer, its class does not change, only the type of the underlying tensor. For this reason, PEFT methods other than LoRA will generally also work with torchao, even if not explicitly supported. Be aware, however, that **merging only works correctly with LoRA and with `quant_type = "int8_weight_only"`**. If you use a different PEFT method or dtype, merging will likely result in an error, and even if it doesn't, the results will still be incorrect.
+
+## INC quantization
+
+Intel Neural Compressor ([INC](https://github.com/intel/neural-compressor)) enables model quantization for various devices,
+including Intel Gaudi accelerators (also known as HPU devices). You can perform LoRA fine-tuning on models that have been
+quantized using INC. To use INC with PyTorch models, install the library with: `pip install neural-compressor[pt]`.
+Quantizing a model to FP8 precision for HPU devices can be done with the following single-step quantization workflow:
+
+```python
+import torch
+from neural_compressor.torch.quantization import FP8Config, convert, finalize_calibration, prepare
+quant_configs = {
+    ...
+}
+config = FP8Config(**quant_configs)
+```
+
+Pass the config to the `prepare` method, run inference to gather calibration statistics, and then call the `finalize_calibration`
+and `convert` methods to quantize the model to FP8 precision:
+
+```python
+model = prepare(model, config)
+# Run inference to collect calibration statistics
+...
+# Finalize calibration and convert the model to FP8 precision
+finalize_calibration(model)
+model = convert(model)
+# Load PEFT LoRA adapter as usual
+...
+```
+
+An example demonstrating how to load a PEFT LoRA adapter into an INC-quantized FLUX text-to-image model for HPU
+devices is provided [here](https://github.com/huggingface/peft/blob/main/examples/stable_diffusion/inc_flux_lora_hpu.py).
+
+
+### Caveats:
+
+- `merge()` and `unmerge()` methods are currently not supported for INC-quantized models.
+- Currently, only **Linear** INC-quantized layers are supported when loading PEFT adapters.
+
+## Other Supported PEFT Methods
+
+Besides LoRA, the following PEFT methods also support quantization:
+
+- **VeRA** (supports bitsandbytes quantization)
+- **AdaLoRA** (supports both bitsandbytes and GPTQ quantization)
+- **(IA)³** (supports bitsandbytes quantization)
+
+## Next steps
+
+If you're interested in learning more about quantization, the following may be helpful:
+
+* Learn more details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
+* Read more about different quantization schemes in the Transformers [Quantization](https://hf.co/docs/transformers/main/quantization) guide.
--- a/docs/source/developer_guides/torch_compile.md
+++ b/docs/source/developer_guides/torch_compile.md
@ -0,0 +1,71 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# torch.compile
+
+In PEFT, [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) works for some but not all features. It doesn't always work because PEFT is highly dynamic in certain places (loading and switching between multiple adapters, for instance), which can cause trouble for `torch.compile`. In other places, `torch.compile` may work, but won't be as fast as expected because of graph breaks.
+
+The absence of an error doesn't necessarily mean that `torch.compile` worked correctly: it might produce an output that is incorrect. This guide describes what works with `torch.compile` and what doesn't. For your own testing, we recommend using the latest PyTorch version, as `torch.compile` is constantly being improved.
+
+> [!TIP]
+> Unless indicated otherwise, the default `torch.compile` settings were used.
+
+## Training and inference with `torch.compile`
+
+These features **work** with `torch.compile`. Everything listed below was tested with a causal LM:
+
+- Training with `Trainer` from 🤗 transformers
+- Training with a custom PyTorch loop
+- Inference
+- Generation
+
+The following adapters were tested successfully:
+
+- AdaLoRA
+- BOFT
+- Bone
+- IA³
+- Layer Norm Tuning
+- LoHa
+- LoKr
+- LoRA
+- LoRA + DoRA
+- LoRA applied to embedding layers
+- OFT
+- VeRA
+- HRA
+
+## Advanced PEFT features with `torch.compile`
+
+Below are some of the more advanced PEFT features that **work**. They were all tested with LoRA.
+
+- `modules_to_save` (i.e. `config = LoraConfig(..., modules_to_save=...)`)
+- Merging adapters (one or multiple)
+- Merging multiple adapters into one adapter (i.e. calling `model.add_weighted_adapter(...)`)
+- Using PEFT adapters with quantization (bitsandbytes)
+- Disabling adapters (i.e. using `with model.disable_adapter()`)
+- Unloading (i.e. calling `model.merge_and_unload()`)
+- Mixed adapter batches (i.e. calling `model(batch, adapter_names=["__base__", "default", "other", ...])`)
+- Inference with multiple adapters (i.e. using `model.add_adapter` or `model.load_adapter` to load more than 1 adapter); for this, only call `torch.compile` _after_ loading all adapters
+
+Generally, we can expect that if a feature works correctly with LoRA and is also supported by other adapter types, it should also work for that adapter type.
+
+## Test cases
+
+All the use cases listed above are tested inside of [`peft/tests/test_torch_compile.py`](https://github.com/huggingface/peft/blob/main/tests/test_torch_compile.py). If you want to check in more detail how we tested a certain feature, please go to that file and check the test that corresponds to your use case.
+
+> [!TIP]
+> If you have another use case where you know that `torch.compile` does or does not work with PEFT, please contribute by letting us know or by opening a PR to add this use case to the covered test cases.
--- a/docs/source/developer_guides/troubleshooting.md
+++ b/docs/source/developer_guides/troubleshooting.md
@ -0,0 +1,458 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Troubleshooting
+
+If you encounter any issue when using PEFT, please check the following list of common issues and their solutions.
+
+## Examples don't work
+
+Examples often rely on the most recent package versions, so please ensure they're up-to-date. In particular, check the following package versions:
+
+- `peft`
+- `transformers`
+- `accelerate`
+- `torch`
+
+In general, you can update the package version by running this command inside your Python environment:
+
+```bash
+python -m pip install -U <package_name>
+```
+
+Installing PEFT from source is useful for keeping up with the latest developments:
+
+```bash
+python -m pip install git+https://github.com/huggingface/peft
+```
+
+## Dtype-related issues
+
+### ValueError: Attempting to unscale FP16 gradients
+
+This error probably occurred because the model was loaded with `dtype=torch.float16` and then used in an automatic mixed precision (AMP) context, e.g. by setting `fp16=True` in the [`~transformers.Trainer`] class from 🤗 Transformers. The reason is that when using AMP, trainable weights should never use fp16. To make this work without loading the whole model in fp32, add the following to your code:
+
+```python
+peft_model = get_peft_model(...)
+
+# add this:
+for param in peft_model.parameters():
+    if param.requires_grad:
+        param.data = param.data.float()
+
+# proceed as usual
+trainer = Trainer(model=peft_model, fp16=True, ...)
+trainer.train()
+```
+
+Alternatively, you can use the [`~utils.cast_mixed_precision_params`] function to correctly cast the weights:
+
+```python
+import torch
+from peft import cast_mixed_precision_params
+
+peft_model = get_peft_model(...)
+cast_mixed_precision_params(peft_model, dtype=torch.float16)
+
+# proceed as usual
+trainer = Trainer(model=peft_model, fp16=True, ...)
+trainer.train()
+```
+
+> [!TIP]
+> Starting from PEFT version v0.12.0, PEFT automatically promotes the dtype of adapter weights from `torch.float16` and `torch.bfloat16` to `torch.float32` where appropriate. To _prevent_ this behavior, you can pass `autocast_adapter_dtype=False` to [`~get_peft_model`], to [`~PeftModel.from_pretrained`], and to [`~PeftModel.load_adapter`].
+
+### Selecting the dtype of the adapter
+
+Most PEFT methods, like LoRA, work by adding trainable adapter weights. By default, those weights are stored in float32 dtype (fp32), i.e. at a relatively high precision. Therefore, even if the base model is loaded in float16 (fp16) or bfloat16 (bf16), the adapter weights are float32. When the adapter results are calculated during the forward pass, the input will typically be in the dtype of the base model, thus it will be upcast to float32 if necessary, then cast back to the original dtype.
+
+If you prefer to have the adapter weights in the lower precision of the base model, i.e. in float16 or bfloat16, you can pass `autocast_adapter_dtype=False` when creating the model ([`~get_peft_model`]) or loading the model ([`~PeftModel.from_pretrained`]). There are some advantages and disadvantages to this:
+
+Advantages of half precision adapter:
+- computation slightly faster
+- slightly less memory
+- smaller file size of checkpoint (half the size)
+
+Disadvantages of half precision adapter:
+- slightly worse loss
+- higher risk of overflow or underflow
+
+Note that for most use cases, overall runtime and memory cost will be determined by the size of the base model and by the dataset, while the dtype of the PEFT adapter will only have a small impact.
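The overflow and underflow risk mentioned above can be illustrated without any deep learning library. The sketch below round-trips Python floats through IEEE 754 half precision (float16) using the standard library's `struct` module; `to_fp16` is a hypothetical helper. It only covers float16, since `struct` has no bfloat16 format, and it maps the `OverflowError` that `struct` raises to infinity to mimic how a tensor library would typically saturate.

```python
import struct


def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the float16 range; report them as infinity here.
        return float("inf") if x > 0 else float("-inf")


too_large = to_fp16(70000.0)  # beyond the largest finite float16 value (65504)
too_small = to_fp16(1e-8)     # below the smallest float16 subnormal, flushes to zero
rounded = to_fp16(0.1)        # representable only approximately

print(too_large, too_small, rounded)
```

Keeping the adapter weights in float32 avoids these effects for the trainable parameters, at the cost of the slightly higher memory and compute listed above.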
+
+## Bad results from a loaded PEFT model
+
+There can be several reasons for getting a poor result from a loaded PEFT model which are listed below. If you're still unable to troubleshoot the problem, see if anyone else had a similar [issue](https://github.com/huggingface/peft/issues) on GitHub, and if you can't find any, open a new issue.
+
+When opening an issue, it helps a lot if you provide a minimal code example that reproduces the issue. Also, please report if the loaded model performs at the same level as the model did before fine-tuning, if it performs at a random level, or if it is only slightly worse than expected. This information helps us identify the problem more quickly.
+
+### Random deviations
+
+If your model outputs are not exactly the same as previous runs, there could be an issue with random elements. For example:
+
+1. Ensure the model is in `.eval()` mode, which is important, for instance, if the model uses dropout.
+2. If you use [`~transformers.GenerationMixin.generate`] on a language model, there could be random sampling, so obtaining the same result requires setting a random seed.
+3. If you used quantization and merged the weights, small deviations are expected due to rounding errors.
+
+### Incorrectly loaded model
+
+Please ensure that you load the model correctly. A common error is trying to load a _trained_ model with [`get_peft_model`], which is incorrect. Instead, the loading code should look like this:
+
+```python
+from peft import PeftModel, PeftConfig
+
+base_model = ...  # to load the base model, use the same code as when you trained it
+config = PeftConfig.from_pretrained(peft_model_id)
+peft_model = PeftModel.from_pretrained(base_model, peft_model_id)
+```
+
+### Randomly initialized layers
+
+For some tasks, it is important to correctly configure `modules_to_save` in the config to account for randomly initialized layers. 
+
+As an example, this is necessary if you use LoRA to fine-tune a language model for sequence classification because 🤗 Transformers adds a randomly initialized classification head on top of the model. If you do not add this layer to `modules_to_save`, the classification head won't be saved. The next time you load the model, you'll get a _different_ randomly initialized classification head, resulting in completely different results.
+
+PEFT tries to correctly guess the `modules_to_save` if you provide the `task_type` argument in the config. This should work for transformers models that follow the standard naming scheme. It is always a good idea to double check though because we can't guarantee all models follow the naming scheme.
+
+When you load a transformers model that has randomly initialized layers, you should see a warning along the lines of:
+
+```
+Some weights of <MODEL> were not initialized from the model checkpoint at <ID> and are newly initialized: [<LAYER_NAMES>].
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+```
+
+The mentioned layers should be added to `modules_to_save` in the config to avoid the described problem.
+
+> [!TIP]
+> As an example, when loading a model that is using the DeBERTa architecture for sequence classification, you'll see a warning that the following weights are newly initialized: `['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']`. From this, it follows that the `classifier` and `pooler` layers should be added to: `modules_to_save=["classifier", "pooler"]`.
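Deriving the module names from the warning can be sketched as follows. `modules_from_warning` is a hypothetical helper, not part of PEFT, and it assumes that the first component of each parameter name is the module you want to save, as in the DeBERTa example. For models whose new layers are nested more deeply, pick the names manually.

```python
def modules_from_warning(layer_names):
    """Return the unique top-level module names, preserving first-seen order."""
    seen = []
    for name in layer_names:
        top_level = name.split(".", 1)[0]
        if top_level not in seen:
            seen.append(top_level)
    return seen


# Parameter names copied from the "newly initialized" warning
newly_initialized = [
    "classifier.bias",
    "classifier.weight",
    "pooler.dense.bias",
    "pooler.dense.weight",
]
modules_to_save = modules_from_warning(newly_initialized)
```

The resulting list can be passed as `modules_to_save` in the adapter config.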
+
+### Extending the vocabulary
+
+For many language fine-tuning tasks, extending the model's vocabulary is necessary since new tokens are being introduced. This requires extending the embedding layer to account for the new tokens and, depending on the fine-tuning method, also storing the embedding layer in addition to the adapter weights when saving the adapter. There are a few ways of achieving this, ordered from most to least parameter-efficient:
+
+- [trainable tokens](../package_reference/trainable_tokens), train only the specified tokens, optionally store only the updated values
+- training an adapter on the embedding matrix, optionally store only the updated values
+- full-finetuning of the embedding layer
+
+#### Using trainable tokens
+
+Let's start with trainable tokens, specifically their [LoRA integration](../developer_guides/lora#efficiently-train-tokens-alongside-lora). If you're interested in only training the new embeddings and nothing else, refer to the [standalone documentation](../package_reference/trainable_tokens).
+
+To enable selective token training of the embedding layer, you'll need to supply the token ids of your newly added tokens via the `trainable_token_indices` parameter.  Optionally you can specify which layer to target if there is more than one embedding layer. For a Mistral model this could look like this:
+
+```python
+new_tokens = ['<think>', '</think>']
+tokenizer.add_tokens(new_tokens)
+base_model.resize_token_embeddings(len(tokenizer))
+
+lora_config = LoraConfig(
+    ...,
+    trainable_token_indices={'embed_tokens': tokenizer.convert_tokens_to_ids(new_tokens)},
+)
+```
+
+If your model uses tied weights (such as the `lm_head`), trainable tokens will try to resolve those and keep them updated as well, so in that case there should be no need for adding `modules_to_save=["lm_head"]`. This only works if the model uses the Transformers convention for tying weights.
+
+Saving the model with `model.save_pretrained` may save the full embedding matrix instead of only the difference, as a precaution because the embedding matrix was resized. To save space, you can disable this behavior by setting `save_embedding_layers=False` when calling `save_pretrained`. This is safe to do as long as you don't modify the embedding matrix through other means as well, as such changes will not be tracked by trainable tokens.
+
+#### Using an adapter, e.g. LoRA
+
+Prepare the embedding layer by adding it to the `target_modules` of your adapter config. For example, the Mistral config could look like this:
+
+```python
+config = LoraConfig(..., target_modules=["embed_tokens", "lm_head", "q_proj", "v_proj"])
+```
+
+Once added to `target_modules`, PEFT automatically stores the embedding layer when saving the adapter if the model has the [`~transformers.PreTrainedModel.get_input_embeddings`] and [`~transformers.PreTrainedModel.get_output_embeddings`] methods. This is generally the case for Transformers models.
+
+If the model's embedding layer doesn't follow the Transformers naming scheme but nevertheless implements `get_input_embeddings`, you can still save it by manually passing `save_embedding_layers=True` when saving the adapter:
+
+```python
+model = get_peft_model(...)
+# train the model
+model.save_pretrained("my_adapter", save_embedding_layers=True)
+```
+
+For inference, load the base model first and resize it the same way you did before you trained the model. After you've resized the base model, you can load the PEFT checkpoint.
+
+For a complete example, please check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_with_additional_tokens.ipynb).
+
+#### Full fine-tuning
+
+Full fine-tuning is more costly in terms of VRAM or storage space but if all else fails, you can fall back to this and see if it works for you. Achieve it by adding the name of the embedding layer to `modules_to_save`. Note that you need to add tied layers as well, e.g. `lm_head`. Example for a Mistral model with LoRA:
+
+```python
+config = LoraConfig(..., modules_to_save=["embed_tokens", "lm_head"], target_modules=["q_proj", "v_proj"])
+```
+
+### Getting a warning about "weights not being initialized from the model checkpoint"
+
+When you load your PEFT model which has been trained on a task (for example, classification), you may get a warning like:
+
+> Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Llama-3.2-1B and are newly initialized: ['score.weight']. You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+
+Although this looks scary, it is most likely nothing to worry about. This warning comes from Transformers, and it isn't a PEFT specific warning. It lets you know that a randomly initialized classification head (`score`) is attached to the base model, and the head must be trained to produce sensible predictions.
+
+When you get this warning _before_ training the model, PEFT automatically takes care of making the classification head trainable if you correctly passed the `task_type` argument to the PEFT config.
+
+```python
+from peft import LoraConfig, TaskType
+
+lora_config = LoraConfig(..., task_type=TaskType.SEQ_CLS)
+```
+
+If your classification head does not follow the usual naming conventions from Transformers (which is rare), you have to explicitly tell PEFT the name of the head in `modules_to_save`.
+
+```python
+lora_config = LoraConfig(..., modules_to_save=["name-of-classification-head"])
+```
+
+To check the name of the classification head, print the model; the head should be the last module.
+
+If you get this warning from your inference code, i.e. _after_ training the model, it is expected: to load a PEFT model, you always have to load the Transformers model first. Since Transformers does not know that you will load PEFT weights afterwards, it still gives the warning.
+
+As always, it is best practice to ensure the model works correctly for inference by running some validation on it.
+
+### Check layer and model status
+
+Sometimes a PEFT model can end up in a bad state, especially when handling multiple adapters. There can be some confusion around what adapters exist, which one is active, which one is merged, etc. To help investigate this issue, call the [`~peft.PeftModel.get_layer_status`] and the [`~peft.PeftModel.get_model_status`] methods. 
+
+The [`~peft.PeftModel.get_layer_status`] method gives you a detailed overview of each targeted layer's active, merged, and available adapters.
+
+```python
+>>> from transformers import AutoModel
+>>> from peft import get_peft_model, LoraConfig
+
+>>> model_id = "google/flan-t5-small"
+>>> model = AutoModel.from_pretrained(model_id)
+>>> model = get_peft_model(model, LoraConfig())
+
+>>> model.get_layer_status()
+[TunerLayerStatus(name='model.encoder.block.0.layer.0.SelfAttention.q',
+                  module_type='lora.Linear',
+                  enabled=True,
+                  active_adapters=['default'],
+                  merged_adapters=[],
+                  requires_grad={'default': True},
+                  available_adapters=['default']),
+ TunerLayerStatus(name='model.encoder.block.0.layer.0.SelfAttention.v',
+                  module_type='lora.Linear',
+                  enabled=True,
+                  active_adapters=['default'],
+                  merged_adapters=[],
+                  requires_grad={'default': True},
+                  available_adapters=['default']),
+...]
+
+>>> model.get_model_status()
+TunerModelStatus(
+    base_model_type='T5Model',
+    adapter_model_type='LoraModel',
+    peft_types={'default': 'LORA'},
+    trainable_params=344064,
+    total_params=60855680,
+    num_adapter_layers=48,
+    enabled=True,
+    active_adapters=['default'],
+    merged_adapters=[],
+    requires_grad={'default': True},
+    available_adapters=['default'],
+)
+```
+
+In the model status output, look out for entries that say `"irregular"`. This means PEFT detected an inconsistent state in the model. For instance, `merged_adapters="irregular"` means that at least one adapter was merged on some target modules but not on others. The inference results will most likely be incorrect as a result.
+
+The best way to resolve this issue is to reload the whole model and adapter checkpoint(s). Ensure that you don't perform any incorrect operations on the model, e.g. manually merging adapters on some modules but not others.
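
Scanning a status object for `"irregular"` entries can be automated. The helper below is a minimal sketch, not a PEFT API: `irregular_fields` is a hypothetical name, and it assumes the status object is a dataclass (as the use of `asdict` on layer statuses elsewhere in this guide suggests). Checking the values inside dict-valued fields such as `requires_grad` is also an assumption about where the flag can appear.

```python
from dataclasses import asdict


def irregular_fields(status):
    """Return the names of status fields that are flagged as "irregular"."""
    flagged = []
    for name, value in asdict(status).items():
        if value == "irregular":
            flagged.append(name)
        # Assumption: per-adapter dicts may carry the flag on single entries.
        elif isinstance(value, dict) and "irregular" in value.values():
            flagged.append(name)
    return flagged
```

With a PEFT model, this would be called as `irregular_fields(model.get_model_status())`; an empty list means no field was flagged.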
+
+Convert the layer status into a pandas `DataFrame` for an easier visual inspection.
+
+```python
+from dataclasses import asdict
+import pandas as pd
+
+df = pd.DataFrame(asdict(layer) for layer in model.get_layer_status())
+```
+
+It is possible to get this information for non-PEFT models if they are using PEFT layers under the hood, but some information like the `base_model_type` or the `peft_types` cannot be determined in that case. As an example, you can call this on a [diffusers](https://huggingface.co/docs/diffusers/index) model like so:
+
+```python
+>>> import torch
+>>> from diffusers import StableDiffusionPipeline
+>>> from peft import get_model_status, get_layer_status
+
+>>> path = "runwayml/stable-diffusion-v1-5"
+>>> lora_id = "takuma104/lora-test-text-encoder-lora-target"
+>>> pipe = StableDiffusionPipeline.from_pretrained(path, dtype=torch.float16)
+>>> pipe.load_lora_weights(lora_id, adapter_name="adapter-1")
+>>> pipe.load_lora_weights(lora_id, adapter_name="adapter-2")
+>>> pipe.set_lora_device(["adapter-2"], "cuda")
+>>> get_layer_status(pipe.text_encoder)
+[TunerLayerStatus(name='text_model.encoder.layers.0.self_attn.k_proj',
+                  module_type='lora.Linear',
+                  enabled=True,
+                  active_adapters=['adapter-2'],
+                  merged_adapters=[],
+                  requires_grad={'adapter-1': False, 'adapter-2': True},
+                  available_adapters=['adapter-1', 'adapter-2'],
+                  devices={'adapter-1': ['cpu'], 'adapter-2': ['cuda']}),
+ TunerLayerStatus(name='text_model.encoder.layers.0.self_attn.v_proj',
+                  module_type='lora.Linear',
+                  enabled=True,
+                  active_adapters=['adapter-2'],
+                  merged_adapters=[],
+                  requires_grad={'adapter-1': False, 'adapter-2': True},
+                  available_adapters=['adapter-1', 'adapter-2'],
+                  devices={'adapter-1': ['cpu'], 'adapter-2': ['cuda']}),
+...]
+
+>>> get_model_status(pipe.unet)
+TunerModelStatus(
+    base_model_type='other',
+    adapter_model_type='None',
+    peft_types={},
+    trainable_params=797184,
+    total_params=861115332,
+    num_adapter_layers=128,
+    enabled=True,
+    active_adapters=['adapter-2'],
+    merged_adapters=[],
+    requires_grad={'adapter-1': False, 'adapter-2': True},
+    available_adapters=['adapter-1', 'adapter-2'],
+    devices={'adapter-1': ['cpu'], 'adapter-2': ['cuda']},
+)
+```
+
+## Speed
+
+### Loading adapter weights is slow
+
+Loading adapters like LoRA weights should generally be fast compared to loading the base model. However, there can be use cases where the adapter weights are quite large or where users need to load a large number of adapters -- the loading time can add up in this case. The reason for this is that the adapter weights are first initialized and then overridden by the loaded weights, which is wasteful. To speed up the loading time, you can pass the `low_cpu_mem_usage=True` argument to [`~PeftModel.from_pretrained`] and [`~PeftModel.load_adapter`].
+
+> [!TIP]
+> If this option works well across different use cases, it may become the default for adapter loading in the future.
+
+
+## Reproducibility
+
+### Models using batch norm
+
+When loading a trained PEFT model where the base model uses batch norm (e.g. `torch.nn.BatchNorm1d` or `torch.nn.BatchNorm2d`), you may find that you cannot reproduce the exact same outputs. This is because the batch norm layers keep track of running stats during training, but these stats are not part of the PEFT checkpoint. Therefore, when you load the PEFT model, the running stats of the base model will be used (i.e. from before training with PEFT).
+
+Depending on your use case, this may not be a big deal. If, however, you need your outputs to be 100% reproducible, you can achieve this by adding the batch norm layers to `modules_to_save`. Below is an example of this using resnet and LoRA. Notice that we set `modules_to_save=["classifier", "normalization"]`. We need the `"classifier"` argument because our task is image classification, and we add the `"normalization"` argument to ensure that the batch norm layers are saved in the PEFT checkpoint.
+
+```python
+from transformers import AutoModelForImageClassification
+from peft import LoraConfig, get_peft_model
+
+model_id = "microsoft/resnet-18"
+base_model = AutoModelForImageClassification.from_pretrained(model_id)
+config = LoraConfig(
+    target_modules=["convolution"],
+    modules_to_save=["classifier", "normalization"],
+)
+model = get_peft_model(base_model, config)
+```
+
+Depending on the type of model you use, the batch norm layers could have different names than `"normalization"`, so please ensure that the name matches your model architecture.
+
+## Version mismatch
+
+### Error while loading the config because of an unexpected keyword argument
+
+When you encounter an error like the one shown below, it means the adapter you're trying to load was trained with a more recent version of PEFT than the version you have installed on your system.
+
+```
+TypeError: LoraConfig.__init__() got an unexpected keyword argument <argument-name>
+```
+
+The best way to resolve this issue is to install the latest PEFT version:
+
+```sh
+python -m pip install -U peft
+```
+
+If the adapter was trained from a source install of PEFT (an unreleased version of PEFT), then you also need to install PEFT from source.
+
+```sh
+python -m pip install -U git+https://github.com/huggingface/peft.git
+```
+
+If it is not possible for you to upgrade PEFT, there is a workaround you can try.
+
+Assume the error message says that the unknown keyword argument is named `foobar`. Search inside the `adapter_config.json` of this PEFT adapter for the `foobar` entry and delete it from the file. Then save the file and try loading the model again.
+
+This workaround works most of the time. As long as `foobar` was set to its default value, the entry can safely be ignored. However, if it was set to some other value, you will get incorrect results, which is why upgrading PEFT is the recommended solution.
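
The manual edit of `adapter_config.json` can also be scripted. The following is a minimal sketch using only the Python standard library; `drop_config_key` is a hypothetical helper (not part of PEFT), and the backup file name `adapter_config.json.bak` is an arbitrary choice made for this example.

```python
import json
from pathlib import Path


def drop_config_key(adapter_dir, key):
    """Delete `key` from adapter_config.json.

    Returns (was_present, removed_value) so the caller can inspect the value.
    """
    config_path = Path(adapter_dir) / "adapter_config.json"
    original_text = config_path.read_text(encoding="utf-8")
    config = json.loads(original_text)
    if key not in config:
        return False, None
    removed = config.pop(key)
    # Keep a copy of the untouched file so the change can be reverted.
    config_path.with_suffix(".json.bak").write_text(original_text, encoding="utf-8")
    config_path.write_text(json.dumps(config, indent=2, sort_keys=True), encoding="utf-8")
    return True, removed
```

The returned value is worth checking by hand: if it differs from the default for that argument, dropping the entry changes the adapter's behavior.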
+
+## Adapter handling
+
+### Using multiple adapters at the same time
+
+PEFT allows you to create more than one adapter on the same model. This can be useful in many situations. For example, for inference, you may want to serve two fine-tuned models from the same base model instead of loading the base model once for each fine-tuned model, which would cost more memory. Multiple adapters can also be activated at the same time, so that the model can leverage what all of those adapters have learned. As an example, if you have a diffusion model, you may want to use one LoRA adapter to change the style and a different one to change the subject.
+
+Activating multiple adapters at the same time is generally possible on all PEFT methods (LoRA, LoHa, IA³, etc.) except for prompt learning methods (p-tuning, prefix tuning, etc.). The following example illustrates how to achieve this:
+
+```python
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+
+model_id = ...
+base_model = AutoModelForCausalLM.from_pretrained(model_id)
+model = PeftModel.from_pretrained(base_model, lora_path_0)  # default adapter_name is 'default'
+model.load_adapter(lora_path_1, adapter_name="other")
+# the 'other' adapter was loaded but it's not active yet, so to activate both adapters:
+model.base_model.set_adapter(["default", "other"])
+```
+
+> [!TIP]
+> In the example above, you can see that we need to call `model.base_model.set_adapter(["default", "other"])`. Why can we not call `model.set_adapter(["default", "other"])`? This is unfortunately not possible because, as explained earlier, some PEFT methods don't support activating more than one adapter at a time.
+
+It is also possible to train two adapters at the same time, but you should be careful to ensure that the weights of both adapters are known to the optimizer. Otherwise, only one adapter will receive updates.
+
+```python
+from transformers import AutoModelForCausalLM
+from peft import LoraConfig, get_peft_model
+
+model_id = ...
+base_model = AutoModelForCausalLM.from_pretrained(model_id)
+lora_config_0 = LoraConfig(...)
+lora_config_1 = LoraConfig(...)
+model = get_peft_model(base_model, lora_config_0)
+model.add_adapter(adapter_name="other", peft_config=lora_config_1)
+```
+
+If we now called:
+
+```python
+from transformers import Trainer
+
+trainer = Trainer(model=model,  ...)
+trainer.train()
+```
+
+or
+
+```python
+import torch
+optimizer = torch.optim.AdamW([param for param in model.parameters() if param.requires_grad], ...)
+```
+
+then the second LoRA adapter (`"other"`) would not be trained. This is because it is inactive at this moment, which means the `requires_grad` attribute on its parameters is set to `False` and the optimizer will ignore it. Therefore, make sure to activate all adapters that should be trained _before_ initializing the optimizer:
+
+```python
+# activate all adapters
+model.base_model.set_adapter(["default", "other"])
+trainer = Trainer(model=model,  ...)
+trainer.train()
+```
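
Before creating the optimizer, it can help to confirm that every adapter you intend to train actually has trainable parameters. The sketch below is not a PEFT API: `adapters_with_trainable_params` is a hypothetical helper, and it assumes the adapter name appears as one dotted component of the parameter name (for LoRA, names of the form `...lora_A.other.weight`). It only relies on `named_parameters()` and `requires_grad`.

```python
def adapters_with_trainable_params(model, adapter_names):
    """Map each adapter name to True if any of its parameters requires grad."""
    trainable = {name: False for name in adapter_names}
    for param_name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        parts = param_name.split(".")
        for name in adapter_names:
            # Assumption: the adapter name is one dotted component of the
            # parameter name, e.g. "...lora_A.other.weight".
            if name in parts:
                trainable[name] = True
    return trainable
```

For the example above, `adapters_with_trainable_params(model, ["default", "other"])` should report `True` for both adapters once they have been activated. A `False` entry means the optimizer would skip that adapter.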
+
+> [!TIP]
+> This section deals with using multiple adapters _of the same type_ on the same model, for example, using multiple LoRA adapters at the same time. It does not apply to using _different types_ of adapters on the same model, for example one LoRA adapter and one LoHa adapter. For this, please check [`PeftMixedModel`](https://huggingface.co/docs/peft/developer_guides/mixed_models).
--- a/docs/source/developer_guides/troubleshooting.mdx
+++ b/docs/source/developer_guides/troubleshooting.mdx
@ -1,79 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Troubleshooting
-
-If you encounter any issue when using PEFT, please check the following list of common issues and their solutions.
-
-## Examples don't work
-
-Examples often rely on the most recent package versions, so please ensure they're up-to-date. In particular, check the version of the following packages:
-
- `peft`
- `transformers`
- `accelerate`
- `torch`
-
-In general, you can update the package version by running this command inside your Python environment:
-
-```bash
-python -m pip install -U <package_name>
-```
-
-Installing PEFT from source is useful for keeping up with the latest developments:
-
-```bash
-python -m pip install git+https://github.com/huggingface/peft
-```
-
-## Bad results from a loaded PEFT model
-
-There can be several reasons for getting a poor result from a loaded PEFT model, which are listed below. If you're still unable to troubleshoot the problem, see if anyone else had a similar [issue](https://github.com/huggingface/peft/issues) on GitHub, and if you can't find any, open a new issue.
-
-When opening an issue, it helps a lot if you provide a minimal code example that reproduces the issue. Also, please report if the loaded model performs at the same level as the model did before fine-tuning, if it performs at a random level, or if it is only slightly worse than expected. This information helps us identify the problem more quickly.
-
-### Random deviations
-
-If your model outputs are not exactly the same as previous runs, there could be an issue with random elements. For example:
-
-1. please ensure it is in `.eval()` mode, which is important, for instance, if the model uses dropout
-2. if you use [`~transformers.GenerationMixin.generate`] on a language model, there could be random sampling, so obtaining the same result requires setting a random seed
-3. if you used quantization and merged the weights, small deviations are expected due to rounding errors
-
-### Incorrectly loaded model
-
-Please ensure that you load the model correctly. A common error is trying to load a _trained_ model with `get_peft_model`, which is incorrect. Instead, the loading code should look like this:
-
-```python
-from peft import PeftModel, PeftConfig
-
-base_model = ...  # to load the base model, use the same code as when you trained it
-config = PeftConfig.from_pretrained(peft_model_id)
-peft_model = PeftModel.from_pretrained(base_model, peft_model_id)
-```
-
-### Randomly initialized layers
-
-For some tasks, it is important to correctly configure `modules_to_save` in the config to account for randomly initialized layers. 
-
-As an example, this is necessary if you use LoRA to fine-tune a language model for sequence classification because 🤗 Transformers adds a randomly initialized classification head on top of the model. If you do not add this layer to `modules_to_save`, the classification head won't be saved. The next time you load the model, you'll get a _different_ randomly initialized classification head, resulting in completely different results.
-
-In PEFT, we try to correctly guess the `modules_to_save` if you provide the `task_type` argument in the config. This should work for transformers models that follow the standard naming scheme. It is always a good idea to double check though because we can't guarantee all models follow the naming scheme.
-
-When you load a transformers model that has randomly initialized layers, you should see a warning along the lines of:
-
-```
-Some weights of <MODEL> were not initialized from the model checkpoint at <ID> and are newly initialized: [<LAYER_NAMES>].
-You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
-```
-
-The mentioned layers should be added to `modules_to_save` in the config to avoid the described problem.
--- a/docs/source/index.md
+++ b/docs/source/index.md
@ -0,0 +1,49 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# PEFT
+
+🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of a model's parameters, which is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters, significantly decreasing computational and storage costs, while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware.
+
+PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to provide a faster and easier way to load, train, and use large models for inference.
+
+<div class="mt-10">
+  <div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
+    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="quicktour"
+      ><div class="w-full text-center bg-gradient-to-br from-blue-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Quicktour</div>
+      <p class="text-gray-700">Start here if you're new to 🤗 PEFT to get an overview of the library's main features, and how to train a model with a PEFT method.</p>
+    </a>
+    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./task_guides/prompt_based_methods"
+      ><div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">How-to guides</div>
+      <p class="text-gray-700">Practical guides demonstrating how to apply various PEFT methods across different types of tasks like image classification, causal language modeling, automatic speech recognition, and more. Learn how to use 🤗 PEFT with the DeepSpeed and Fully Sharded Data Parallel scripts.</p>
+    </a>
+    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./conceptual_guides/adapter"
+      ><div class="w-full text-center bg-gradient-to-br from-pink-400 to-pink-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Conceptual guides</div>
+      <p class="text-gray-700">Get a better theoretical understanding of how LoRA and various soft prompting methods help reduce the number of trainable parameters to make training more efficient.</p>
+    </a>
+    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./package_reference/config"
+      ><div class="w-full text-center bg-gradient-to-br from-purple-400 to-purple-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Reference</div>
+      <p class="text-gray-700">Technical descriptions of how 🤗 PEFT classes and methods work.</p>
+    </a>
+  </div>
+</div>
+
+<iframe
+	src="https://stevhliu-peft-methods.hf.space"
+	frameborder="0"
+	width="850"
+	height="620"
+></iframe>
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@ -1,138 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# PEFT
-
-🤗 PEFT, or Parameter-Efficient Fine-Tuning (PEFT), is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. 
-PEFT methods only fine-tune a small number of (extra) model parameters, significantly decreasing computational and storage costs because fine-tuning large-scale PLMs is prohibitively costly.
-Recent state-of-the-art PEFT techniques achieve performance comparable to that of full fine-tuning.
-
-PEFT is seamlessly integrated with 🤗 Accelerate for large-scale models leveraging DeepSpeed and [Big Model Inference](https://huggingface.co/docs/accelerate/usage_guides/big_modeling).
-
-<div class="mt-10">
-  <div class="w-full flex flex-col space-y-4 md:space-y-0 md:grid md:grid-cols-2 md:gap-y-4 md:gap-x-5">
-    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="quicktour"
-      ><div class="w-full text-center bg-gradient-to-br from-blue-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Get started</div>
-      <p class="text-gray-700">Start here if you're new to 🤗 PEFT to get an overview of the library's main features, and how to train a model with a PEFT method.</p>
-    </a>
-    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./task_guides/image_classification_lora"
-      ><div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">How-to guides</div>
-      <p class="text-gray-700">Practical guides demonstrating how to apply various PEFT methods across different types of tasks like image classification, causal language modeling, automatic speech recognition, and more. Learn how to use 🤗 PEFT with the DeepSpeed and Fully Sharded Data Parallel scripts.</p>
-    </a>
-    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./conceptual_guides/lora"
-      ><div class="w-full text-center bg-gradient-to-br from-pink-400 to-pink-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Conceptual guides</div>
-      <p class="text-gray-700">Get a better theoretical understanding of how LoRA and various soft prompting methods help reduce the number of trainable parameters to make training more efficient.</p>
-   </a>
-    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./package_reference/config"
-      ><div class="w-full text-center bg-gradient-to-br from-purple-400 to-purple-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Reference</div>
-      <p class="text-gray-700">Technical descriptions of how 🤗 PEFT classes and methods work.</p>
-    </a>
-  </div>
-</div>
-
-## Supported methods
-
-1. LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
-2. Prefix Tuning: [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://aclanthology.org/2021.acl-long.353/), [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
-3. P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
-4. Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf) 
-5. AdaLoRA: [Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.10512) 
-6. [LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention](https://github.com/ZrrSkywalker/LLaMA-Adapter)
-7. IA3: [Infused Adapter by Inhibiting and Amplifying Inner Activations](https://arxiv.org/abs/2205.05638)
-
-## Supported models
-
-The tables provided below list the PEFT methods and models supported for each task. To apply a particular PEFT method for 
-a task, please refer to the corresponding Task guides.
-
-### Causal Language Modeling
-
-| Model        | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-|--------------| ---- | ---- | ---- | ----  | ----  |
-| GPT-2        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| Bloom        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| OPT          | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-Neo      | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-J        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-NeoX-20B | ✅  | ✅  | ✅  | ✅  | ✅  |
-| LLaMA        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| ChatGLM      | ✅  | ✅  | ✅  | ✅  | ✅  |
-
-### Conditional Generation
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ---- | ---- |
-| T5        | ✅   | ✅   | ✅   | ✅   | ✅   |
-| BART      | ✅   | ✅   | ✅   | ✅   | ✅   |
-
-### Sequence Classification
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| BERT           | ✅  | ✅  | ✅  | ✅  | ✅  |  
-| RoBERTa        | ✅  | ✅  | ✅  | ✅  | ✅  |
-| GPT-2          | ✅  | ✅  | ✅  | ✅  |   | 
-| Bloom          | ✅  | ✅  | ✅  | ✅  |   |
-| OPT            | ✅  | ✅  | ✅  | ✅  |   |
-| GPT-Neo        | ✅  | ✅  | ✅  | ✅  |   |
-| GPT-J          | ✅  | ✅  | ✅  | ✅  |   |
-| Deberta        | ✅  |     | ✅  | ✅  |   | 
-| Deberta-v2     | ✅  |     | ✅  | ✅  |   |    
-
-### Token Classification
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | --- |
-| BERT           | ✅  | ✅  |   |   |   |  
-| RoBERTa        | ✅  | ✅  |   |   |   |
-| GPT-2          | ✅  | ✅  |   |   |   | 
-| Bloom          | ✅  | ✅  |   |   |   |
-| OPT            | ✅  | ✅  |   |   |   |
-| GPT-Neo        | ✅  | ✅  |   |   |   |
-| GPT-J          | ✅  | ✅  |   |   |   |
-| Deberta        | ✅  |     |   |   |    |
-| Deberta-v2     | ✅  |     |   |   |   |
-
-### Text-to-Image Generation
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| Stable Diffusion           | ✅  |   |   |   |   |  
-
-
-### Image Classification
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  | ----  |
-| ViT           | ✅  |   |   |   |   | 
-| Swin           | ✅  |   |   |   |   | 
-
-### Image to text (Multi-modal models)
-
-We have tested LoRA for [ViT](https://huggingface.co/docs/transformers/model_doc/vit) and [Swin](https://huggingface.co/docs/transformers/model_doc/swin) for fine-tuning on image classification. 
-However, it should be possible to use LoRA for any [ViT-based model](https://huggingface.co/models?pipeline_tag=image-classification&sort=downloads&search=vit) from 🤗 Transformers. 
-Check out the [Image classification](/task_guides/image_classification_lora) task guide to learn more. If you run into problems, please open an issue.
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ---- |
-| Blip-2           | ✅  |   |   |   |   | 
- 
-
-### Semantic Segmentation
-
-As with image-to-text models, you should be able to apply LoRA to any of the [segmentation models](https://huggingface.co/models?pipeline_tag=image-segmentation&sort=downloads). 
-It's worth noting that we haven't tested this with every architecture yet. Therefore, if you come across any issues, kindly create an issue report.
-
-|   Model         | LoRA | Prefix Tuning  | P-Tuning | Prompt Tuning  | IA3 |
-| --------- | ---- | ---- | ---- | ----  | ----  |
-| SegFormer           | ✅  |   |   |   |   |
-
--- a/docs/source/install.mdx
+++ b/docs/source/install.mdx
@ -8,11 +8,15 @@ http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
 -->

 # Installation

-Before you start, you will need to setup your environment, install the appropriate packages, and configure 🤗 PEFT. 🤗 PEFT is tested on **Python 3.8+**.
+Before you start, you will need to set up your environment, install the appropriate packages, and configure 🤗 PEFT. 🤗 PEFT is tested on **Python 3.9+**.

 🤗 PEFT is available on PyPI, as well as GitHub:

@ -39,5 +43,5 @@ repository:
 ```bash
 git clone https://github.com/huggingface/peft
 cd peft
-pip install -e .
+pip install -e ".[test]"
 ```
--- a/docs/source/package_reference/adalora.md
+++ b/docs/source/package_reference/adalora.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# AdaLoRA
+
+[AdaLoRA](https://hf.co/papers/2303.10512) is a method for optimizing the number of trainable parameters to assign to weight matrices and layers, unlike LoRA, which distributes parameters evenly across all modules. More parameters are budgeted for important weight matrices and layers while less important ones receive fewer parameters.
+
+The abstract from the paper is:
+
+*Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA*.
+
+## AdaLoraConfig
+
+[[autodoc]] tuners.adalora.config.AdaLoraConfig
+
+## AdaLoraModel
+
+[[autodoc]] tuners.adalora.model.AdaLoraModel
--- a/docs/source/package_reference/adapter_utils.md
+++ b/docs/source/package_reference/adapter_utils.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# LyCORIS
+
+[LyCORIS](https://hf.co/papers/2309.14859) (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) are LoRA-like matrix decomposition adapters that modify the cross-attention layer of the UNet. The [LoHa](loha) and [LoKr](lokr) methods inherit from the `Lycoris` classes here.
+
+## LycorisConfig
+
+[[autodoc]] tuners.lycoris_utils.LycorisConfig
+
+## LycorisLayer
+
+[[autodoc]] tuners.lycoris_utils.LycorisLayer
+
+## LycorisTuner
+
+[[autodoc]] tuners.lycoris_utils.LycorisTuner
--- a/docs/source/package_reference/auto_class.md
+++ b/docs/source/package_reference/auto_class.md
@ -0,0 +1,48 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# AutoPeftModels
+
+The `AutoPeftModel` classes load the appropriate PEFT model for the task type by automatically inferring it from the configuration file. They are designed to quickly and easily load a PEFT model in a single line of code without having to worry about which exact model class you need or manually loading a [`PeftConfig`].
+
+## AutoPeftModel
+
+[[autodoc]] auto.AutoPeftModel
+    - from_pretrained
+
+## AutoPeftModelForCausalLM
+
+[[autodoc]] auto.AutoPeftModelForCausalLM
+
+## AutoPeftModelForSeq2SeqLM
+
+[[autodoc]] auto.AutoPeftModelForSeq2SeqLM
+
+## AutoPeftModelForSequenceClassification
+
+[[autodoc]] auto.AutoPeftModelForSequenceClassification
+
+## AutoPeftModelForTokenClassification
+
+[[autodoc]] auto.AutoPeftModelForTokenClassification
+
+## AutoPeftModelForQuestionAnswering
+
+[[autodoc]] auto.AutoPeftModelForQuestionAnswering
+
+## AutoPeftModelForFeatureExtraction
+
+[[autodoc]] auto.AutoPeftModelForFeatureExtraction
--- a/docs/source/package_reference/boft.md
+++ b/docs/source/package_reference/boft.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# BOFT
+
+[Orthogonal Butterfly (BOFT)](https://hf.co/papers/2311.06243) is a generic method designed for finetuning foundation models. It improves the parameter efficiency of Orthogonal Finetuning (OFT) by taking inspiration from the Cooley-Tukey fast Fourier transform, and it shows favorable results when finetuning different foundation models, including large vision transformers, large language models and text-to-image diffusion models.
+
+The abstract from the paper is:
+
+*Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language*.
+
+## BOFTConfig
+
+[[autodoc]] tuners.boft.config.BOFTConfig
+
+## BOFTModel
+
+[[autodoc]] tuners.boft.model.BOFTModel
--- a/docs/source/package_reference/bone.md
+++ b/docs/source/package_reference/bone.md
@ -0,0 +1,33 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Bone
+
+Dimension-Sharding Adaptation ([DiSHA](https://huggingface.co/papers/2409.15371)) expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Building on DiSHA, the authors propose an efficient structure called Block Affine Adaptation (Bone) and a non-linear update method called Block Affine Transformation Adaptation (BAT).
+
+
+The abstract from the paper is:
+
+Low-Rank Adaptation (LoRA) leverages the low intrinsic rank of weight updates in Large Language Models (LLMs), establishing a Parameter-Efficient Fine-Tuning (PEFT) paradigm. However, LoRA suffers from slow convergence. We introduce Dimension-Sharding Adaptation (DiSHA), which expands the PEFT design space to unlock lower intrinsic ranks and faster convergence by default. Within DiSHA's design space, we propose Block Affine Adaptation (Bone), a computationally efficient structure that delivers both high performance and efficiency. While certain DiSHA configurations may result in colinear updates to weight shards, we address this with Block Affine Transformation Adaptation (BAT), a nonlinear variant of DiSHA. BAT introduces nonlinearity by combining trainable matrices with original weight shards in a nonlinear manner, inducing nonlinearity in matrix updates without introducing additional parameters. Empirical results show that Bone, under the DiSHA framework, consistently outperforms LoRA variants in both NLG and NLU tasks, with significantly improved computational efficiency. Further analysis demonstrates that BAT enhances model capabilities by leveraging its nonlinear design.
+
+
+## BoneConfig
+
+[[autodoc]] tuners.bone.config.BoneConfig
+
+## BoneModel
+
+[[autodoc]] tuners.bone.model.BoneModel
--- a/docs/source/package_reference/c3a.md
+++ b/docs/source/package_reference/c3a.md
@ -0,0 +1,43 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# C3A: Parameter-Efficient Fine-Tuning via Circular Convolution
+
+[C3A](https://huggingface.co/papers/2407.19342) is a parameter-efficient fine-tuning technique that leverages Circular Convolution to achieve high rank adaptation within reasonable resource limits.
+
+Note that you should use a much larger learning rate (LR) for C3A than for other methods. For example, an LR of 1e-1 is a good starting point for C3A. In addition, use a much smaller weight decay. You can refer to the `method_comparison` folder for more details.
+
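+The `block_size` affects both the number of tunable parameters and performance. As a starting point, you can choose a common divisor of $d_1$ and $d_2$ (that is, a divisor of $\mathrm{gcd}(d_1,d_2)$) near $\frac{\sqrt{d_1\times d_2}}{r}$, where $d_1$ and $d_2$ are the input and output sizes of the target layer and $r$ is the rank for LoRA you would use for this task.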
+
+C3A currently has the following constraints:
+
+- Only `nn.Linear` layers are supported.
+- Quantized layers are not supported.
+- The block size should be a common divisor of both the input and output sizes of target layers. 
+
+If these constraints don't work for your use case, consider other methods instead.
+
+The abstract from the paper is:
+
+> Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (i.e., $\Delta \mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose Circular Convolution Adaptation (C3A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C3A consistently outperforms LoRA and its variants across various fine-tuning tasks. 
+
+## C3AConfig
+
+[[autodoc]] tuners.c3a.config.C3AConfig
+
+## C3AModel
+
+[[autodoc]] tuners.c3a.model.C3AModel
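
Reviewer note on the `block_size` guidance in the C3A page above: the following is a minimal sketch, assuming the recommendation means "pick the common divisor of the layer's input and output sizes that lies closest to sqrt(d1 * d2) / r". `common_divisors` and `suggest_block_size` are hypothetical helpers written for illustration and are not part of PEFT; the layer shapes in the demo are made up.

```python
import math


def common_divisors(d1: int, d2: int) -> list[int]:
    """Return all common divisors of d1 and d2 in ascending order."""
    g = math.gcd(d1, d2)
    return [k for k in range(1, g + 1) if g % k == 0]


def suggest_block_size(d1: int, d2: int, r: int) -> int:
    """Pick the common divisor of d1 and d2 closest to sqrt(d1 * d2) / r.

    Hypothetical helper: it only encodes the starting-point heuristic from
    the text, so the result should still be tuned per task.
    """
    target = math.sqrt(d1 * d2) / r
    return min(common_divisors(d1, d2), key=lambda k: abs(k - target))


if __name__ == "__main__":
    # Made-up layer shape (4096 x 11008) and a LoRA-equivalent rank of 32.
    block_size = suggest_block_size(4096, 11008, 32)
    print(f"suggested block_size: {block_size}")
```

Because the candidate is always drawn from the common divisors, it satisfies the constraint that the block size must divide both the input and output sizes of the target layer.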
--- a/docs/source/package_reference/config.md
+++ b/docs/source/package_reference/config.md
@ -0,0 +1,22 @@
+<!--⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Configuration
+
+[`PeftConfigMixin`] is the base configuration class for storing the adapter configuration of a [`PeftModel`], and [`PromptLearningConfig`] is the base configuration class for soft prompt methods (p-tuning, prefix tuning, and prompt tuning). These base classes contain methods for saving and loading model configurations from the Hub, specifying the PEFT method to use, type of task to perform, and model configurations like number of layers and number of attention heads.
+
+## PeftConfigMixin
+
+[[autodoc]] config.PeftConfigMixin
+    - all
+
+## PeftConfig
+
+[[autodoc]] PeftConfig
+    - all
+
+## PromptLearningConfig
+
+[[autodoc]] PromptLearningConfig
+    - all
--- a/docs/source/package_reference/config.mdx
+++ b/docs/source/package_reference/config.mdx
@ -1,18 +0,0 @@
-# Configuration
-
-The configuration classes stores the configuration of a [`PeftModel`], PEFT adapter models, and the configurations of [`PrefixTuning`], [`PromptTuning`], and [`PromptEncoder`]. They contain methods for saving and loading model configurations from the Hub, specifying the PEFT method to use, type of task to perform, and model configurations like number of layers and number of attention heads.
-
-## PeftConfigMixin
-
-[[autodoc]] config.PeftConfigMixin
-    - all
-
-## PeftConfig
-
-[[autodoc]] PeftConfig
-    - all
-
-## PromptLearningConfig
-
-[[autodoc]] PromptLearningConfig
-    - all
--- a/docs/source/package_reference/cpt.md
+++ b/docs/source/package_reference/cpt.md
@ -0,0 +1,34 @@
+<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Context-aware Prompt Tuning: Advancing In-Context Learning with Adversarial Methods
+
+[CPT](https://huggingface.co/papers/2410.17222) combines In-Context Learning (ICL), Prompt Tuning (PT), and adversarial optimization to improve few-shot learning by refining context embeddings. CPT updates the context tokens by optimizing both the context and the training examples, encapsulating them into a novel loss design that minimizes overfitting, enables more effective optimization, and drives significant improvements in classification tasks.
+
+[//]: # ([CPT]&#40;https://huggingface.co/papers/2410.17222&#41; for the paper)
+
+The abstract from the paper is:
+
+> Large Language Models (LLMs) can perform few-shot learning using either optimization-based approaches or In-Context Learning (ICL). Optimization-based methods often suffer from overfitting, as they require updating a large number of parameters with limited data. In contrast, ICL avoids overfitting but typically underperforms compared to optimization-based methods and is highly sensitive to the selection, order, and format of demonstration examples. To overcome these challenges, we introduce Context-aware Prompt Tuning (CPT), a method inspired by ICL, Prompt Tuning (PT), and adversarial attacks. CPT builds on the ICL strategy of concatenating examples before the input, extending it by incorporating PT-like learning to refine the context embedding through iterative optimization, extracting deeper insights from the training examples. Our approach carefully modifies specific context tokens, considering the unique structure of the examples within the context. In addition to updating the context with PT-like optimization, CPT draws inspiration from adversarial attacks, adjusting the input based on the labels present in the context while preserving the inherent value of the user-provided data. To ensure robustness and stability during optimization, we employ a projected gradient descent algorithm, constraining token embeddings to remain close to their original values and safeguarding the quality of the context. Our method has demonstrated superior accuracy across multiple classification tasks using various LLM models, outperforming existing baselines and effectively addressing the overfitting challenge in few-shot learning.
+
+
+Take a look at the [example](https://github.com/huggingface/peft/blob/main/examples/cpt_finetuning/README.md) for a step-by-step guide on how to train a model with CPT.
+
+
+## CPTConfig
+
+[[autodoc]] tuners.cpt.config.CPTConfig
+
+## CPTEmbedding
+
+[[autodoc]] tuners.cpt.model.CPTEmbedding
+
--- a/docs/source/package_reference/fourierft.md
+++ b/docs/source/package_reference/fourierft.md
@ -0,0 +1,38 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# FourierFT: Discrete Fourier Transformation Fine-Tuning
+
+[FourierFT](https://huggingface.co/papers/2405.03003) is a parameter-efficient fine-tuning technique that leverages the Discrete Fourier Transform to compress the model's tunable weights. This method outperforms LoRA on the GLUE benchmark and common ViT classification tasks while using far fewer parameters.
+
+FourierFT currently has the following constraints:
+
+- Only `nn.Linear` layers are supported.
+- Quantized layers are not supported.
+
+If these constraints don't work for your use case, consider other methods instead.
+
+The abstract from the paper is:
+
+> Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models. It effectively reduces the number of trainable parameters by incorporating low-rank matrices A and B to represent the weight change, i.e., Delta W=BA. Despite LoRA's progress, it faces storage challenges when handling extensive customization adaptations or larger base models. In this work, we aim to further compress trainable parameters by enjoying the powerful expressiveness of the Fourier transform. Specifically, we introduce FourierFT, which treats Delta W as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients. With the trained spectral coefficients, we implement the inverse discrete Fourier transform to recover Delta W. Empirically, our FourierFT method shows comparable or better performance with fewer parameters than LoRA on various tasks, including natural language understanding, natural language generation, instruction tuning, and image classification. For example, when performing instruction tuning on the LLaMA2-7B model, FourierFT surpasses LoRA with only 0.064M trainable parameters, compared to LoRA's 33.5M.
+
+## FourierFTConfig
+
+[[autodoc]] tuners.fourierft.config.FourierFTConfig
+
+## FourierFTModel
+
+[[autodoc]] tuners.fourierft.model.FourierFTModel
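
Reviewer note on the FourierFT page above: the abstract describes learning only a few spectral coefficients and recovering Delta W with an inverse discrete Fourier transform. The toy sketch below illustrates that idea in pure Python. The function name, the dictionary representation of the sparse spectrum, the 1 / (d1 * d2) normalization, and the choice to keep only the real part are assumptions made for illustration; this is not PEFT's implementation and it omits any scaling factor.

```python
import cmath


def inverse_dft2_real(
    spectrum: dict[tuple[int, int], float], d1: int, d2: int
) -> list[list[float]]:
    """Real part of the 2D inverse DFT of a sparse d1 x d2 spectral matrix.

    `spectrum` maps (row, col) frequency indices to trained coefficients;
    every frequency that is not listed is treated as zero.
    """
    delta_w = [[0.0] * d2 for _ in range(d1)]
    for m in range(d1):
        for n in range(d2):
            total = 0j
            for (u, v), coeff in spectrum.items():
                angle = 2 * cmath.pi * (u * m / d1 + v * n / d2)
                total += coeff * cmath.exp(1j * angle)
            delta_w[m][n] = (total / (d1 * d2)).real
    return delta_w


if __name__ == "__main__":
    # Two trained values expand into a dense 4 x 4 weight update.
    learned = {(0, 0): 8.0, (1, 2): 0.5}
    delta_w = inverse_dft2_real(learned, d1=4, d2=4)
    print(f"{len(learned)} trained values -> {len(delta_w)}x{len(delta_w[0])} update")
```

The point of the sketch is the parameter count: the number of trained values is the size of `spectrum`, independent of `d1 * d2`, while the recovered update is dense.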
--- a/docs/source/package_reference/functional.md
+++ b/docs/source/package_reference/functional.md
@ -0,0 +1,37 @@
+<!--⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Functions for PEFT integration
+
+A collection of functions that could be useful for non-PeftModel models, e.g. for the transformers or diffusers integration.
+
+The functions provided here can be considered "public API" of PEFT and hence are safe to be used by packages that provide PEFT integrations.
+
+## Cast the adapter weight dtypes
+[[autodoc]] functional.cast_adapter_dtype
+    - all
+
+## Delete the PEFT adapter from model
+[[autodoc]] functional.delete_adapter
+    - all
+
+## Get the state dict of the PEFT adapter
+[[autodoc]] functional.get_peft_model_state_dict
+    - all
+
+## Inject a PEFT adapter into the model based on a PEFT config
+[[autodoc]] functional.inject_adapter_in_model
+    - all
+
+## Set the active PEFT adapter(s) of the model
+[[autodoc]] functional.set_adapter
+    - all
+
+## Set the `requires_grad` attribute of the specified adapters
+[[autodoc]] functional.set_requires_grad
+    - all
+
+## Load the weights of the PEFT state dict into the model
+[[autodoc]] functional.set_peft_model_state_dict
+    - all
--- a/docs/source/package_reference/helpers.md
+++ b/docs/source/package_reference/helpers.md
@ -0,0 +1,22 @@
+<!--⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Helper methods
+
+A collection of helper functions for PEFT.
+
+## Checking if a model is a PEFT model
+
+[[autodoc]] helpers.check_if_peft_model
+    - all
+
+## Temporarily Rescaling Adapter Scale in LoraLayer Modules
+
+[[autodoc]] helpers.rescale_adapter_scale
+    - all
+
+## Context manager to disable input dtype casting in the `forward` method of LoRA layers
+
+[[autodoc]] helpers.disable_input_dtype_casting
+    - all
--- a/docs/source/package_reference/hotswap.md
+++ b/docs/source/package_reference/hotswap.md
@ -0,0 +1,76 @@
+<!--⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
+# Hotswapping adapters
+
+The idea of hotswapping an adapter is the following: We can already load multiple adapters, e.g. two LoRAs, at the same time. But sometimes, we want to load one LoRA and then replace its weights in-place with the LoRA weights of another adapter. This is now possible with the `hotswap_adapter` function.
+
+In general, this should be faster than deleting one adapter and loading the other adapter in its place, which is how you would achieve the same final outcome without hotswapping. Another advantage of hotswapping is that it prevents re-compilation in case the PEFT model is already compiled using `torch.compile`. This can save quite a lot of time.
+
+## Example without `torch.compile`
+
+```python
+import torch
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+from peft.utils.hotswap import hotswap_adapter
+
+model_id = ...
+inputs = ...
+device = ...
+model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
+
+# load lora 0
+model = PeftModel.from_pretrained(model, "<path-adapter-0>")
+with torch.inference_mode():
+    output_adapter_0 = model(inputs).logits
+
+# replace the "default" lora adapter with the new one
+hotswap_adapter(model, "<path-adapter-1>", adapter_name="default", torch_device=device)
+with torch.inference_mode():
+    output_adapter_1 = model(inputs).logits
+```
+
+## Example with `torch.compile`
+
+```python
+import torch
+from transformers import AutoModelForCausalLM
+from peft import PeftModel
+from peft.utils.hotswap import hotswap_adapter, prepare_model_for_compiled_hotswap
+
+model_id = ...
+inputs = ...
+device = ...
+max_rank = ...  # maximum rank among all LoRA adapters that will be used
+model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
+
+# load lora 0
+model = PeftModel.from_pretrained(model, "<path-adapter-0>")
+# Prepare the model to allow hotswapping even if ranks/scalings of 2nd adapter differ.
+# You can skip this step if all ranks and scalings are identical.
+prepare_model_for_compiled_hotswap(model, target_rank=max_rank)
+model = torch.compile(model)
+with torch.inference_mode():
+    output_adapter_0 = model(inputs).logits
+
+# replace the "default" lora adapter with the new one
+hotswap_adapter(model, "<path-adapter-1>", adapter_name="default", torch_device=device)
+with torch.inference_mode():
+    output_adapter_1 = model(inputs).logits
+```
+
+## Caveats
+
+Hotswapping works with transformers models and diffusers models. However, there are some caveats:
+
+- Right now, only LoRA is properly supported.
+- It only works for the same PEFT method, so no swapping LoRA and LoHa, for example.
+- The adapter that is being swapped in must target the same layers as the previous adapter or a subset of those layers. It cannot target new layers. Therefore, if possible, start with the adapter that targets most layers.
+
+[[autodoc]] utils.hotswap.hotswap_adapter
+    - all
+
+[[autodoc]] utils.hotswap.hotswap_adapter_from_state_dict
+    - all
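
Reviewer note on the hotswapping caveats above: the rule that a swapped-in adapter may only target the same layers as the loaded adapter, or a subset of them, can be checked before calling `hotswap_adapter`. The sketch below works on plain sets of target module names. `can_hotswap` and `pick_first_adapter` are hypothetical helpers written for illustration, not PEFT functions, and the adapter names and module names in the demo are made up.

```python
def can_hotswap(current_targets: set[str], incoming_targets: set[str]) -> bool:
    """True if the incoming adapter targets only layers the loaded adapter targets."""
    return incoming_targets <= current_targets


def pick_first_adapter(adapters: dict[str, set[str]]) -> str:
    """Choose the adapter that targets the most layers, to be loaded first."""
    return max(adapters, key=lambda name: len(adapters[name]))


if __name__ == "__main__":
    # Made-up adapters described only by the module names they target.
    adapters = {
        "adapter_a": {"q_proj", "v_proj"},
        "adapter_b": {"q_proj", "k_proj", "v_proj"},
    }
    first = pick_first_adapter(adapters)
    for name, targets in adapters.items():
        if name != first:
            print(name, "can replace", first, ":", can_hotswap(adapters[first], targets))
```

Loading the adapter with the most targets first is only a heuristic: it does not guarantee that every other adapter targets a subset, so the subset check is still needed for each swap.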
--- a/docs/source/package_reference/hra.md
+++ b/docs/source/package_reference/hra.md
@ -0,0 +1,32 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation (HRA)
+
+[HRA](https://huggingface.co/papers/2405.17484) is a simple but effective adapter-based fine-tuning method that leverages Householder reflections. It combines the advantages of low-rank and orthogonal adaptation, reducing parameters and computation costs while penalizing the loss of pre-training knowledge. It consistently achieves better performance with fewer trainable parameters and outperforms state-of-the-art adapters across different models, including large language models (LLMs) and conditional image generators.
+
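To make the idea of a chain of Householder reflections concrete, here is a small plain-Python sketch. It is an illustration under simplifying assumptions (tiny matrices, arbitrary reflection vectors, helper names chosen for this example) and does not reproduce PEFT's implementation.

```python
def householder(u):
    """Return H = I - 2 u u^T / (u^T u) as a list of rows."""
    norm_sq = sum(x * x for x in u)
    n = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / norm_sq for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Two "learnable" reflection vectors (arbitrary values for illustration).
vectors = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]

# Chain the reflections: Q = H_1 H_2.
q = householder(vectors[0])
for u in vectors[1:]:
    q = matmul(q, householder(u))

# A frozen 3x3 weight and its adapted version W' = W Q.
w = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [4.0, 0.0, 5.0]]
w_adapted = matmul(w, q)

# Q is orthogonal, so Q Q^T should be (numerically) the identity.
qqt = matmul(q, transpose(q))
```

Because `q` is orthogonal, the row norms of `w` are preserved in `w_adapted`. Each reflection differs from the identity by a rank-one term, so the chain differs from the identity by a matrix of rank at most the number of reflections, which is the link to low-rank adaptation mentioned in the abstract.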
+
+The abstract from the paper is:
+
+> While following different technical routes, both low-rank and orthogonal adaptation techniques can efficiently adapt large-scale pre-training models in specific tasks or domains based on a small piece of trainable parameters. In this study, we bridge the gap between these two techniques, proposing a simple but effective adaptation method based on Householder reflections. Given a pre-trained model, our method fine-tunes its layers by multiplying each frozen weight matrix with an orthogonal matrix constructed by a chain of learnable Householder reflections (HRs). This HR-based orthogonal fine-tuning is equivalent to an adaptive low-rank adaptation. Moreover, we show that the orthogonality of the reflection planes corresponding to the HRs impacts the model capacity and regularity. The analysis motivates us to regularize the orthogonality of the HRs, leading to different implementations of the proposed Householder reflection adaptation (HRA) method. Compared with state-of-the-art methods, HRA achieves superior performance with fewer learnable parameters when adapting large language models and conditional image generators. The code is available at [peft](https://github.com/huggingface/peft/tree/main/src/peft/tuners/hra) and [HRA](https://github.com/DaShenZi721/HRA).
+
+## HRAConfig
+
+[[autodoc]] tuners.hra.config.HRAConfig
+
+## HRAModel
+
+[[autodoc]] tuners.hra.model.HRAModel
--- a/docs/source/package_reference/ia3.md
+++ b/docs/source/package_reference/ia3.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# IA3
+
+Infused Adapter by Inhibiting and Amplifying Inner Activations, or [IA3](https://hf.co/papers/2205.05638), is a method that adds three learned vectors to rescale the keys and values of the self-attention and encoder-decoder attention layers, and the intermediate activation of the position-wise feed-forward network.
+
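The rescaling itself is an element-wise multiplication. The sketch below uses plain Python lists in place of tensors; the `scale` helper and the numbers are illustrative assumptions, not PEFT code.

```python
def scale(activations, learned_vector):
    """Element-wise rescaling of one activation vector by a learned vector."""
    return [a * s for a, s in zip(activations, learned_vector)]


keys = [0.5, -1.0, 2.0]

# A vector of ones leaves the frozen model's activations unchanged.
vector_at_init = [1.0, 1.0, 1.0]
unchanged = scale(keys, vector_at_init)

# After training, entries below 1 inhibit and entries above 1 amplify.
vector_trained = [0.0, 1.0, 1.5]
rescaled = scale(keys, vector_trained)
```

Only the scaling vectors are trained, so the number of new parameters grows with the activation dimension rather than with the size of the weight matrices.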
+The abstract from the paper is:
+
+*Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new PEFT method called (IA)^3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available*.
+
+## IA3Config
+
+[[autodoc]] tuners.ia3.config.IA3Config
+
+## IA3Model
+
+[[autodoc]] tuners.ia3.model.IA3Model
--- a/docs/source/package_reference/layernorm_tuning.md
+++ b/docs/source/package_reference/layernorm_tuning.md
@ -0,0 +1,34 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# LayerNorm Tuning
+
+LayerNorm Tuning ([LN Tuning](https://huggingface.co/papers/2312.11420)) is a PEFT method that only fine-tunes the parameters of the LayerNorm layers in a model.
+The paper has tested the performance of this method on large language models and has shown that it can achieve strong performance with a significant reduction in the number of trainable parameters and GPU memory usage.
+However, the method is not limited to language models and can be applied to any model that uses LayerNorm layers.
+In this implementation, all LayerNorm layers inside a model are fine-tuned by default, but the method can also target other layer types such as `MLP` or `Attention` layers by specifying `target_modules` in the `LNTuningConfig`.
+
+The abstract from the paper is:
+
+*This paper introduces an efficient strategy to transform Large Language Models (LLMs) into Multi-Modal Large Language Models (MLLMs). By conceptualizing this transformation as a domain adaptation process, i.e., transitioning from text understanding to embracing multiple modalities, we intriguingly note that, within each attention block, tuning LayerNorm suffices to yield strong performance. Moreover, when benchmarked against other tuning approaches like full parameter finetuning or LoRA, its benefits on efficiency are substantial. For example, when compared to LoRA on a 13B model scale, performance can be enhanced by an average of over 20% across five multi-modal tasks, and meanwhile, results in a significant reduction of trainable parameters by 41.9% and a decrease in GPU memory usage by 17.6%. On top of this LayerNorm strategy, we showcase that selectively tuning only with conversational data can improve efficiency further. Beyond these empirical outcomes, we provide a comprehensive analysis to explore the role of LayerNorm in adapting LLMs to the multi-modal domain and improving the expressive power of the model.*
+
+## LNTuningConfig
+
+[[autodoc]] tuners.ln_tuning.config.LNTuningConfig
+
+## LNTuningModel
+
+[[autodoc]] tuners.ln_tuning.model.LNTuningModel
--- a/docs/source/package_reference/llama_adapter.md
+++ b/docs/source/package_reference/llama_adapter.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Llama-Adapter
+
+[Llama-Adapter](https://hf.co/papers/2303.16199) is a PEFT method specifically designed for turning Llama into an instruction-following model. The Llama model is frozen and only a set of adaptation prompts prefixed to the input instruction tokens are learned. Since randomly initialized modules inserted into the model can cause the model to lose some of its existing knowledge, Llama-Adapter uses zero-initialized attention with zero gating to progressively add the instructional prompts to the model.
+
+The abstract from the paper is:
+
+*We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the input text tokens at higher transformer layers. Then, a zero-init attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With efficient training, LLaMA-Adapter generates high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Furthermore, our approach can be simply extended to multi-modal input, e.g., images, for image-conditioned LLaMA, which achieves superior reasoning capacity on ScienceQA. We release our code at https://github.com/ZrrSkywalker/LLaMA-Adapter*.
+
+## AdaptionPromptConfig
+
+[[autodoc]] tuners.adaption_prompt.config.AdaptionPromptConfig
+
+## AdaptionPromptModel
+
+[[autodoc]] tuners.adaption_prompt.model.AdaptionPromptModel
--- a/docs/source/package_reference/loha.md
+++ b/docs/source/package_reference/loha.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# LoHa
+
+Low-Rank Hadamard Product ([LoHa](https://huggingface.co/papers/2108.06098)) is similar to LoRA, except that it approximates the large weight matrix with more low-rank matrices and combines them with the Hadamard product. This method is even more parameter-efficient than LoRA and achieves comparable performance.
+
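One way to see why the Hadamard product helps: the element-wise product of two rank-2 matrices can have rank up to 4, whereas a single low-rank product with the same inner dimension stays at rank 2. The sketch below checks this on small integer matrices in plain Python; the matrices and helper names are chosen for illustration and are not part of PEFT.

```python
from fractions import Fraction


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))] for i in range(len(a))]


def hadamard(a, b):
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def rank(m):
    """Matrix rank by exact Gaussian elimination over fractions."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Two low-rank pairs, each with inner dimension 2.
b1 = [[1, 1], [1, 2], [1, 3], [1, 4]]
b2 = [[1, 1], [1, 4], [1, 9], [1, 16]]
m1 = matmul(b1, transpose(b1))  # rank 2
m2 = matmul(b2, transpose(b2))  # rank 2

# LoHa-style update: element-wise product of the two low-rank products.
delta = hadamard(m1, m2)
ranks = (rank(m1), rank(m2), rank(delta))
```

In general the rank of a Hadamard product is bounded by the product of the ranks of its factors, so two rank-`r` pairs can express an update of rank up to `r * r`.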
+The abstract from the paper is:
+
+*In this work, we propose a communication-efficient parameterization, FedPara, for federated learning (FL) to overcome the burdens on frequent model uploads and downloads. Our method re-parameterizes weight parameters of layers using low-rank weights followed by the Hadamard product. Compared to the conventional low-rank parameterization, our FedPara method is not restricted to low-rank constraints, and thereby it has a far larger capacity. This property enables to achieve comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers, which is not achievable by the traditional low-rank methods. The efficiency of our method can be further improved by combining with other efficient FL optimizers. In addition, we extend our method to a personalized FL application, pFedPara, which separates parameters into global and local ones. We show that pFedPara outperforms competing personalized FL methods with more than three times fewer parameters*.
+
+## LoHaConfig
+
+[[autodoc]] tuners.loha.config.LoHaConfig
+
+## LoHaModel
+
+[[autodoc]] tuners.loha.model.LoHaModel
--- a/docs/source/package_reference/lokr.md
+++ b/docs/source/package_reference/lokr.md
@ -0,0 +1,27 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# LoKr
+
+Low-Rank Kronecker Product ([LoKr](https://hf.co/papers/2309.14859)) is a LoRA-variant method that approximates the large weight matrix with two low-rank matrices and combines them with the Kronecker product. LoKr also provides an optional third low-rank matrix to provide better control during fine-tuning.
+
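As a rough illustration of how a Kronecker product builds a large matrix from small factors, consider the plain-Python sketch below. The `kron` helper, the factor values, and the 1024 x 1024 example size are assumptions made for this example and do not mirror PEFT's internals.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
delta = kron(a, b)  # 4x4 update built from two 2x2 factors

# Parameter count for a 1024x1024 update built from two 32x32 factors.
full_params = 1024 * 1024
factored_params = 32 * 32 + 32 * 32
```

Each block of `delta` is a copy of `b` scaled by one entry of `a`, which is why two small factors are enough to fill the full-size update.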
+## LoKrConfig
+
+[[autodoc]] tuners.lokr.config.LoKrConfig
+
+## LoKrModel
+
+[[autodoc]] tuners.lokr.model.LoKrModel
--- a/docs/source/package_reference/lora.md
+++ b/docs/source/package_reference/lora.md
@ -0,0 +1,55 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# LoRA
+
+Low-Rank Adaptation ([LoRA](https://huggingface.co/papers/2309.15223)) is a PEFT method that decomposes a large matrix into two smaller low-rank matrices in the attention layers. This drastically reduces the number of parameters that need to be fine-tuned.
+
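The decomposition can be sketched with plain Python lists. The example assumes the common LoRA formulation `W + (alpha / r) * B @ A` with `B` starting at zero; the shapes, values, and helper names are illustrative and not taken from PEFT's source.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))] for i in range(len(a))]


def lora_weight(w, lora_b, lora_a, alpha, r):
    """Effective weight W + (alpha / r) * B @ A."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


d_out, d_in, r, alpha = 4, 3, 1, 2
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # frozen, d_out x d_in
lora_a = [[0.5, -1.0, 2.0]]  # r x d_in, trainable
lora_b = [[0.0], [0.0], [0.0], [0.0]]  # d_out x r, trainable, zero at the start

# With B at zero the adapted weight equals the frozen weight.
at_init = lora_weight(w, lora_b, lora_a, alpha, r)

# Pretend a training step changed one entry of B.
lora_b[0][0] = 1.0
after_update = lora_weight(w, lora_b, lora_a, alpha, r)

full_params = d_out * d_in
lora_params = r * (d_out + d_in)
```

The saving grows with the matrix size: the full matrix has `d_out * d_in` entries, while the two low-rank factors together have `r * (d_out + d_in)`.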
+The abstract from the paper is:
+
+*We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6.*
+
+## LoraConfig
+
+[[autodoc]] tuners.lora.config.LoraConfig
+
+## LoraModel
+
+[[autodoc]] tuners.lora.model.LoraModel
+
+## Utility
+
+### ArrowConfig
+
+[[autodoc]] tuners.lora.config.ArrowConfig
+
+### LoftQ
+
+[[autodoc]] utils.loftq_utils.replace_lora_weights_loftq
+
+### Eva
+
+#### EvaConfig
+
+[[autodoc]] tuners.lora.config.EvaConfig
+
+#### initialize_lora_eva_weights
+
+[[autodoc]] tuners.lora.eva.initialize_lora_eva_weights
+
+#### get_eva_state_dict
+
+[[autodoc]] tuners.lora.eva.get_eva_state_dict
--- a/docs/source/package_reference/merge_utils.md
+++ b/docs/source/package_reference/merge_utils.md
@ -0,0 +1,33 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Model merge
+
+PEFT provides several internal utilities for [merging LoRA adapters](../developer_guides/model_merging) with the TIES and DARE methods.
+
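The TIES procedure can be summarized in three steps: trim each task vector by magnitude, elect a sign per position, then average only the values that agree with the elected sign. The sketch below works on flat Python lists and omits per-adapter weights; it illustrates the idea and is not the tensor-based code in `utils.merge_utils`. All helper names here are chosen for this example.

```python
def prune_by_magnitude(values, density):
    """Keep the top `density` fraction of entries by absolute value, zero the rest."""
    k = int(density * len(values))
    keep = set(sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def majority_sign(task_vectors):
    """Per-position sign of the summed values across task vectors."""
    return [1.0 if sum(col) >= 0 else -1.0 for col in zip(*task_vectors)]


def ties_merge(task_vectors, density):
    pruned = [prune_by_magnitude(tv, density) for tv in task_vectors]
    signs = majority_sign(pruned)
    merged = []
    for sign, col in zip(signs, zip(*pruned)):
        # Average only the non-zero entries whose sign matches the elected sign.
        agreeing = [v for v in col if v != 0.0 and (v > 0) == (sign > 0)]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


tv_a = [0.9, -0.1, 0.4, 0.0]
tv_b = [-0.5, -0.8, 0.6, 0.05]
merged = ties_merge([tv_a, tv_b], density=0.75)
```

At the first position the two task vectors disagree in sign; the larger total wins and the conflicting value is dropped instead of being averaged in.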
+[[autodoc]] utils.merge_utils.prune
+
+[[autodoc]] utils.merge_utils.calculate_majority_sign_mask
+
+[[autodoc]] utils.merge_utils.disjoint_merge
+
+[[autodoc]] utils.merge_utils.task_arithmetic
+
+[[autodoc]] utils.merge_utils.ties
+
+[[autodoc]] utils.merge_utils.dare_linear
+
+[[autodoc]] utils.merge_utils.dare_ties
--- a/docs/source/package_reference/miss.md
+++ b/docs/source/package_reference/miss.md
@ -0,0 +1,32 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# MiSS
+
+MiSS: Balancing LoRA Performance and Efficiency with Simple Shard Sharing ([MiSS](https://huggingface.co/papers/2409.15371)) is a novel PEFT method that adopts a low-rank structure, requires only a single trainable matrix, and introduces a new update mechanism distinct from LoRA, achieving an excellent balance between performance and efficiency.
+
+The abstract from the paper is:
+
+*Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), effectively reduce the number of trainable parameters in Large Language Models (LLMs). However, as model scales continue to grow, the demand for computational resources remains a significant challenge. Existing LoRA variants often struggle to strike an optimal balance between adaptability (model performance and convergence speed) and efficiency (computational overhead, memory usage, and initialization time). This paper introduces MiSS (Matrix Shard Sharing), a novel PEFT approach that addresses this trade-off through a simple shard-sharing mechanism. MiSS leverages the insight that a low-rank adaptation can be achieved by decomposing the weight matrix into multiple fragment matrices and utilizing a shared, trainable common fragment. This method constructs the low-rank update matrix through the replication of these shared, partitioned shards. We also propose a hardware-efficient and broadly applicable implementation for MiSS. Extensive experiments conducted on a range of tasks, alongside a systematic analysis of computational performance, demonstrate MiSS's superiority. The results show that MiSS significantly outperforms standard LoRA and its prominent variants in both model performance metrics and computational efficiency, including initialization speed and training throughput. By effectively balancing expressive power and resource utilization, MiSS offers a compelling solution for efficiently adapting large-scale models*.
+
+
+## MissConfig
+
+[[autodoc]] tuners.miss.config.MissConfig
+
+## MissModel
+
+[[autodoc]] tuners.miss.model.MissModel
--- a/docs/source/package_reference/multitask_prompt_tuning.md
+++ b/docs/source/package_reference/multitask_prompt_tuning.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Multitask prompt tuning
+
+[Multitask prompt tuning](https://huggingface.co/papers/2303.02861) decomposes the soft prompts of each task into a single learned transferable prompt instead of a separate prompt for each task. The single learned prompt can be adapted for each task by multiplicative low rank updates.
+
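A multiplicative low-rank update can be pictured as the element-wise product of the shared prompt with a rank-one matrix `u v^T`. The sketch below assumes that form; the shapes, values, and the `task_prompt` helper are illustrative and not taken from PEFT.

```python
# Shared prompt with 2 virtual tokens and embedding dimension 3.
shared_prompt = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]


def task_prompt(shared, u, v):
    """Hadamard product of the shared prompt with the rank-one matrix u v^T."""
    return [[shared[i][j] * u[i] * v[j] for j in range(len(v))] for i in range(len(u))]


# Task-specific vectors: one entry per token (u) and one per dimension (v).
u = [1.0, 0.5]
v = [2.0, 1.0, 0.0]
adapted = task_prompt(shared_prompt, u, v)

# Vectors of ones reproduce the shared prompt unchanged.
identity = task_prompt(shared_prompt, [1.0, 1.0], [1.0, 1.0, 1.0])

per_task_params = len(u) + len(v)
full_prompt_params = len(shared_prompt) * len(shared_prompt[0])
```

Each task stores only `u` and `v` instead of a full prompt, which is where the per-task parameter savings come from.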
+The abstract from the paper is:
+
+*Prompt tuning, in which a base pretrained model is adapted to each task via conditioning on learned prompt vectors, has emerged as a promising approach for efficiently adapting large language models to multiple downstream tasks. However, existing methods typically learn soft prompt vectors from scratch, and it has not been clear how to exploit the rich cross-task knowledge with prompt vectors in a multitask learning setting. We propose multitask prompt tuning (MPT), which first learns a single transferable prompt by distilling knowledge from multiple task-specific source prompts. We then learn multiplicative low rank updates to this shared prompt to efficiently adapt it to each downstream target task. Extensive experiments on 23 NLP datasets demonstrate that our proposed approach outperforms the state-of-the-art methods, including the full finetuning baseline in some cases, despite only tuning 0.035% as many task-specific parameters*.
+
+## MultitaskPromptTuningConfig
+
+[[autodoc]] tuners.multitask_prompt_tuning.config.MultitaskPromptTuningConfig
+
+## MultitaskPromptEmbedding
+
+[[autodoc]] tuners.multitask_prompt_tuning.model.MultitaskPromptEmbedding
--- a/docs/source/package_reference/oft.md
+++ b/docs/source/package_reference/oft.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# OFT
+
+[Orthogonal Finetuning (OFT)](https://hf.co/papers/2306.07280) is a method developed for adapting text-to-image diffusion models. It works by reparameterizing the pretrained weight matrices with an orthogonal matrix to preserve information in the pretrained model. To reduce the number of parameters, OFT introduces a block-diagonal structure in the orthogonal matrix.
+
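The effect of a block-diagonal orthogonal matrix can be checked on a toy example. The sketch below uses 2x2 rotation blocks as a stand-in for learned orthogonal blocks (an assumption made for simplicity, not how PEFT parameterizes them) and verifies that the Gram matrix of the weight, which encodes pairwise relationships between neurons, is unchanged.

```python
import math


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def block_diag(blocks):
    """Place square blocks along the diagonal of an otherwise zero matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, x in enumerate(row):
                out[offset + i][offset + j] = x
        offset += len(b)
    return out


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))] for i in range(len(a))]


def gram(w):
    """W^T W: inner products between the columns of W."""
    return matmul([list(col) for col in zip(*w)], w)


r_mat = block_diag([rotation(0.3), rotation(-1.1)])  # 4x4, two 2x2 blocks
w = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]]  # frozen 4x2 weight
w_new = matmul(r_mat, w)

gram_before = gram(w)
gram_after = gram(w_new)

block_entries = 2 * (2 * 2)  # entries stored in the two blocks
dense_entries = 4 * 4  # entries of a dense 4x4 matrix
```

The block structure halves the stored entries in this toy case; with more and smaller blocks the saving grows accordingly.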
+The abstract from the paper is:
+
+*Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed*.
+
+## OFTConfig
+
+[[autodoc]] tuners.oft.config.OFTConfig
+
+## OFTModel
+
+[[autodoc]] tuners.oft.model.OFTModel
--- a/docs/source/package_reference/p_tuning.md
+++ b/docs/source/package_reference/p_tuning.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# P-tuning
+
+[P-tuning](https://hf.co/papers/2103.10385) adds trainable prompt embeddings to the input that are optimized by a prompt encoder to find a better prompt, eliminating the need to manually design prompts. The prompt tokens can be added anywhere in the input sequence, and p-tuning also introduces anchor tokens for improving performance.
+
+The abstract from the paper is:
+
+*While GPTs with traditional fine-tuning fail to achieve strong results on natural language understanding (NLU), we show that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method P-tuning -- which employs trainable continuous prompt embeddings. On the knowledge probing (LAMA) benchmark, the best GPT recovers 64\% (P@1) of world knowledge without any additional text provided during test time, which substantially improves the previous best by 20+ percentage points. On the SuperGlue benchmark, GPTs achieve comparable and sometimes better performance to similar-sized BERTs in supervised learning. Importantly, we find that P-tuning also improves BERTs' performance in both few-shot and supervised settings while largely reducing the need for prompt engineering. Consequently, P-tuning outperforms the state-of-the-art approaches on the few-shot SuperGlue benchmark.*
+
+## PromptEncoderConfig
+
+[[autodoc]] tuners.p_tuning.config.PromptEncoderConfig
+
+## PromptEncoder
+
+[[autodoc]] tuners.p_tuning.model.PromptEncoder
--- a/docs/source/package_reference/peft_model.mdx
+++ b/docs/source/package_reference/peft_model.mdx
@ -1,6 +1,10 @@
+<!--⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+
 # Models

-[`PeftModel`] is the base model class for specifying the base Transformer model and configuration to apply a PEFT method to. The base `PeftModel` contains methods for loading and saving models from the Hub, and supports the [`PromptEncoder`] for prompt learning.
+[`PeftModel`] is the base model class for specifying the base Transformer model and configuration to apply a PEFT method to. The base `PeftModel` contains methods for loading and saving models from the Hub.

 ## PeftModel

@ -48,3 +52,26 @@ A `PeftModel` for getting extracting features/embeddings from transformer models

 [[autodoc]] PeftModelForFeatureExtraction
    - all
+
+## PeftMixedModel
+
+A `PeftModel` for mixing different adapter types (e.g. LoRA and LoHa).
+
+[[autodoc]] PeftMixedModel
+    - all
+
+## Utilities
+
+[[autodoc]] utils.cast_mixed_precision_params
+
+[[autodoc]] get_peft_model
+
+[[autodoc]] inject_adapter_in_model
+
+[[autodoc]] utils.get_peft_model_state_dict
+
+[[autodoc]] utils.prepare_model_for_kbit_training
+
+[[autodoc]] get_layer_status
+
+[[autodoc]] get_model_status
--- a/docs/source/package_reference/peft_types.md
+++ b/docs/source/package_reference/peft_types.md
@ -0,0 +1,27 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# PEFT types
+
+[`PeftType`] includes the supported adapters in PEFT, and [`TaskType`] includes PEFT-supported tasks.
+
+## PeftType
+
+[[autodoc]] utils.peft_types.PeftType
+
+## TaskType
+
+[[autodoc]] utils.peft_types.TaskType
--- a/docs/source/package_reference/poly.md
+++ b/docs/source/package_reference/poly.md
@ -0,0 +1,44 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Polytropon
+
+[Polytropon](https://hf.co/papers/2202.13914) is a multitask model with a number of different LoRA adapters in its "inventory". The model learns the correct combination of adapters from the inventory with a routing function to choose the best subset of modules for a specific task. PEFT also supports [Multi-Head Adapter Routing (MHR)](https://hf.co/papers/2211.03831) for Polytropon which builds on and improves the routing function by combining the adapter heads more granularly. The adapter heads are separated into disjoint blocks and a different routing function is learned for each one, allowing for more expressivity.
+
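The inventory-and-routing idea can be sketched with a binary task-skill allocation, following the first abstract's description of each task's parameters as the average of its active skills. The sketch uses flat lists in place of adapter weights; the task names, values, and `combine` helper are illustrative assumptions, and a learned routing function would produce soft weights instead of fixed 0/1 entries.

```python
# Three "skills", each a flattened adapter with two parameters.
inventory = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

# One row per task: which skills are active (1) or inactive (0).
routing = {"task_a": [1, 1, 0], "task_b": [0, 0, 1]}


def combine(inventory, allocation):
    """Average the parameters of the skills that are active for a task."""
    active = [skill for skill, on in zip(inventory, allocation) if on]
    return [sum(vals) / len(active) for vals in zip(*active)]


adapter_a = combine(inventory, routing["task_a"])
adapter_b = combine(inventory, routing["task_b"])
```

Multi-Head Routing applies the same kind of combination separately to disjoint blocks of the adapter parameters, each with its own allocation.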
+<hfoptions id="paper">
+<hfoption id="Combining Modular Skills in Multitask Learning">
+
+The abstract from the paper is:
+
+*A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks. In this work, we assume that each task is associated with a subset of latent discrete skills from a (potentially small) inventory. In turn, skills correspond to parameter-efficient (sparse / low-rank) model parameterisations. By jointly learning these and a task-skill allocation matrix, the network for each task is instantiated as the average of the parameters of active skills. To favour non-trivial soft partitions of skills across tasks, we experiment with a series of inductive biases, such as an Indian Buffet Process prior and a two-speed learning rate. We evaluate our latent-skill model on two main settings: 1) multitask reinforcement learning for grounded instruction following on 8 levels of the BabyAI platform; and 2) few-shot adaptation of pre-trained text-to-text generative models on CrossFit, a benchmark comprising 160 NLP tasks. We find that the modular design of a network significantly increases sample efficiency in reinforcement learning and few-shot generalisation in supervised learning, compared to baselines with fully shared, task-specific, or conditionally generated parameters where knowledge is entangled across tasks. In addition, we show how discrete skills help interpretability, as they yield an explicit hierarchy of tasks.*
+
+</hfoption>
+<hfoption id="Multi-Head Adapter Routing for Cross-Task Generalization">
+
+The abstract from the paper is:
+
+*Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (Poly) jointly learns an inventory of adapters and a routing function that selects a (variable-size) subset of adapters for each task during both pre-training and few-shot adaptation. In this paper, we investigate the role that adapter routing plays in its success and design new variants based on our findings. First, we build on the intuition that finer-grained routing provides more expressivity. Hence, we propose MHR (Multi-Head Routing), which combines subsets of adapter parameters and outperforms Poly under a comparable parameter budget; by only fine-tuning the routing function and not the adapters (MHR-z), we achieve competitive performance with extreme parameter efficiency. Second, we find that Poly/MHR performance is a result of better multi-task optimization, rather than modular inductive biases that facilitate adapter recombination and local adaptation, as previously hypothesized. In fact, we find that MHR exhibits higher gradient alignment between tasks than any other method. Since this implies that routing is only crucial during multi-task pre-training, we propose MHR-mu, which discards routing and fine-tunes the average of the pre-trained adapters during few-shot adaptation. This establishes MHR-mu as an effective method for single-adapter fine-tuning.*
+
+</hfoption>
+</hfoptions>
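The difference between Polytropon routing and Multi-Head Routing described above can be sketched in plain Python. This is a minimal illustration, not the PEFT implementation: the helper names (`route`, `combine`, `poly_adapter`, `mhr_adapter`) are hypothetical, adapters are flattened to short lists of floats, and the sigmoid-then-normalize routing is an assumption about how the mixing weights are formed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(logits):
    """Turn per-skill routing logits into mixing weights that sum to 1."""
    gates = [sigmoid(v) for v in logits]
    total = sum(gates)
    return [g / total for g in gates]


def combine(skills, weights):
    """Weighted average of skill parameter vectors."""
    dim = len(skills[0])
    return [sum(w * s[i] for w, s in zip(weights, skills)) for i in range(dim)]


def poly_adapter(skills, task_logits):
    """Poly: one routing vector mixes the whole adapter of every skill."""
    return combine(skills, route(task_logits))


def mhr_adapter(skills, task_logits_per_split):
    """MHR: each disjoint block of the adapter gets its own routing vector."""
    n_splits = len(task_logits_per_split)
    block = len(skills[0]) // n_splits
    out = []
    for k, logits in enumerate(task_logits_per_split):
        chunk = [s[k * block:(k + 1) * block] for s in skills]
        out.extend(combine(chunk, route(logits)))
    return out


# Two skills in the inventory, each a 4-parameter adapter.
skills = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
poly = poly_adapter(skills, [0.0, 0.0])
mhr = mhr_adapter(skills, [[0.0, 0.0], [10.0, -10.0]])
```

With a single routing vector, every parameter of the task adapter is the same mixture of the two skills. With two splits, the second block can lean almost entirely on the first skill while the first block stays an even mixture, which is the extra expressivity the paragraph refers to.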
+
+## PolyConfig
+
+[[autodoc]] tuners.poly.config.PolyConfig
+
+## PolyModel
+
+[[autodoc]] tuners.poly.model.PolyModel
--- a/docs/source/package_reference/prefix_tuning.md
+++ b/docs/source/package_reference/prefix_tuning.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Prefix tuning
+
+[Prefix tuning](https://hf.co/papers/2101.00190) prefixes a series of task-specific vectors to the input sequence that can be learned while keeping the pretrained model frozen. The prefix parameters are inserted in all of the model layers.
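As a rough illustration of "inserted in all of the model layers", the sketch below keeps one trainable key block and one trainable value block per layer and prepends them to that layer's keys and values. It is a simplified model in plain Python, not the PEFT `PrefixEncoder`: the function names are hypothetical, the vectors are zero placeholders, and it assumes no reparameterization network is used.

```python
def make_prefix(num_layers, num_virtual_tokens, hidden_size):
    """One trainable key block and value block per layer (zeros as placeholders)."""
    def block():
        return [[0.0] * hidden_size for _ in range(num_virtual_tokens)]
    return [{"key": block(), "value": block()} for _ in range(num_layers)]


def attend_with_prefix(layer_prefix, keys, values):
    """Prepend the layer's prefix so real tokens can attend to the virtual tokens."""
    return layer_prefix["key"] + keys, layer_prefix["value"] + values


prefix = make_prefix(num_layers=2, num_virtual_tokens=3, hidden_size=4)
keys = [[1.0] * 4 for _ in range(5)]    # 5 real tokens
values = [[1.0] * 4 for _ in range(5)]
k, v = attend_with_prefix(prefix[0], keys, values)

# Only the prefix is trained: layers * 2 * virtual tokens * hidden size values.
n_trainable = sum(len(row) for layer in prefix for blk in layer.values() for row in blk)
```

Under these assumptions the trainable parameter count grows with the number of layers and virtual tokens but is independent of the size of the frozen model's own weights.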
+
+The abstract from the paper is:
+
+*Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens". We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1\% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training*.
+
+## PrefixTuningConfig
+
+[[autodoc]] tuners.prefix_tuning.config.PrefixTuningConfig
+
+## PrefixEncoder
+
+[[autodoc]] tuners.prefix_tuning.model.PrefixEncoder
--- a/docs/source/package_reference/prompt_tuning.md
+++ b/docs/source/package_reference/prompt_tuning.md
@ -0,0 +1,31 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Prompt tuning
+
+[Prompt tuning](https://hf.co/papers/2104.08691) adds task-specific prompts to the input, and these prompt parameters are updated independently of the pretrained model parameters which are frozen.
+
+The abstract from the paper is:
+
+*In this work, we explore "prompt tuning", a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning*.
+
+## PromptTuningConfig
+
+[[autodoc]] tuners.prompt_tuning.config.PromptTuningConfig
+
+## PromptEmbedding
+
+[[autodoc]] tuners.prompt_tuning.model.PromptEmbedding
--- a/docs/source/package_reference/randlora.md
+++ b/docs/source/package_reference/randlora.md
@ -0,0 +1,45 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# RandLora: Full-rank parameter-efficient fine-tuning of large models
+
+[RandLora](https://huggingface.co/papers/2502.00987) is a parameter-efficient fine-tuning technique that is similar to [LoRA](https://huggingface.co/papers/2106.09685) and [VeRA](https://huggingface.co/papers/2310.11454) but performs full-rank updates to improve performance. RandLora can be particularly useful when adapting large models to hard tasks that require complex updates while preserving the parameter efficiency of LoRA. The full-rank update of RandLora is achieved by linearly scaling random bases. The random bases are a collection of multiple low-rank matrices such that the sum of their ranks is greater than or equal to the full rank of the parameter matrices. The trainable parameters of RandLora are two diagonal matrices (vectors) that get multiplied with the right-hand low-rank random bases, in a similar way to VeRA's update. To maintain low memory usage, RandLora uses a custom function that prevents storing unnecessary bases in memory for backpropagation.
+
+RandLora differs notably from other LoRA-like PEFT algorithms in that reducing the rank of its random bases increases the number of trainable parameters. Because the product of the number of bases and the rank of each base is constant in RandLora, reducing the rank increases the number of random bases, and hence the number of base-specific trainable diagonal matrices.
+
+Because reducing the rank of RandLora's random bases increases their number, RandLora can become slower to train than LoRA for very small ranks; typically, ranks below 4 result in a large increase in training time. This does not affect inference, though, as the RandLora adapters can be merged into the pretrained weight matrices.
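The relation between rank, number of bases, and trainable parameters can be made concrete with a little arithmetic. The sketch below is a simplified model under stated assumptions, not the library's accounting: the helper names are hypothetical, the number of bases is taken as the smallest count whose ranks sum to at least the full rank, and each base is assumed to carry two trainable vectors, one of length `r` and one of length `min(in_features, out_features)`.

```python
import math


def randlora_num_bases(in_features, out_features, r):
    """Smallest number of rank-r bases whose ranks sum to at least the full rank."""
    return math.ceil(min(in_features, out_features) / r)


def randlora_trainable_params(in_features, out_features, r):
    """Assumed count: per base, one vector of length r and one of length min(in, out)."""
    n_bases = randlora_num_bases(in_features, out_features, r)
    return n_bases * (r + min(in_features, out_features))


# Sweep the rank for a single 1024 x 1024 linear layer.
table = {
    r: (randlora_num_bases(1024, 1024, r), randlora_trainable_params(1024, 1024, r))
    for r in (4, 32, 128)
}
for r, (n_bases, n_params) in table.items():
    print(f"r={r:>3}  bases={n_bases:>3}  trainable={n_params}")
```

Under these assumptions, going from rank 32 down to rank 4 multiplies the number of bases by eight, which is why both the trainable parameter count and the training time grow at small ranks.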
+
+RandLora additionally supports training with sparse, ternary random bases (only containing -1, 0 and 1). These bases are as described in [Bingham et al.](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf) and [Ping et al.](https://hastie.su.domains/Papers/Ping/KDD06_rp.pdf) and could theoretically be used to reduce compute needs by performing aggregations instead of matrix multiplications to create the weight update. This is not currently supported. Although it does not currently reduce compute, using sparse random bases in RandLora can reduce overfitting in some cases. For users interested in using sparse ternary bases, the `sparse` option is recommended over the `very_sparse` one, which can reduce performance.
+
+Similarly to VeRA, when saving RandLora's parameters, it's possible to eschew storing the low-rank matrices by setting `save_projection=False` on the `RandLoraConfig`. In that case, these matrices will be restored based on the fixed random seed from the `projection_prng_key` argument. This cuts down on the size of the checkpoint, but we cannot guarantee reproducibility on all devices and for all future versions of PyTorch. If you want to ensure reproducibility, set `save_projection=True` (which is the default).
+
+As in VeRA, and to handle different shapes of adapted layers, RandLora initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
+
+RandLora currently has the following constraint:
+
+- Only `nn.Linear` layers are supported.
+
+The abstract from the paper is:
+
+> Low-Rank Adaptation (LoRA) and its variants have shown impressive results in reducing the number of trainable parameters and memory requirements of large transformer networks while maintaining fine-tuning performance. The low-rank nature of the weight update inherently limits the representation power of fine-tuned models, however, thus potentially compromising performance on complex tasks. This raises a critical question: when a performance gap between LoRA and standard fine-tuning is observed, is it due to the reduced number of trainable parameters or the rank deficiency?
+This paper aims to answer this question by introducing RandLora, a parameter-efficient method that performs full-rank updates using a learned linear combinations of low-rank, non-trainable random matrices. Our method limits the number of trainable parameters by restricting optimization to diagonal scaling matrices applied to the fixed random matrices. This allows us to effectively overcome the low-rank limitations while maintaining parameter and memory efficiency during training. Through extensive experimentation across vision, language, and vision-language benchmarks, we systematically evaluate the limitations of LoRA and existing random basis methods. Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLora significantly reduces---and sometimes eliminates---the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.
+
+## RandLoraConfig
+
+[[autodoc]] tuners.randlora.config.RandLoraConfig
+
+## RandLoraModel
+
+[[autodoc]] tuners.randlora.model.RandLoraModel
--- a/docs/source/package_reference/road.md
+++ b/docs/source/package_reference/road.md
@ -0,0 +1,31 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# RoAd
+
+[RoAd](https://arxiv.org/pdf/2409.00119) is a parameter-efficient fine-tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions. RoAd achieves competitive or superior performance compared to other PEFT methods with under 0.1% trainable parameters. Unlike LoRA’s batched low-rank updates, RoAd’s sparse rotations reformulate to simple element-wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch, i.e., serving multiple adapters simultaneously. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, interpreting its sparse 2D rotations as task-specific interventions within learned subspaces of hidden representations. These orthogonal subspaces can be composed to merge multiple task-specific behaviors, such as multilingual capabilities or instruction following, without additional fine-tuning, enabling modular, interpretable adaptations in LLMs.
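The idea of rotating pairs of hidden dimensions with 2×2 matrices can be sketched in a few lines of plain Python. This is an illustration only: the function name `road_rotate` is hypothetical, and pairing adjacent dimensions `(2i, 2i+1)` is an assumption made for readability; the actual pairing scheme used by the method may differ.

```python
import math


def road_rotate(hidden, thetas, scales=None):
    """Rotate each pair (hidden[2i], hidden[2i+1]) by thetas[i], then scale it."""
    assert len(hidden) == 2 * len(thetas)
    if scales is None:
        scales = [1.0] * len(thetas)
    out = []
    for i, (theta, alpha) in enumerate(zip(thetas, scales)):
        x, y = hidden[2 * i], hidden[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        # The 2x2 rotation reduces to element-wise multiplies and adds.
        out.append(alpha * (c * x - s * y))
        out.append(alpha * (s * x + c * y))
    return out


# A 4-dimensional hidden state needs only two angles (plus optional scales).
rotated = road_rotate([1.0, 0.0, 2.0, 0.0], [math.pi / 2, 0.0])
```

Because every output element depends on just two inputs and two scalars, different requests in a batch can use different angle vectors without any batched matrix multiplication, which is the property the throughput claim above relies on.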
+
+Finetuning with RoAd typically requires a higher learning rate than LoRA or similar methods, around 1e-3. Currently, RoAd only supports linear layers, and it can be used on models quantized with bitsandbytes (4-bit or 8-bit).
+
+For running inference with different RoAd adapters in the same batch see [Inference with different LoRA adapters in the same batch](../developer_guides/lora#inference-with-different-lora-adapters-in-the-same-batch).
+
+## RoadConfig
+
+[[autodoc]] tuners.road.config.RoadConfig
+
+## RoadModel
+
+[[autodoc]] tuners.road.model.RoadModel
--- a/docs/source/package_reference/shira.md
+++ b/docs/source/package_reference/shira.md
@ -0,0 +1,35 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Sparse High Rank Adapters
+
+Sparse High Rank Adapters, or [SHiRA](https://arxiv.org/abs/2406.13175), are an alternative type of adapter that has been found to have significant advantages over low-rank adapters. Specifically, SHiRA achieves better accuracy than LoRA for a variety of vision and language tasks. It also offers simpler and higher quality multi-adapter fusion by significantly reducing concept loss, a common problem faced by low-rank adapters. SHiRA adapts a model to a task by directly finetuning a small number of the base model's parameters.
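To illustrate what "directly finetuning a small number of the base model's parameters" means, the sketch below applies a gradient step only where a fixed sparse mask is set. It is a toy model in plain Python with hypothetical helper names; choosing the mask uniformly at random and using plain SGD are assumptions made for the example.

```python
import random


def make_sparse_mask(rows, cols, fraction, seed=0):
    """Mark a fixed fraction of weight positions as trainable (random choice assumed)."""
    rng = random.Random(seed)
    n_trainable = max(1, round(rows * cols * fraction))
    chosen = set(rng.sample(range(rows * cols), n_trainable))
    return [[(r * cols + c) in chosen for c in range(cols)] for r in range(rows)]


def masked_update(weight, grad, mask, lr):
    """Apply an SGD step only at masked positions; all other weights stay frozen."""
    return [
        [w - lr * g if m else w for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weight, grad, mask)
    ]


weight = [[1.0] * 10 for _ in range(10)]
grad = [[1.0] * 10 for _ in range(10)]
mask = make_sparse_mask(10, 10, fraction=0.02)
updated = masked_update(weight, grad, mask, lr=0.5)
changed = sum(u != w for u_row, w_row in zip(updated, weight) for u, w in zip(u_row, w_row))
```

Only the masked entries differ from the base weights afterwards, so an adapter in this sketch is just the positions and values of those entries.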
+
+SHiRA currently has the following constraint:
+
+- Only `nn.Linear` layers are supported.
+
+The abstract from the paper is:
+
+> Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept-loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model significantly outperforms LoRA while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on Parameter-Efficient Finetuning (PEFT) Library which trains at nearly the same speed as LoRA while consuming up to 16% lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases. To demonstrate rapid switching benefits during inference, we show that loading SHiRA on a base model can be 5x-16x faster than LoRA fusion on a CPU.
+
+## ShiraConfig
+
+[[autodoc]] tuners.shira.config.ShiraConfig
+
+## ShiraModel
+
+[[autodoc]] tuners.shira.model.ShiraModel
--- a/docs/source/package_reference/trainable_tokens.md
+++ b/docs/source/package_reference/trainable_tokens.md
@ -0,0 +1,50 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Trainable Tokens
+
+The Trainable Tokens method provides a way to target specific token embeddings for fine-tuning without resorting to
+training the full embedding matrix or using an adapter on the embedding matrix. It is based on the initial implementation from
+[here](https://github.com/huggingface/peft/pull/1541).
+
+The method only targets specific tokens and selectively trains the token indices you specify. Consequently, the
+required RAM will be lower, and disk usage is also significantly lower than when storing the full fine-tuned embedding matrix.
+
+Some preliminary benchmarks acquired with [this script](https://github.com/huggingface/peft/blob/main/scripts/train_memory.py)
+suggest that for `gemma-2-2b` (which has a rather large embedding matrix) you can save ~4 GiB VRAM with Trainable Tokens
+over fully fine-tuning the embedding matrix. While LoRA will use comparable amounts of VRAM it might also target
+tokens you don't want to be changed. Note that these are just indications and varying embedding matrix sizes might skew
+these numbers a bit.
+
+Note that this method does not add tokens for you; you have to add tokens to the tokenizer yourself and resize the
+embedding matrix of the model accordingly. This method will only re-train the embeddings for the tokens you specify.
+This method can also be used in conjunction with LoRA layers! See [the LoRA developer guide](../developer_guides/lora#efficiently-train-tokens-alongside-lora).
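The storage argument above can be illustrated with a toy lookup that keeps trained rows only for the selected token indices and falls back to the frozen embedding matrix otherwise. This is a conceptual sketch in plain Python, not how the PEFT layer is implemented; the names `base`, `trained_rows`, and `embed` are hypothetical.

```python
vocab_size, dim = 1000, 8

# Frozen embedding matrix (zeros as placeholders for pretrained values).
base = [[0.0] * dim for _ in range(vocab_size)]

# Only these token indices are trained; each gets its own copy of its row.
token_indices = [998, 999]
trained_rows = {i: list(base[i]) for i in token_indices}

# Pretend an optimizer step changed one value of token 999.
trained_rows[999][0] = 1.5


def embed(token_id):
    """Use the trained row if the token is targeted, else the frozen row."""
    return trained_rows.get(token_id, base[token_id])


stored_values = len(trained_rows) * dim   # what has to be saved
full_values = vocab_size * dim            # size of the full embedding matrix
```

In this toy setting the checkpoint holds 16 values instead of 8000, and tokens outside `token_indices` are guaranteed to keep their original embeddings.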
+
+> [!TIP]
+> Saving the model with [`~PeftModel.save_pretrained`] or retrieving the state dict using
+> [`get_peft_model_state_dict`] when adding new tokens may save the full embedding matrix instead of only the difference
+> as a precaution because the embedding matrix was resized. To save space you can disable this behavior by setting
+> `save_embedding_layers=False` when calling `save_pretrained`. This is safe to do as long as you don't modify the
+> embedding matrix through other means as well, as such changes will not be tracked by trainable tokens.
+
+## TrainableTokensConfig
+
+[[autodoc]] tuners.trainable_tokens.config.TrainableTokensConfig
+
+## TrainableTokensModel
+
+[[autodoc]] tuners.trainable_tokens.model.TrainableTokensModel
+
--- a/docs/source/package_reference/tuners.md
+++ b/docs/source/package_reference/tuners.md
@ -0,0 +1,27 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Tuners
+
+A tuner (or adapter) is a module that can be plugged into a `torch.nn.Module`. [`BaseTuner`] is the base class for other tuners and provides shared methods and attributes for preparing an adapter configuration and replacing a target module with the adapter module. [`BaseTunerLayer`] is a base class for adapter layers. It offers methods and attributes for managing adapters, such as activating and disabling adapters.
+
+## BaseTuner
+
+[[autodoc]] tuners.tuners_utils.BaseTuner
+
+## BaseTunerLayer
+
+[[autodoc]] tuners.tuners_utils.BaseTunerLayer
--- a/docs/source/package_reference/tuners.mdx
+++ b/docs/source/package_reference/tuners.mdx
@ -1,39 +0,0 @@
-# Tuners
-
-Each tuner (or PEFT method) has a configuration and model.
-
-## LoRA
-
-For finetuning a model with LoRA.
-
-[[autodoc]] LoraConfig
-
-[[autodoc]] LoraModel
-
-[[autodoc]] tuners.lora.LoraLayer
-
-[[autodoc]] tuners.lora.Linear
-
-## P-tuning
-
-[[autodoc]] tuners.p_tuning.PromptEncoderConfig
-
-[[autodoc]] tuners.p_tuning.PromptEncoder
-
-## Prefix tuning
-
-[[autodoc]] tuners.prefix_tuning.PrefixTuningConfig
-
-[[autodoc]] tuners.prefix_tuning.PrefixEncoder
-
-## Prompt tuning
-
-[[autodoc]] tuners.prompt_tuning.PromptTuningConfig
-
-[[autodoc]] tuners.prompt_tuning.PromptEmbedding
-
-## IA3
-
-[[autodoc]] tuners.ia3.IA3Config
-
-[[autodoc]] tuners.ia3.IA3Model
--- a/docs/source/package_reference/vblora.md
+++ b/docs/source/package_reference/vblora.md
@ -0,0 +1,40 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks
+
+## Overview
+
+[VB-LoRA](https://huggingface.co/papers/2405.15179) is a parameter-efficient fine-tuning technique that extends LoRA by learning a fine-grained parameter-sharing scheme at the sub-vector level, achieving significantly higher parameter efficiency. This makes VB-LoRA especially useful in scenarios where storage and transmission costs are critical. It works by decomposing low-rank matrices—from different layers and modules such as K, Q, V, and FFN—into sub-vectors, which are then globally shared through a vector bank.
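How a sub-vector can be composed from a shared vector bank is sketched below in plain Python. It is a simplified illustration of a top-k admixture, with a hypothetical function name and a tiny hand-written bank; taking a softmax over only the selected logits is an assumption about the details.

```python
import math


def topk_admixture(logits, vector_bank, k):
    """Mix the k bank vectors with the largest logits, weighted by a softmax over them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(vector_bank[0])
    mixed = [sum(w * vector_bank[i][j] for w, i in zip(weights, top)) for j in range(dim)]
    return mixed, top


# A bank of three 2-dimensional vectors shared by every layer and module.
bank = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
sub_vector, chosen = topk_admixture([2.0, 2.0, -1.0], bank, k=2)
```

Each low-rank matrix is then assembled from many such sub-vectors, so what has to be stored per sub-vector is only its `k` selected indices and their weights, plus the single shared bank.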
+
+The abstract from the paper is:
+
+*As the adoption of large language models increases and the need for per-user or per-task model customization grows, the parameter-efficient fine-tuning (PEFT) methods, such as low-rank adaptation (LoRA) and its variants, incur substantial storage and transmission costs. To further reduce stored parameters, we introduce a "divide-and-share" paradigm that breaks the barriers of low-rank decomposition across matrix dimensions, modules and layers by sharing parameters globally via a vector bank. As an instantiation of the paradigm to LoRA, our proposed VB-LoRA composites all the low-rank matrices of LoRA from a shared vector bank with a differentiable top-k admixture module. VB-LoRA achieves extreme parameter efficiency while maintaining comparable or better performance compared to state-of-the-art PEFT methods. Extensive experiments demonstrate the effectiveness of VB-LoRA on natural language understanding, natural language generation, and instruction tuning tasks. When fine-tuning the Llama2-13B model, VB-LoRA only uses 0.4% of LoRA's stored parameters, yet achieves superior results.*
+
+## Usage Tips
+
+- VB-LoRA utilizes a sparse top-k module to learn the sharing mechanism. When saving adapter parameters, you can either save only the top-k weights and their indices by setting `save_only_topk_weights = True` in `VBLoRAConfig`, or save all the trainable logits by setting it to `False`. Enabling `save_only_topk_weights = True` significantly reduces storage space; for instance, in Llama2-7B, the storage file size decreases from 308MB to 2.5MB. Note that models saved with `save_only_topk_weights = True` are intended for merging or inference only and cannot be used to resume training.
+
+- VB-LoRA has two sets of training parameters: vector bank parameters and logit parameters. In practice, we found that logit parameters require a higher learning rate, while vector bank parameters require a lower learning rate. When using the AdamW optimizer, typical learning rates are 0.01 for logits and 0.001 for vector bank parameters.
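The two learning rates from the tip above can be expressed as optimizer parameter groups. The helper below is a hedged sketch: the function name is hypothetical, and the substrings `"vblora_logits"` and `"vblora_vector_bank"` are assumptions about how the parameters are named, so check the names in your own model before relying on them.

```python
def vblora_param_groups(named_params, logits_lr=0.01, bank_lr=0.001):
    """Split (name, parameter) pairs into two groups with separate learning rates.

    The name substrings used for matching are assumptions, not a documented contract.
    """
    logits, bank = [], []
    for name, param in named_params:
        if "vblora_logits" in name:
            logits.append(param)
        elif "vblora_vector_bank" in name:
            bank.append(param)
    return [
        {"params": logits, "lr": logits_lr},
        {"params": bank, "lr": bank_lr},
    ]


# Stand-in for model.named_parameters(); strings replace real tensors here.
fake_named_params = [
    ("layers.0.q_proj.vblora_logits_A.default", "logits_A"),
    ("layers.0.q_proj.vblora_logits_B.default", "logits_B"),
    ("vblora_vector_bank.default", "bank"),
    ("layers.0.q_proj.base_layer.weight", "frozen"),
]
groups = vblora_param_groups(fake_named_params)
```

With a real model, the returned list could be built from `model.named_parameters()` and passed as the first argument to `torch.optim.AdamW`, which accepts per-group `lr` values.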
+
+## VBLoRAConfig
+
+[[autodoc]] tuners.vblora.config.VBLoRAConfig
+
+## VBLoRAModel
+
+[[autodoc]] tuners.vblora.model.VBLoRAModel
+
--- a/docs/source/package_reference/vera.md
+++ b/docs/source/package_reference/vera.md
@ -0,0 +1,39 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# VeRA: Vector-based Random Matrix Adaptation
+
+[VeRA](https://huggingface.co/papers/2310.11454) is a parameter-efficient fine-tuning technique that is similar to LoRA but requires even fewer extra parameters while promising similar or even better performance. As such, it is particularly useful when the parameter budget is very limited, e.g. when scaling to very large models. The reduction of the count of trainable parameters is achieved by sharing the same low-rank matrices across all layers, and only training two additional vectors per layer.
+
+When saving the adapter parameters, it's possible to eschew storing the low rank matrices by setting `save_projection=False` on the `VeraConfig`. In that case, these matrices will be restored based on the fixed random seed from the `projection_prng_key` argument. This cuts down on the size of the checkpoint, but we cannot guarantee reproducibility on all devices and for all future versions of PyTorch. If you want to ensure reproducibility, set `save_projection=True` (which is the default).
+
+To handle different shapes of adapted layers, VeRA initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
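The shape bookkeeping in the example above can be checked with a short plain-Python sketch. Nested lists stand in for tensors, the helper names are hypothetical, and layer shapes are written as `(out_features, in_features)` to match the numbers in the paragraph.

```python
def build_shared(layer_shapes, rank):
    """Allocate shared A (rank x max_in) and B (max_out x rank) as nested lists."""
    max_out = max(out_f for out_f, _ in layer_shapes)
    max_in = max(in_f for _, in_f in layer_shapes)
    shared_a = [[0.0] * max_in for _ in range(rank)]
    shared_b = [[0.0] * rank for _ in range(max_out)]
    return shared_a, shared_b


def slice_for_layer(shared_a, shared_b, layer_shape):
    """Cut out the submatrices a layer of shape (out_features, in_features) needs."""
    out_f, in_f = layer_shape
    sub_a = [row[:in_f] for row in shared_a]
    sub_b = shared_b[:out_f]
    return sub_a, sub_b


def shape(matrix):
    return (len(matrix), len(matrix[0]))


rank = 4
layers = [(100, 20), (80, 50)]
shared_a, shared_b = build_shared(layers, rank)
sub_a, sub_b = slice_for_layer(shared_a, shared_b, (100, 20))
```

Here `shape(shared_a)` is `(4, 50)` and `shape(shared_b)` is `(100, 4)`, while the layer of shape (100, 20) receives slices of shape `(4, 20)` and `(100, 4)`.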
+
+VeRA currently has the following constraint:
+
+- Only `nn.Linear` layers are supported.
+
+The abstract from the paper is:
+
+> Low-rank adapation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models.
+
+## VeRAConfig
+
+[[autodoc]] tuners.vera.config.VeraConfig
+
+## VeRAModel
+
+[[autodoc]] tuners.vera.model.VeraModel
--- a/docs/source/package_reference/waveft.md
+++ b/docs/source/package_reference/waveft.md
@ -0,0 +1,35 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# WaveFT: Wavelet Fine-Tuning
+
+[WaveFT](https://arxiv.org/abs/2505.12532) is a novel parameter-efficient fine-tuning (PEFT) method that introduces sparse updates in the **wavelet domain** of residual matrices. Unlike LoRA, which is constrained by discrete low-rank choices, WaveFT enables fine-grained control over the number of trainable parameters by directly learning a sparse set of coefficients in the transformed space. These coefficients are then mapped back to the weight domain via the Inverse Discrete Wavelet Transform (IDWT), producing high-rank updates without incurring inference overhead.
+
+WaveFT currently has the following constraint:
+
+- Only `nn.Linear` layers are supported.
+
+The abstract from the paper is:
+
+>Efficiently adapting large foundation models is critical, especially with tight compute and memory budgets. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA offer limited granularity and effectiveness in few-parameter regimes. We propose Wavelet Fine-Tuning (WaveFT), a novel PEFT method that learns highly sparse updates in the wavelet domain of residual matrices. WaveFT allows precise control of trainable parameters, offering fine-grained capacity adjustment and excelling with remarkably low parameter count, potentially far fewer than LoRA’s minimum—ideal for extreme parameter-efficient scenarios. Evaluated on personalized text-to-image generation using Stable Diffusion XL as baseline, WaveFT significantly outperforms LoRA and other PEFT methods, especially at low parameter counts; achieving superior subject fidelity, prompt alignment, and image diversity.
+
+## WaveFTConfig
+
+[[autodoc]] tuners.waveft.config.WaveFTConfig
+
+## WaveFTModel
+
+[[autodoc]] tuners.waveft.model.WaveFTModel
--- a/docs/source/package_reference/xlora.md
+++ b/docs/source/package_reference/xlora.md
@ -0,0 +1,56 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# X-LoRA
+
+Mixture of LoRA Experts ([X-LoRA](https://huggingface.co/papers/2402.07148)) is a PEFT method that enables a sparse or dense mixture of LoRA experts based on a high-granularity (token, layer, sequence) scalings matrix. This leverages frozen LoRA adapters and a frozen base model to drastically reduce the number of parameters that need to be fine-tuned.
+
+A unique aspect of X-LoRA is its versatility: it can be applied to any `transformers` base model with LoRA adapters. This means that, despite the mixture of experts strategy, no changes to the model code are required.
+
+The graphic below shows how the scalings change token by token for different prompts. This highlights the activation of different adapters as the generation progresses and the sequence creates new context.
+
+![Token-by-token scalings](https://github.com/EricLBuehler/xlora/raw/master/res/token_by_token_scalings.gif)
+
+The abstract from the paper is:
+
+*We report a mixture of expert strategy to create fine-tuned large language models using a deep layer-wise token-level approach based on low-rank adaptation (LoRA). Starting with a set of pre-trained LoRA adapters, our gating strategy uses the hidden states to dynamically mix adapted layers, allowing the resulting X-LoRA model to draw upon different capabilities and create never-before-used deep layer-wise combinations to solve tasks. The design is inspired by the biological principles of universality and diversity, where neural network building blocks are reused in different hierarchical manifestations. Hence, the X-LoRA model can be easily implemented for any existing large language model (LLM) without a need for modifications of the underlying structure. We develop a tailored X-LoRA model that offers scientific capabilities including forward/inverse analysis tasks and enhanced reasoning capability, focused on biomaterial analysis, protein mechanics and design. The impact of this work includes access to readily expandable and adaptable models with strong domain knowledge and the capability to integrate across areas of knowledge. Featuring experts in biology, mathematics, reasoning, bio-inspired materials, mechanics and materials, chemistry, protein biophysics, mechanics and quantum-mechanics based molecular properties, we conduct a series of physics-focused case studies. We examine knowledge recall, protein mechanics forward/inverse tasks, protein design, adversarial agentic modeling including ontological knowledge graph construction, as well as molecular design. The model is capable not only of making quantitative predictions of nanomechanical properties of proteins or quantum mechanical molecular properties, but also reasons over the results and correctly predicts likely mechanisms that explain distinct molecular behaviors.*
+
+Please cite X-LoRA as:
+```bibtex
+@article{10.1063/5.0203126,
+    author = {Buehler, Eric L. and Buehler, Markus J.},
+    title = "{X-LoRA: Mixture of low-rank adapter experts, a flexible framework for large language models with applications in protein mechanics and molecular design}",
+    journal = {APL Machine Learning},
+    volume = {2},
+    number = {2},
+    pages = {026119},
+    year = {2024},
+    month = {05},
+    abstract = "{We report a mixture of expert strategy to create fine-tuned large language models using a deep layer-wise token-level approach based on low-rank adaptation (LoRA). Starting with a set of pre-trained LoRA adapters, our gating strategy uses the hidden states to dynamically mix adapted layers, allowing the resulting X-LoRA model to draw upon different capabilities and create never-before-used deep layer-wise combinations to solve tasks. The design is inspired by the biological principles of universality and diversity, where neural network building blocks are reused in different hierarchical manifestations. Hence, the X-LoRA model can be easily implemented for any existing large language model without a need for modifications of the underlying structure. We develop a tailored X-LoRA model that offers scientific capabilities, including forward/inverse analysis tasks and enhanced reasoning capability, focused on biomaterial analysis, protein mechanics, and design. The impact of this work includes access to readily expandable and adaptable models with strong domain knowledge and the capability to integrate across areas of knowledge. Featuring experts in biology, mathematics, reasoning, bio-inspired materials, mechanics and materials, chemistry, protein biophysics, mechanics, and quantum-mechanics based molecular properties, we conduct a series of physics-focused case studies. We examine knowledge recall, protein mechanics forward/inverse tasks, protein design, adversarial agentic modeling including ontological knowledge graph construction, and molecular design. The model is capable not only of making quantitative predictions of nanomechanical properties of proteins or quantum mechanical molecular properties but also reasoning over the results and correctly predicting likely mechanisms that explain distinct molecular behaviors.}",
+    issn = {2770-9019},
+    doi = {10.1063/5.0203126},
+    url = {https://doi.org/10.1063/5.0203126},
+    eprint = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0203126/19964043/026119\_1\_5.0203126.pdf},
+}
+```
+
+## XLoraConfig
+
+[[autodoc]] tuners.xlora.config.XLoraConfig
+
+## XLoraModel
+
+[[autodoc]] tuners.xlora.model.XLoraModel
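
Note on the scalings described in the X-LoRA page above: for a single token at a single layer, the mixing can be pictured as adding each frozen adapter's contribution to the base output, weighted by that adapter's scaling. The sketch below is a simplified illustration, not the X-LoRA implementation. The softmax over the gate logits, the two-adapter setup, and the names `softmax` and `mix_adapter_outputs` are assumptions made for the example, and the gate logits are hard-coded stand-ins for what a gating network would produce from the hidden states.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_adapter_outputs(base_out, adapter_outs, scalings):
    """Add each adapter's contribution to the base output, weighted by its scaling."""
    mixed = list(base_out)
    for delta, scale in zip(adapter_outs, scalings):
        for j, value in enumerate(delta):
            mixed[j] += scale * value
    return mixed


# One token, one layer, two hypothetical frozen LoRA adapters.
base_out = [1.0, 2.0]                     # output of the frozen base layer
adapter_outs = [[0.5, 0.0], [0.0, -1.0]]  # each adapter's low-rank contribution
gate_logits = [0.0, 0.0]                  # stand-in for the gating network output

scalings = softmax(gate_logits)           # one scaling per adapter for this token
mixed = mix_adapter_outputs(base_out, adapter_outs, scalings)
print(scalings, mixed)
```

Because the scalings are recomputed per token and per layer, different adapters can dominate at different points in the sequence, which is the behavior the animated graphic illustrates.
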
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@ -0,0 +1,164 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Quicktour
+
+PEFT offers parameter-efficient methods for finetuning large pretrained models. The traditional paradigm is to finetune all of a model's parameters for each downstream task, but this is becoming exceedingly costly and impractical because of the enormous number of parameters in models today. Instead, it is more efficient to train a smaller number of prompt parameters or use a reparametrization method like low-rank adaptation (LoRA) to reduce the number of trainable parameters.
+
+This quicktour will show you PEFT's main features and how you can train or run inference on large models that would typically be inaccessible on consumer devices.
+
+## Train
+
+Each PEFT method is defined by a [`PeftConfig`] class that stores all the important parameters for building a [`PeftModel`]. For example, to train with LoRA, load and create a [`LoraConfig`] class and specify the following parameters:
+
+- `task_type`: the task to train for (sequence-to-sequence language modeling in this case)
+- `inference_mode`: whether you're using the model for inference or not
+- `r`: the dimension of the low-rank matrices
+- `lora_alpha`: the scaling factor for the low-rank matrices
+- `lora_dropout`: the dropout probability of the LoRA layers
+
+```python
+from peft import LoraConfig, TaskType
+
+peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
+```
+
+> [!TIP]
+> See the [`LoraConfig`] reference for more details about other parameters you can adjust, such as the modules to target or the bias type.
+
+Once the [`LoraConfig`] is set up, create a [`PeftModel`] with the [`get_peft_model`] function. It takes a base model - which you can load from the Transformers library - and the [`LoraConfig`] containing the parameters for how to configure a model for training with LoRA.
+
+Load the base model you want to finetune.
+
+```python
+from transformers import AutoModelForSeq2SeqLM
+
+model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
+```
+
+Wrap the base model and `peft_config` with the [`get_peft_model`] function to create a [`PeftModel`]. To get a sense of the number of trainable parameters in your model, use the [`print_trainable_parameters`] method.
+
+```python
+from peft import get_peft_model
+
+model = get_peft_model(model, peft_config)
+model.print_trainable_parameters()
+"output: trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282"
+```
+
+Out of [bigscience/mt0-large's](https://huggingface.co/bigscience/mt0-large) 1.2B parameters, you're only training 0.19% of them!
+
+That is it 🎉! Now you can train the model with the Transformers [`~transformers.Trainer`], Accelerate, or any custom PyTorch training loop.
+
+For example, to train with the [`~transformers.Trainer`] class, set up a [`~transformers.TrainingArguments`] class with some training hyperparameters.
+
+```py
+from transformers import TrainingArguments
+
+training_args = TrainingArguments(
+    output_dir="your-name/bigscience/mt0-large-lora",
+    learning_rate=1e-3,
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=32,
+    num_train_epochs=2,
+    weight_decay=0.01,
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+)
+```
+
+Pass the model, training arguments, dataset, tokenizer, and any other necessary component to the [`~transformers.Trainer`], and call [`~transformers.Trainer.train`] to start training.
+
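+```py
+from transformers import Trainer
+
+# `tokenized_datasets`, `tokenizer`, `data_collator`, and `compute_metrics` are assumed to be defined beforehand.
+trainer = Trainer(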
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["test"],
+    processing_class=tokenizer,
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,
+)
+
+trainer.train()
+```
+
+### Save model
+
+After your model is finished training, you can save your model to a directory using the [`~transformers.PreTrainedModel.save_pretrained`] function.
+
+```py
+model.save_pretrained("output_dir")
+```
+
+You can also save your model to the Hub (make sure you're logged in to your Hugging Face account first) with the [`~transformers.PreTrainedModel.push_to_hub`] function.
+
+```python
+from huggingface_hub import notebook_login
+
+notebook_login()
+model.push_to_hub("your-name/mt0-large-lora")  # Hub repo ids take the form "namespace/repo-name"
+```
+
+Both methods save only the extra PEFT weights that were trained, which makes the adapter very efficient to store, transfer, and load. For example, this [facebook/opt-350m](https://huggingface.co/ybelkada/opt-350m-lora) model trained with LoRA contains only two files: `adapter_config.json` and `adapter_model.safetensors`. The `adapter_model.safetensors` file is just 6.3MB!
+
+<div class="flex flex-col justify-center">
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/PEFT-hub-screenshot.png"/>
+  <figcaption class="text-center">The adapter weights for an opt-350m model stored on the Hub are only ~6MB compared to the full size of the model weights, which can be ~700MB.</figcaption>
+</div>
+
+## Inference
+
+> [!TIP]
+> Take a look at the [AutoPeftModel](package_reference/auto_class) API reference for a complete list of available `AutoPeftModel` classes.
+
+Easily load any PEFT-trained model for inference with the [`AutoPeftModel`] class and the [`~transformers.PreTrainedModel.from_pretrained`] method:
+
+```py
+from peft import AutoPeftModelForCausalLM
+from transformers import AutoTokenizer
+import torch
+
+model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora")
+tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
+
+model = model.to("cuda")
+model.eval()
+inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")
+
+outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
+print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])
+
+"Preheat the oven to 350 degrees and place the cookie dough in the center of the oven. In a large bowl, combine the flour, baking powder, baking soda, salt, and cinnamon. In a separate bowl, combine the egg yolks, sugar, and vanilla."
+```
+
+For other tasks that aren't explicitly supported with an `AutoPeftModelFor` class - such as automatic speech recognition - you can still use the base [`AutoPeftModel`] class to load a model for the task.
+
+```py
+from peft import AutoPeftModel
+
+model = AutoPeftModel.from_pretrained("smangrul/openai-whisper-large-v2-LORA-colab")
+```
+
+## Next steps
+
+Now that you've seen how to train a model with one of the PEFT methods, we encourage you to try out some of the other methods like prompt tuning. The steps are very similar to the ones shown in the quicktour:
+
+1. prepare a [`PeftConfig`] for a PEFT method
+2. use the [`get_peft_model`] method to create a [`PeftModel`] from the configuration and base model
+
+Then you can train it however you like! To load a PEFT model for inference, you can use the [`AutoPeftModel`] class.
+
+Feel free to also take a look at the task guides if you're interested in training a model with another PEFT method for a specific task such as semantic segmentation, multilingual automatic speech recognition, DreamBooth, token classification, and more.
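
Note on the trainable-parameter figure quoted in the quicktour above: the 0.19% value follows directly from the two counts printed by `print_trainable_parameters()`, and the trainable count itself can be reproduced with a little arithmetic. The architecture numbers below are assumptions that the quicktour does not state: a projection width of 1024, 24 encoder and 24 decoder layers, and LoRA applied to the query and value projections of every attention block. They are used here only because they reproduce the printed count.

```python
# Values printed by print_trainable_parameters() in the quicktour.
trainable_params = 2_359_296
all_params = 1_231_940_608

trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")

# Assumed architecture (not stated in the quicktour): query/value projections
# of size 1024 x 1024, 24 encoder layers and 24 decoder layers, with LoRA on
# the query and value projections of every attention block.
r = 8
d_in = d_out = 1024
params_per_adapted_linear = r * (d_in + d_out)  # A is (r, d_in), B is (d_out, r)

encoder_attention_blocks = 24      # self-attention only
decoder_attention_blocks = 24 * 2  # self-attention and cross-attention
adapted_linears = 2 * (encoder_attention_blocks + decoder_attention_blocks)  # q and v

estimated_trainable = adapted_linears * params_per_adapted_linear
print(estimated_trainable, estimated_trainable == trainable_params)
```

Under these assumptions each adapted projection adds `r * (d_in + d_out)` parameters, so doubling `r` would roughly double the trainable count while leaving the base model size unchanged.
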
--- a/docs/source/quicktour.mdx
+++ b/docs/source/quicktour.mdx
@ -1,145 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Quicktour
-
-🤗 PEFT contains parameter-efficient finetuning methods for training large pretrained models. The traditional paradigm is to finetune all of a model's parameters for each downstream task, but this is becoming exceedingly costly and impractical because of the enormous number of parameters in models today. Instead, it is more efficient to train a smaller number of prompt parameters or use a reparametrization method like low-rank adaptation (LoRA) to reduce the number of trainable parameters. 
-
-This quicktour will show you 🤗 PEFT's main features and help you train large pretrained models that would typically be inaccessible on consumer devices. You'll see how to train the 1.2B parameter [`bigscience/mt0-large`](https://huggingface.co/bigscience/mt0-large) model with LoRA to generate a classification label and use it for inference.
-
-## PeftConfig
-
-Each 🤗 PEFT method is defined by a [`PeftConfig`] class that stores all the important parameters for building a [`PeftModel`]. 
-
-Because you're going to use LoRA, you'll need to load and create a [`LoraConfig`] class. Within `LoraConfig`, specify the following parameters:
-
- the `task_type`, or sequence-to-sequence language modeling in this case
- `inference_mode`, whether you're using the model for inference or not
- `r`, the dimension of the low-rank matrices
- `lora_alpha`, the scaling factor for the low-rank matrices
- `lora_dropout`, the dropout probability of the LoRA layers
-
-```python
-from peft import LoraConfig, TaskType
-
-peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
-```
-
-<Tip>
-
-💡 See the [`LoraConfig`] reference for more details about other parameters you can adjust.
-
-</Tip>
-
-## PeftModel
-
-A [`PeftModel`] is created by the [`get_peft_model`] function. It takes a base model - which you can load from the 🤗 Transformers library - and the [`PeftConfig`] containing the instructions for how to configure a model for a specific 🤗 PEFT method.
-
-Start by loading the base model you want to finetune.
-
-```python
-from transformers import AutoModelForSeq2SeqLM
-
-model_name_or_path = "bigscience/mt0-large"
-tokenizer_name_or_path = "bigscience/mt0-large"
-model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
-```
-
-Wrap your base model and `peft_config` with the `get_peft_model` function to create a [`PeftModel`]. To get a sense of the number of trainable parameters in your model, use the [`print_trainable_parameters`] method. In this case, you're only training 0.19% of the model's parameters! 🤏
-
-```python
-from peft import get_peft_model
-
-model = get_peft_model(model, peft_config)
-model.print_trainable_parameters()
-"output: trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282"
-```
-
-That is it 🎉! Now you can train the model using the 🤗 Transformers [`~transformers.Trainer`], 🤗 Accelerate, or any custom PyTorch training loop.
-
-## Save and load a model
-
-After your model is finished training, you can save your model to a directory using the [`~transformers.PreTrainedModel.save_pretrained`] function. You can also save your model to the Hub (make sure you log in to your Hugging Face account first) with the [`~transformers.PreTrainedModel.push_to_hub`] function.
-
-```python
-model.save_pretrained("output_dir")
-
-# if pushing to Hub
-from huggingface_hub import notebook_login
-
-notebook_login()
-model.push_to_hub("my_awesome_peft_model")
-```
-
-This only saves the incremental 🤗 PEFT weights that were trained, meaning it is super efficient to store, transfer, and load. For example, this [`bigscience/T0_3B`](https://huggingface.co/smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM) model trained with LoRA on the [`twitter_complaints`](https://huggingface.co/datasets/ought/raft/viewer/twitter_complaints/train) subset of the RAFT [dataset](https://huggingface.co/datasets/ought/raft) only contains two files: `adapter_config.json` and `adapter_model.bin`. The latter file is just 19MB!
-
-Easily load your model for inference using the [`~transformers.PreTrainedModel.from_pretrained`] function:
-
-```diff
-  from transformers import AutoModelForSeq2SeqLM
-+ from peft import PeftModel, PeftConfig
-
-+ peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
-+ config = PeftConfig.from_pretrained(peft_model_id)
-  model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
-+ model = PeftModel.from_pretrained(model, peft_model_id)
-  tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
-
-  model = model.to(device)
-  model.eval()
-  inputs = tokenizer("Tweet text : @HondaCustSvc Your customer service has been horrible during the recall process. I will never purchase a Honda again. Label :", return_tensors="pt")
-
-  with torch.no_grad():
-      outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
-      print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])
-  'complaint'
-```
-
-## Easy loading with Auto classes 
-
-If you have saved your adapter locally or on the Hub, you can leverage the `AutoPeftModelForxxx` classes and load any PEFT model with a single line of code:
-
-```diff
- from peft import PeftConfig, PeftModel
- from transformers import AutoModelForCausalLM
-+ from peft import AutoPeftModelForCausalLM
-
- peft_config = PeftConfig.from_pretrained("ybelkada/opt-350m-lora") 
- base_model_path = peft_config.base_model_name_or_path
- transformers_model = AutoModelForCausalLM.from_pretrained(base_model_path)
- peft_model = PeftModel.from_pretrained(transformers_model, peft_config)
-+ peft_model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora")
-```
-
-Currently, supported auto classes are: `AutoPeftModelForCausalLM`, `AutoPeftModelForSequenceClassification`, `AutoPeftModelForSeq2SeqLM`, `AutoPeftModelForTokenClassification`, `AutoPeftModelForQuestionAnswering` and `AutoPeftModelForFeatureExtraction`. For other tasks (e.g. Whisper, StableDiffusion), you can load the model with:
-
-```diff
- from peft import PeftModel, PeftConfig, AutoPeftModel
-+ from peft import AutoPeftModel
- from transformers import WhisperForConditionalGeneration
-
- model_id = "smangrul/openai-whisper-large-v2-LORA-colab"
-
-peft_model_id = "smangrul/openai-whisper-large-v2-LORA-colab"
- peft_config = PeftConfig.from_pretrained(peft_model_id)
- model = WhisperForConditionalGeneration.from_pretrained(
-     peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
- )
- model = PeftModel.from_pretrained(model, peft_model_id)
-+ model = AutoPeftModel.from_pretrained(peft_model_id)
-```
-
-## Next steps
-
-Now that you've seen how to train a model with one of the 🤗 PEFT methods, we encourage you to try out some of the other methods like prompt tuning. The steps are very similar to the ones shown in this quickstart; prepare a [`PeftConfig`] for a 🤗 PEFT method, and use the `get_peft_model` to create a [`PeftModel`] from the configuration and base model. Then you can train it however you like!
-
-Feel free to also take a look at the task guides if you're interested in training a model with a 🤗 PEFT method for a specific task such as semantic segmentation, multilingual automatic speech recognition, DreamBooth, and token classification.
--- a/docs/source/task_guides/clm-prompt-tuning.mdx
+++ b/docs/source/task_guides/clm-prompt-tuning.mdx
@ -1,289 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Prompt tuning for causal language modeling
-
-[[open-in-colab]]
-
-Prompting helps guide language model behavior by adding some input text specific to a task. Prompt tuning is an additive method for only training and updating the newly added prompt tokens to a pretrained model. This way, you can use one pretrained model whose weights are frozen, and train and update a smaller set of prompt parameters for each downstream task instead of fully finetuning a separate model. As models grow larger and larger, prompt tuning can be more efficient, and results are even better as model parameters scale.
-
-<Tip>
-
-💡 Read [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691) to learn more about prompt tuning.
-
-</Tip>
-
-This guide will show you how to apply prompt tuning to train a [`bloomz-560m`](https://huggingface.co/bigscience/bloomz-560m) model on the `twitter_complaints` subset of the [RAFT](https://huggingface.co/datasets/ought/raft) dataset.
-
-Before you begin, make sure you have all the necessary libraries installed:
-
-```bash
-!pip install -q peft transformers datasets
-```
-
-## Setup
-
-Start by defining the model and tokenizer, the dataset and the dataset columns to train on, some training hyperparameters, and the [`PromptTuningConfig`]. The [`PromptTuningConfig`] contains information about the task type, the text to initialize the prompt embedding, the number of virtual tokens, and the tokenizer to use:
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
-from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
-import torch
-from datasets import load_dataset
-import os
-from torch.utils.data import DataLoader
-from tqdm import tqdm
-
-device = "cuda"
-model_name_or_path = "bigscience/bloomz-560m"
-tokenizer_name_or_path = "bigscience/bloomz-560m"
-peft_config = PromptTuningConfig(
-    task_type=TaskType.CAUSAL_LM,
-    prompt_tuning_init=PromptTuningInit.TEXT,
-    num_virtual_tokens=8,
-    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
-    tokenizer_name_or_path=model_name_or_path,
-)
-
-dataset_name = "twitter_complaints"
-checkpoint_name = f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}_v1.pt".replace(
-    "/", "_"
-)
-text_column = "Tweet text"
-label_column = "text_label"
-max_length = 64
-lr = 3e-2
-num_epochs = 50
-batch_size = 8
-```
-
-## Load dataset
-
-For this guide, you'll load the `twitter_complaints` subset of the [RAFT](https://huggingface.co/datasets/ought/raft) dataset. This subset contains tweets that are labeled either `complaint` or `no complaint`:
-
-```py
-dataset = load_dataset("ought/raft", dataset_name)
-dataset["train"][0]
-{"Tweet text": "@HMRCcustomers No this is my first job", "ID": 0, "Label": 2}
-```
-
-To make the `Label` column more readable, replace the `Label` value with the corresponding label text and store them in a `text_label` column. You can use the [`~datasets.Dataset.map`] function to apply this change over the entire dataset in one step:
-
-```py
-classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
-dataset = dataset.map(
-    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
-    batched=True,
-    num_proc=1,
-)
-dataset["train"][0]
-{"Tweet text": "@HMRCcustomers No this is my first job", "ID": 0, "Label": 2, "text_label": "no complaint"}
-```
-
-## Preprocess dataset
-
-Next, you'll setup a tokenizer; configure the appropriate padding token to use for padding sequences, and determine the maximum length of the tokenized labels:
-
-```py
-tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
-if tokenizer.pad_token_id is None:
-    tokenizer.pad_token_id = tokenizer.eos_token_id
-target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])
-print(target_max_length)
-3
-```
-
-Create a `preprocess_function` to:
-
-1. Tokenize the input text and labels.
-2. For each example in a batch, pad the labels with the tokenizers `pad_token_id`.
-3. Concatenate the input text and labels into the `model_inputs`.
-4. Create a separate attention mask for `labels` and `model_inputs`.
-5. Loop through each example in the batch again to pad the input ids, labels, and attention mask to the `max_length` and convert them to PyTorch tensors.
-
-```py
-def preprocess_function(examples):
-    batch_size = len(examples[text_column])
-    inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
-    targets = [str(x) for x in examples[label_column]]
-    model_inputs = tokenizer(inputs)
-    labels = tokenizer(targets)
-    for i in range(batch_size):
-        sample_input_ids = model_inputs["input_ids"][i]
-        label_input_ids = labels["input_ids"][i] + [tokenizer.pad_token_id]
-        # print(i, sample_input_ids, label_input_ids)
-        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
-        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
-        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])
-    # print(model_inputs)
-    for i in range(batch_size):
-        sample_input_ids = model_inputs["input_ids"][i]
-        label_input_ids = labels["input_ids"][i]
-        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
-            max_length - len(sample_input_ids)
-        ) + sample_input_ids
-        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
-            "attention_mask"
-        ][i]
-        labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
-        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
-        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
-        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
-    model_inputs["labels"] = labels["input_ids"]
-    return model_inputs
-```
-
-Use the [`~datasets.Dataset.map`] function to apply the `preprocess_function` to the entire dataset. You can remove the unprocessed columns since the model won't need them:
-
-```py
-processed_datasets = dataset.map(
-    preprocess_function,
-    batched=True,
-    num_proc=1,
-    remove_columns=dataset["train"].column_names,
-    load_from_cache_file=False,
-    desc="Running tokenizer on dataset",
-)
-```
-
-Create a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) from the `train` and `eval` datasets. Set `pin_memory=True` to speed up the data transfer to the GPU during training if the samples in your dataset are on a CPU.
-
-```py
-train_dataset = processed_datasets["train"]
-eval_dataset = processed_datasets["test"]
-
-
-train_dataloader = DataLoader(
-    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
-)
-eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)
-```
-
-## Train
-
-You're almost ready to set up your model and start training!
-
-Initialize a base model from [`~transformers.AutoModelForCausalLM`], and pass it and `peft_config` to the [`get_peft_model`] function to create a [`PeftModel`]. You can print the new [`PeftModel`]'s trainable parameters to see how much more efficient it is than training the full parameters of the original model!
-
-```py
-model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
-model = get_peft_model(model, peft_config)
-# print_trainable_parameters() prints the summary itself and returns None
-model.print_trainable_parameters()
-# trainable params: 8192 || all params: 559222784 || trainable%: 0.0014648902430985358
-```
-
-Set up an optimizer and learning rate scheduler:
-
-```py
-optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
-lr_scheduler = get_linear_schedule_with_warmup(
-    optimizer=optimizer,
-    num_warmup_steps=0,
-    num_training_steps=(len(train_dataloader) * num_epochs),
-)
-```
-
-Move the model to the GPU, then write a training loop to start training!
-
-```py
-model = model.to(device)
-
-for epoch in range(num_epochs):
-    model.train()
-    total_loss = 0
-    for step, batch in enumerate(tqdm(train_dataloader)):
-        batch = {k: v.to(device) for k, v in batch.items()}
-        outputs = model(**batch)
-        loss = outputs.loss
-        total_loss += loss.detach().float()
-        loss.backward()
-        optimizer.step()
-        lr_scheduler.step()
-        optimizer.zero_grad()
-
-    model.eval()
-    eval_loss = 0
-    eval_preds = []
-    for step, batch in enumerate(tqdm(eval_dataloader)):
-        batch = {k: v.to(device) for k, v in batch.items()}
-        with torch.no_grad():
-            outputs = model(**batch)
-        loss = outputs.loss
-        eval_loss += loss.detach().float()
-        eval_preds.extend(
-            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
-        )
-
-    eval_epoch_loss = eval_loss / len(eval_dataloader)
-    eval_ppl = torch.exp(eval_epoch_loss)
-    train_epoch_loss = total_loss / len(train_dataloader)
-    train_ppl = torch.exp(train_epoch_loss)
-    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")
-```
-
-## Share model
-
-You can store and share your model on the Hub if you'd like. Log in to your Hugging Face account and enter your token when prompted:
-
-```py
-from huggingface_hub import notebook_login
-
-notebook_login()
-```
-
-Use the [`~transformers.PreTrainedModel.push_to_hub`] function to upload your model to a model repository on the Hub:
-
-```py
-peft_model_id = "your-name/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
-# The token stored by notebook_login() is used for authentication
-model.push_to_hub(peft_model_id)
-```
-
-Once the model is uploaded, you'll see the model file size is only 33.5kB! 🤏
-
-## Inference
-
-Let's try the model on a sample input for inference. If you look at the repository you uploaded the model to, you'll see an `adapter_config.json` file. Load this file into [`PeftConfig`] to specify the `peft_type` and `task_type`. Then you can load the prompt-tuned model weights and the configuration into [`~PeftModel.from_pretrained`] to create the [`PeftModel`]:
-
-```py
-from peft import PeftModel, PeftConfig
-
-peft_model_id = "stevhliu/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
-
-config = PeftConfig.from_pretrained(peft_model_id)
-model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
-model = PeftModel.from_pretrained(model, peft_model_id)
-```
-
-Grab a tweet and tokenize it:
-
-```py
-inputs = tokenizer(
-    f'{text_column} : {"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?"} Label : ',
-    return_tensors="pt",
-)
-```
-
-Put the model on a GPU and *generate* the predicted label:
-
-```py
-model.to(device)
-
-with torch.no_grad():
-    inputs = {k: v.to(device) for k, v in inputs.items()}
-    outputs = model.generate(
-        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
-    )
-    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
-# [
-#     "Tweet text : @nationalgridus I have no water and the bill is current and paid. Can you do something about this? Label : complaint"
-# ]
-```