Compare commits

..

4 Commits

SHA1 Message Date
0ecb993601 usage tips 2025-10-15 14:08:54 -07:00
d1d5d4d758 fixes 2025-10-15 11:20:56 -07:00
dc570c7505 remove result 2025-10-15 11:20:56 -07:00
daf6069c48 standardize 2025-10-15 11:20:54 -07:00
2235 changed files with 67542 additions and 69783 deletions

View File

@ -46,8 +46,8 @@ jobs:
- run: uv pip install -U -e .
- run: echo 'export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)"' >> "$BASH_ENV" && source "$BASH_ENV"
- run: mkdir -p test_preparation
- run: python utils/tests_fetcher.py | tee tests_fetched_summary.txt || true
- run: python utils/tests_fetcher.py --filter_tests || true
- run: python utils/tests_fetcher.py | tee tests_fetched_summary.txt
- run: python utils/tests_fetcher.py --filter_tests
- run: export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)" && echo $GIT_COMMIT_MESSAGE && python .circleci/create_circleci_config.py --fetcher_folder test_preparation
- run: |
if [ ! -s test_preparation/generated_config.yml ]; then
@ -98,8 +98,8 @@ jobs:
- run: uv pip install -U -e .
- run: echo 'export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)"' >> "$BASH_ENV" && source "$BASH_ENV"
- run: mkdir -p test_preparation
- run: python utils/tests_fetcher.py --fetch_all | tee tests_fetched_summary.txt || true
- run: python utils/tests_fetcher.py --filter_tests || true
- run: python utils/tests_fetcher.py --fetch_all | tee tests_fetched_summary.txt
- run: python utils/tests_fetcher.py --filter_tests
- run: export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)" && echo $GIT_COMMIT_MESSAGE && python .circleci/create_circleci_config.py --fetcher_folder test_preparation
- run: |
if [ ! -s test_preparation/generated_config.yml ]; then

View File

@ -0,0 +1,53 @@
## Sentence structure
- Write short, declarative sentences most of the time.
- Vary sentence length to avoid sounding robotic. Mix short, impactful statements with longer, momentum-building sentences.
- Every time you use a comma, ask whether you can use a period instead.
- Avoid repeating the same words in a paragraph. Use synonyms or rephrase.
## Voice and tone
- Write like humans speak. Avoid corporate jargon and marketing fluff.
- Be confident and direct. Avoid softening phrases like "I think", "maybe", or "could".
- Use active voice instead of passive voice.
- Use positive phrasing - say what something *is* rather than what it *isn't*.
- Say "you" more than "we" when addressing external audiences.
- Use contractions like "I'll", "won't", and "can't" for a warmer tone.
## Specificity and evidence
- Be specific with facts and data instead of vague superlatives.
- Back up claims with concrete examples or metrics.
- Highlight customers and community members over company achievements.
- Use realistic, product-based examples instead of `foo/bar/baz` in code.
- Make content concrete, visual, and falsifiable.
## Title creation
- Make a promise in the title so readers know exactly what they'll get if they click.
- Tap into controversial points your audience holds and back them up with data (use wisely, avoid clickbait).
- Share something uniquely helpful that makes readers better at meaningful aspects of their lives.
- Avoid vague titles like "My Thoughts on XYZ". Titles should be opinions or shareable facts.
- Write placeholder titles first, complete the content, then spend time iterating on titles at the end.
## Ban phrases
- Avoid using "You can"
## Avoid LLM patterns
- Replace em dashes (—) with semicolons, commas, or sentence breaks.
- Avoid starting responses with "Great question!", "You're right!", or "Let me help you."
- Don't use phrases like "Let's dive into..."
- Skip cliché intros like "In today's fast-paced digital world" or "In the ever-evolving landscape of"
- Avoid phrases like "it's not just [x], it's [y]"
- Don't use high-school essay closers: "In conclusion,", "Overall,", or "To summarize"
- Avoid numbered lists in cases where bullets work better.
- Replace "In conclusion" with direct statements.
- Avoid hedge words: "might", "perhaps", "potentially" unless uncertainty is real.
- Don't stack hedging phrases: "may potentially", "it's important to note that".
- Don't create perfectly symmetrical paragraphs or lists that start with "Firstly... Secondly..."
- Avoid title-case headings: prefer sentence casing.
- Remove Unicode artifacts when copy-pasting: smart quotes (“ ”), em-dashes, non-breaking spaces.
- Use straight quotes (') instead of smart quotes.
- Delete empty citation placeholders like "[1]" with no actual source
## Punctuation and formatting
- Use Oxford commas consistently
- Use exclamation points sparingly
- Sentences can start with "But" and "And" but don't overuse
- Use periods instead of commas when possible for clarity

View File

@ -22,6 +22,7 @@ tests/generation/ @gante
/src/transformers/models/auto/ @ArthurZucker
/src/transformers/utils/ @ArthurZucker @Rocketknight1
/src/transformers/loss/ @ArthurZucker
/src/transformers/onnx/ @michaelbenayoun
# Specific files come after the sections/globs, so they take priority
/.circleci/config.yml @ArthurZucker @ydshieh

View File

@ -1,10 +1,7 @@
name: Self-hosted runner (benchmark)
on:
push:
branches: [main]
pull_request:
types: [ opened, labeled, reopened, synchronize ]
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
@ -12,8 +9,6 @@ concurrency:
env:
HF_HOME: /mnt/cache
DATASET_ID: hf-benchmarks/transformers
MODEL_ID: meta-llama/Llama-3.1-8B-Instruct
jobs:
benchmark:
@ -28,20 +23,35 @@ jobs:
(github.event_name == 'pull_request' && contains( github.event.pull_request.labels.*.name, 'run-benchmark') )||
(github.event_name == 'push' && github.ref == 'refs/heads/main')
container:
image: huggingface/transformers-all-latest-gpu
image: huggingface/transformers-pytorch-gpu
options: --gpus all --privileged --ipc host
steps:
- name: Get repo
uses: actions/checkout@v5
uses: actions/checkout@v4
with:
fetch-depth: 1
ref: ${{ github.event.pull_request.head.sha || github.sha }}
- name: Install libpq-dev & psql
run: |
apt update
apt install -y libpq-dev postgresql-client
- name: Install benchmark script dependencies
run: python3 -m pip install -r benchmark_v2/requirements.txt kernels
run: python3 -m pip install -r benchmark/requirements.txt
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e ".[torch]"
- name: Run database init script
run: |
psql -f benchmark/utils/init_db.sql
env:
PGDATABASE: metrics
PGHOST: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGHOST }}
PGUSER: transformers_benchmarks
PGPASSWORD: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGPASSWORD }}
- name: Run benchmark
run: |
git config --global --add safe.directory /__w/transformers/transformers
@ -51,11 +61,13 @@ jobs:
commit_id=$GITHUB_SHA
fi
commit_msg=$(git show -s --format=%s | cut -c1-70)
python3 benchmark_v2/run_benchmarks.py -b 32 -s 128 -n 256 --level 2 --branch-name "$BRANCH_NAME" --commit-id "$commit_id" --commit-message "$commit_msg" --model-id "$MODEL_ID" --log-level INFO --push-result-to-dataset "$DATASET_ID"
python3 benchmark/benchmarks_entrypoint.py "huggingface/transformers" "$BRANCH_NAME" "$commit_id" "$commit_msg"
env:
HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
PUSH_TO_HUB_TOKEN: ${{ secrets.PUSH_TO_HUB_TOKEN }}
# Enable this to see debug logs
# HF_HUB_VERBOSITY: debug
# TRANSFORMERS_VERBOSITY: debug
PGHOST: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGHOST }}
PGUSER: transformers_benchmarks
PGPASSWORD: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGPASSWORD }}
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}

View File

@ -9,7 +9,7 @@ jobs:
uses: ./.github/workflows/benchmark_v2.yml
with:
runner: aws-g5-4xlarge-cache-use1-public-80
container_image: huggingface/transformers-all-latest-gpu
container_image: huggingface/transformers-pytorch-gpu
container_options: --gpus all --privileged --ipc host --shm-size "16gb"
commit_sha: ${{ github.sha }}
run_id: ${{ github.run_id }}

View File

@ -45,59 +45,33 @@ jobs:
REF=main
push: true
tags: huggingface/transformers-all-latest-gpu${{ inputs.image_postfix }}
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
title: 🤗 Results of the transformers-all-latest-gpu docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
flash-attn-ci-image:
name: "PyTorch with Flash Attn [dev]"
runs-on:
group: aws-general-8-plus
steps:
# Push CI images still need to be re-built daily
-
name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
-
name: Check out code
uses: actions/checkout@v4
-
name: Login to DockerHub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
-
name: Build and push
name: Build and push (for Push CI) on a daily basis
# This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
# The latter case is useful for manual image building for debugging purposes. Use another tag in this case!
if: inputs.image_postfix != '-push-ci'
uses: docker/build-push-action@v5
with:
context: ./docker/transformers-all-latest-gpu
build-args: |
REF=main
PYTORCH=2.8.0
TORCHCODEC=0.7.0
FLASH_ATTN=yes
push: true
tags: huggingface/transformers-all-latest-gpu${{ inputs.image_postfix }}:flash-attn
tags: huggingface/transformers-all-latest-gpu-push-ci
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
title: 🤗 Results of the transformers-all-latest-gpu docker build
title: 🤗 Results of the transformers-all-latest-gpu-push-ci docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
latest-torch-deepspeed-docker:
name: "Latest PyTorch + DeepSpeed"
runs-on:
group: aws-general-8-plus
group: aws-g4dn-2xlarge-cache
steps:
-
name: Set up Docker Buildx
@ -130,8 +104,51 @@ jobs:
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
# Can't build 2 images in a single job `latest-torch-deepspeed-docker` (for `nvcr.io/nvidia`)
latest-torch-deepspeed-docker-for-push-ci-daily-build:
name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)"
runs-on:
group: aws-general-8-plus
steps:
-
name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
-
name: Check out code
uses: actions/checkout@v4
-
name: Login to DockerHub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
# Push CI images still need to be re-built daily
-
name: Build and push (for Push CI) on a daily basis
# This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
# The latter case is useful for manual image building for debugging purposes. Use another tag in this case!
if: inputs.image_postfix != '-push-ci'
uses: docker/build-push-action@v5
with:
context: ./docker/transformers-pytorch-deepspeed-latest-gpu
build-args: |
REF=main
push: true
tags: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
title: 🤗 Results of the transformers-pytorch-deepspeed-latest-gpu-push-ci docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
doc-builder:
name: "Doc builder"
# Push CI doesn't need this image
if: inputs.image_postfix != '-push-ci'
runs-on:
group: aws-general-8-plus
steps:
@ -164,6 +181,44 @@ jobs:
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
latest-pytorch:
name: "Latest PyTorch [dev]"
# Push CI doesn't need this image
if: inputs.image_postfix != '-push-ci'
runs-on:
group: aws-general-8-plus
steps:
-
name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
-
name: Check out code
uses: actions/checkout@v4
-
name: Login to DockerHub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
-
name: Build and push
uses: docker/build-push-action@v5
with:
context: ./docker/transformers-pytorch-gpu
build-args: |
REF=main
push: true
tags: huggingface/transformers-pytorch-gpu
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
title: 🤗 Results of the huggingface/transformers-pytorch-gpu docker build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
latest-pytorch-amd:
name: "Latest PyTorch (AMD) [dev]"
runs-on:
@ -190,47 +245,29 @@ jobs:
REF=main
push: true
tags: huggingface/transformers-pytorch-amd-gpu${{ inputs.image_postfix }}
# Push CI images still need to be re-built daily
-
name: Build and push (for Push CI) on a daily basis
# This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
# The latter case is useful for manual image building for debugging purposes. Use another tag in this case!
if: inputs.image_postfix != '-push-ci'
uses: docker/build-push-action@v5
with:
context: ./docker/transformers-pytorch-amd-gpu
build-args: |
REF=main
push: true
tags: huggingface/transformers-pytorch-amd-gpu-push-ci
- name: Post to Slack
if: always()
uses: huggingface/hf-workflows/.github/actions/post-slack@main
with:
slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
title: 🤗 Results of the huggingface/transformers-pytorch-amd-gpu build
title: 🤗 Results of the huggingface/transformers-pytorch-amd-gpu-push-ci build
status: ${{ job.status }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
cache-latest-pytorch-amd:
name: "Cache Latest Pytorch (AMD) Image"
needs: latest-pytorch-amd
runs-on:
group: amd-mi325-1gpu
steps:
-
name: Login to DockerHub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_PASSWORD }}
-
name: Pull and save docker image to cache
run: |
image="huggingface/transformers-pytorch-amd-gpu"
final_path="/mnt/image-cache/transformers-pytorch-amd-gpu.tar"
tmp_path="${final_path}.tmp"
echo "Pulling image: ${image}"
docker pull "${image}"
echo "Saving to temp file: ${tmp_path}"
docker save "${image}" -o "${tmp_path}"
echo "Moving to final path: ${final_path}"
mv -f "${tmp_path}" "${final_path}"
echo "Cache populated successfully at ${final_path}"
latest-pytorch-deepspeed-amd:
name: "PyTorch + DeepSpeed (AMD) [dev]"
runs-on:
@ -257,6 +294,19 @@ jobs:
REF=main
push: true
tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }}
# Push CI images still need to be re-built daily
-
name: Build and push (for Push CI) on a daily basis
# This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
# The latter case is useful for manual image building for debugging purposes. Use another tag in this case!
if: inputs.image_postfix != '-push-ci'
uses: docker/build-push-action@v5
with:
context: ./docker/transformers-pytorch-deepspeed-amd-gpu
build-args: |
REF=main
push: true
tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci
- name: Post to Slack
if: always()
@ -269,6 +319,8 @@ jobs:
latest-quantization-torch-docker:
name: "Latest Pytorch + Quantization [dev]"
# Push CI doesn't need this image
if: inputs.image_postfix != '-push-ci'
runs-on:
group: aws-general-8-plus
steps:

View File

@ -1,23 +0,0 @@
---
name: Check Permissions Advisor
on:
workflow_dispatch:
inputs:
workflow_name:
description: 'Workflow file name'
type: string
run_count:
description: 'Number of runs to analyze'
type: string
default: "10"
jobs:
advisor:
uses: huggingface/security-workflows/.github/workflows/permissions-advisor-reusable.yml@main
permissions:
actions: read
contents: read
with:
workflow_name: ${{ inputs.workflow_name }}
run_count: ${{ fromJSON(inputs.run_count) }}

View File

@ -6,6 +6,9 @@ on:
docker:
required: true
type: string
start_sha:
required: true
type: string
job:
required: true
type: string
@ -21,13 +24,7 @@ on:
commit_sha:
required: false
type: string
pr_number:
required: false
type: string
outputs:
report:
description: "Content of the report of new failures"
value: ${{ jobs.process_new_failures_with_commit_info.outputs.report }}
env:
HF_HOME: /mnt/cache
@ -44,14 +41,9 @@ env:
jobs:
check_new_failures:
name: "Find commits for new failing tests"
strategy:
matrix:
run_idx: [1]
name: " "
runs-on:
group: aws-g5-4xlarge-cache
outputs:
process: ${{ steps.check_file.outputs.process }}
container:
image: ${{ inputs.docker }}
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@ -62,19 +54,14 @@ jobs:
path: /transformers/ci_results_${{ inputs.job }}
- name: Check file
id: check_file
working-directory: /transformers
env:
job: ${{ inputs.job }}
run: |
if [ -f "ci_results_${job}/new_failures.json" ]; then
echo "\`ci_results_${job}/new_failures.json\` exists, continue ..."
if [ -f ci_results_${{ inputs.job }}/new_failures.json ]; then
echo "`ci_results_${{ inputs.job }}/new_failures.json` exists, continue ..."
echo "process=true" >> $GITHUB_ENV
echo "process=true" >> $GITHUB_OUTPUT
else
echo "\`ci_results_${job}/new_failures.json\` doesn't exist, abort."
echo "`ci_results_${{ inputs.job }}/new_failures.json` doesn't exist, abort."
echo "process=false" >> $GITHUB_ENV
echo "process=false" >> $GITHUB_OUTPUT
fi
- uses: actions/download-artifact@v4
@ -93,62 +80,27 @@ jobs:
echo "PREV_WORKFLOW_RUN_ID=" >> $GITHUB_ENV
fi
if [ -f setup_values/other_workflow_run_id.txt ]; then
echo "OTHER_WORKFLOW_RUN_ID=$(cat setup_values/other_workflow_run_id.txt)" >> $GITHUB_ENV
else
echo "OTHER_WORKFLOW_RUN_ID=" >> $GITHUB_ENV
fi
- name: Update clone
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: |
git fetch origin "$commit_sha" && git checkout "$commit_sha"
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Get `START_SHA`
- name: Get target commit
working-directory: /transformers/utils
if: ${{ env.process == 'true' }}
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: |
echo "START_SHA=$commit_sha" >> $GITHUB_ENV
echo "END_SHA=$(TOKEN=${{ secrets.ACCESS_REPO_INFO_TOKEN }} python3 -c 'import os; from get_previous_daily_ci import get_last_daily_ci_run_commit; commit=get_last_daily_ci_run_commit(token=os.environ["TOKEN"], workflow_run_id=os.environ["PREV_WORKFLOW_RUN_ID"]); print(commit)')" >> $GITHUB_ENV
# This is used if the CI is triggered from a pull request `self-comment-ci.yml` (after security check is verified)
- name: Extract the base commit on `main` (of the merge commit created by Github) if it is a PR
id: pr_info
if: ${{ env.process == 'true' && inputs.pr_number != '' }}
uses: actions/github-script@v6
with:
script: |
const { data: pr } = await github.rest.pulls.get({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: ${{ inputs.pr_number }}
});
const { data: merge_commit } = await github.rest.repos.getCommit({
owner: pr.base.repo.owner.login,
repo: pr.base.repo.name,
ref: '${{ inputs.commit_sha }}',
});
core.setOutput('merge_commit_base_sha', merge_commit.parents[0].sha);
# Usually, `END_SHA` should be the commit of the last previous workflow run of the **SAME** (scheduled) workflow.
# (This is why we don't need to specify `workflow_id` which would be fetched automatically in the python script.)
- name: Get `END_SHA` from previous CI runs of the same workflow
working-directory: /transformers/utils
if: ${{ env.process == 'true' && inputs.pr_number == '' }}
env:
ACCESS_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
run: |
echo "END_SHA=$(TOKEN="$ACCESS_TOKEN" python3 -c 'import os; from get_previous_daily_ci import get_last_daily_ci_run_commit; commit=get_last_daily_ci_run_commit(token=os.environ["TOKEN"], workflow_run_id=os.environ["PREV_WORKFLOW_RUN_ID"]); print(commit)')" >> $GITHUB_ENV
# However, for workflow runs triggered by `issue_comment` (for pull requests), we want to check against the
# parent commit (on `main`) of the `merge_commit` (dynamically created by GitHub). In this case, the goal is to
# see if a reported failing test is actually ONLY failing on the `merge_commit`.
- name: Set `END_SHA`
if: ${{ env.process == 'true' && inputs.pr_number != '' }}
env:
merge_commit_base_sha: ${{ steps.pr_info.outputs.merge_commit_base_sha }}
run: |
echo "END_SHA=$merge_commit_base_sha" >> $GITHUB_ENV
- name: Checkout to `start_sha`
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: git fetch && git checkout ${{ inputs.start_sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -166,10 +118,6 @@ jobs:
run: |
python3 utils/print_env.py
- name: Install pytest-flakefinder
if: ${{ env.process == 'true' }}
run: python3 -m pip install pytest-flakefinder
- name: Show installed libraries and their versions
working-directory: /transformers
if: ${{ env.process == 'true' }}
@ -178,78 +126,37 @@ jobs:
- name: Check failed tests
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
job: ${{ inputs.job }}
run_idx: ${{ matrix.run_idx }}
run: python3 utils/check_bad_commit.py --start_commit "$START_SHA" --end_commit "$END_SHA" --file "ci_results_${job}/new_failures.json" --output_file "new_failures_with_bad_commit_${job}_${run_idx}.json"
run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit.json
- name: Show results
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
job: ${{ inputs.job }}
run_idx: ${{ matrix.run_idx }}
run: |
ls -l "new_failures_with_bad_commit_${job}_${run_idx}.json"
cat "new_failures_with_bad_commit_${job}_${run_idx}.json"
ls -l new_failures_with_bad_commit.json
cat new_failures_with_bad_commit.json
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}
path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
process_new_failures_with_commit_info:
name: "process bad commit reports"
needs: check_new_failures
if: needs.check_new_failures.outputs.process == 'true'
runs-on:
group: aws-g5-4xlarge-cache
outputs:
report: ${{ steps.set_output.outputs.report }}
container:
image: ${{ inputs.docker }}
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- uses: actions/download-artifact@v4
with:
name: ci_results_${{ inputs.job }}
path: /transformers/ci_results_${{ inputs.job }}
- uses: actions/download-artifact@v4
with:
pattern: new_failures_with_bad_commit_${{ inputs.job }}*
path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}
merge-multiple: true
- name: Check files
- name: Checkout back
working-directory: /transformers
env:
job: ${{ inputs.job }}
if: ${{ env.process == 'true' }}
run: |
ls -la /transformers
ls -la "/transformers/new_failures_with_bad_commit_${job}"
# Currently, we only run with a single runner by using `run_idx: [1]`. We might try to run with multiple runners
# to further reduce false positives caused by flaky tests, which requires further processing to merge reports.
- name: Merge files
shell: bash
working-directory: /transformers
env:
job: ${{ inputs.job }}
run: |
cp "/transformers/new_failures_with_bad_commit_${job}/new_failures_with_bad_commit_${job}_1.json" new_failures_with_bad_commit.json
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: |
git fetch origin "$commit_sha" && git checkout "$commit_sha"
git checkout ${{ inputs.start_sha }}
- name: Process report
shell: bash
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
JOB_NAME: ${{ inputs.job }}
REPORT_REPO_ID: ${{ inputs.report_repo_id }}
run: |
python3 utils/process_bad_commit_report.py
- name: Process report
shell: bash
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
@ -262,40 +169,15 @@ jobs:
echo EOF
} >> "$GITHUB_ENV"
# The output is useful if a caller needs more processing, for example, we have a chain
# self-comment-ci.yml -> self-scheduled.yml -> this one (check_failed_tests.yml),
# and `self-comment-ci.yml` needs further processing before sending a GitHub comment to the pull request page.
- name: Show results & Set outputs
id: set_output
working-directory: /transformers
run: |
ls -l new_failures_with_bad_commit.json
cat new_failures_with_bad_commit.json
{
echo 'report<<EOF'
cat new_failures_with_bad_commit.json
echo '' # Force a newline
echo EOF
} >> "$GITHUB_OUTPUT"
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: new_failures_with_bad_commit_${{ inputs.job }}
path: /transformers/new_failures_with_bad_commit.json
- name: Prepare Slack report title
working-directory: /transformers
env:
ci_event: ${{ inputs.ci_event }}
job: ${{ inputs.job }}
if: ${{ env.process == 'true' }}
run: |
pip install slack_sdk
echo "title=$(python3 -c 'import sys; import os; sys.path.append("utils"); from utils.notification_service import job_to_test_map; ci_event = os.environ["ci_event"]; job = os.environ["job"]; test_name = job_to_test_map[job]; title = f"New failed tests of {ci_event}" + ":" + f" {test_name}"; print(title)')" >> $GITHUB_ENV
echo "title=$(python3 -c 'import sys; sys.path.append("utils"); from utils.notification_service import job_to_test_map; ci_event = "${{ inputs.ci_event }}"; job = "${{ inputs.job }}"; test_name = job_to_test_map[job]; title = f"New failed tests of {ci_event}" + ":" + f" {test_name}"; print(title)')" >> $GITHUB_ENV
- name: Send processed report
if: ${{ !endsWith(env.REPORT_TEXT, '{}') }}
if: ${{ env.process == 'true' && !endsWith(env.REPORT_TEXT, '{}') }}
uses: slackapi/slack-github-action@6c661ce58804a1a20f6dc5fbee7f0381b469e001
with:
# Slack channel id, channel name, or user id to post message.

View File

@ -1,22 +0,0 @@
---
name: CodeQL Security Analysis
on:
push:
branches: ["main", "fix_security_issue_*"]
# pull_request:
# branches: ["main"]
workflow_dispatch:
jobs:
codeql:
name: CodeQL Analysis
uses: huggingface/security-workflows/.github/workflows/codeql-reusable.yml@main
permissions:
security-events: write
packages: read
actions: read
contents: read
with:
languages: '["actions"]'
queries: 'security-extended,security-and-quality'

View File

@ -39,9 +39,6 @@ on:
PR_MERGE_COMMIT_SHA:
description: "The sha of the merge commit for the pull request (created by GitHub) in the base repository"
value: ${{ jobs.get-pr-info.outputs.PR_MERGE_COMMIT_SHA }}
PR_MERGE_COMMIT_BASE_SHA:
description: "The sha of the parent commit of the the merge commit on the target branch in the base repository"
value: ${{ jobs.get-pr-info.outputs.PR_MERGE_COMMIT_BASE_SHA }}
PR_HEAD_COMMIT_DATE:
description: "The date of the head sha of the pull request branch in the head repository"
value: ${{ jobs.get-pr-info.outputs.PR_HEAD_COMMIT_DATE }}
@ -77,7 +74,6 @@ jobs:
PR_BASE_REF: ${{ steps.pr_info.outputs.base_ref }}
PR_HEAD_SHA: ${{ steps.pr_info.outputs.head_sha }}
PR_BASE_SHA: ${{ steps.pr_info.outputs.base_sha }}
PR_MERGE_COMMIT_BASE_SHA: ${{ steps.pr_info.outputs.merge_commit_base_sha }}
PR_MERGE_COMMIT_SHA: ${{ steps.pr_info.outputs.merge_commit_sha }}
PR_HEAD_COMMIT_DATE: ${{ steps.pr_info.outputs.head_commit_date }}
PR_MERGE_COMMIT_DATE: ${{ steps.pr_info.outputs.merge_commit_date }}
@ -126,7 +122,6 @@ jobs:
core.setOutput('base_ref', pr.base.ref);
core.setOutput('head_sha', pr.head.sha);
core.setOutput('base_sha', pr.base.sha);
core.setOutput('merge_commit_base_sha', merge_commit.parents[0].sha);
core.setOutput('merge_commit_sha', pr.merge_commit_sha);
core.setOutput('pr', pr);
@ -147,21 +142,16 @@ jobs:
date: merge_commit.commit.committer.date
});
console.log('PR Info:', {
pr_info: pr
});
- name: Convert dates to timestamps
id: get_timestamps
env:
head_commit_date: ${{ steps.pr_info.outputs.head_commit_date }}
merge_commit_date: ${{ steps.pr_info.outputs.merge_commit_date }}
run: |
echo "$head_commit_date"
echo "$merge_commit_date"
head_commit_date=${{ steps.pr_info.outputs.head_commit_date }}
merge_commit_date=${{ steps.pr_info.outputs.merge_commit_date }}
echo $head_commit_date
echo $merge_commit_date
head_commit_timestamp=$(date -d "$head_commit_date" +%s)
merge_commit_timestamp=$(date -d "$merge_commit_date" +%s)
echo "$head_commit_timestamp"
echo "$merge_commit_timestamp"
echo $head_commit_timestamp
echo $merge_commit_timestamp
echo "head_commit_timestamp=$head_commit_timestamp" >> $GITHUB_OUTPUT
echo "merge_commit_timestamp=$merge_commit_timestamp" >> $GITHUB_OUTPUT
echo "merge_commit_timestamp=$merge_commit_timestamp" >> $GITHUB_OUTPUT

View File

@ -15,19 +15,13 @@ jobs:
steps:
- name: Get PR number
shell: bash
env:
issue_number: ${{ github.event.issue.number }}
is_pull_request_issue: ${{ github.event.issue.pull_request != null }}
pr_number: ${{ github.event.pull_request.number }}
is_pull_request: ${{ github.event.pull_request != null }}
event_number: ${{ github.event.number }}
run: |
if [[ "$issue_number" != "" && "$is_pull_request_issue" == "true" ]]; then
echo "PR_NUMBER=$issue_number" >> $GITHUB_ENV
elif [[ "$pr_number" != "" ]]; then
echo "PR_NUMBER=$pr_number" >> $GITHUB_ENV
elif [[ "$is_pull_request" == "true" ]]; then
echo "PR_NUMBER=$event_number" >> $GITHUB_ENV
if [[ "${{ github.event.issue.number }}" != "" && "${{ github.event.issue.pull_request }}" != "" ]]; then
echo "PR_NUMBER=${{ github.event.issue.number }}" >> $GITHUB_ENV
elif [[ "${{ github.event.pull_request.number }}" != "" ]]; then
echo "PR_NUMBER=${{ github.event.pull_request.number }}" >> $GITHUB_ENV
elif [[ "${{ github.event.pull_request }}" != "" ]]; then
echo "PR_NUMBER=${{ github.event.number }}" >> $GITHUB_ENV
else
echo "PR_NUMBER=" >> $GITHUB_ENV
fi
@ -35,8 +29,8 @@ jobs:
- name: Check PR number
shell: bash
run: |
echo "$PR_NUMBER"
echo "${{ env.PR_NUMBER }}"
- name: Set PR number
id: set_pr_number
run: echo "PR_NUMBER=$PR_NUMBER" >> "$GITHUB_OUTPUT"
run: echo "PR_NUMBER=${{ env.PR_NUMBER }}" >> "$GITHUB_OUTPUT"

View File

@ -28,9 +28,6 @@ on:
report_repo_id:
required: false
type: string
pytest_marker:
required: false
type: string
env:
HF_HOME: /mnt/cache
@ -62,33 +59,25 @@ jobs:
steps:
- name: Echo input and matrix info
shell: bash
env:
folder_slices: ${{ inputs.folder_slices }}
matrix_folders: ${{ matrix.folders }}
slice_data: ${{ toJson(fromJson(inputs.folder_slices)[inputs.slice_id]) }}
run: |
echo "$folder_slices"
echo "$matrix_folders"
echo "$slice_data"
echo "${{ inputs.folder_slices }}"
echo "${{ matrix.folders }}"
echo "${{ toJson(fromJson(inputs.folder_slices)[inputs.slice_id]) }}"
- name: Echo folder ${{ matrix.folders }}
shell: bash
# For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to
# set the artifact folder names (because the character `/` is not allowed).
env:
matrix_folders_raw: ${{ matrix.folders }}
run: |
echo "$matrix_folders_raw"
matrix_folders="${matrix_folders_raw/'models/'/'models_'}"
echo "${{ matrix.folders }}"
matrix_folders=${{ matrix.folders }}
matrix_folders=${matrix_folders/'models/'/'models_'}
echo "$matrix_folders"
echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: |
git fetch origin "$commit_sha" && git checkout "$commit_sha"
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -123,17 +112,15 @@ jobs:
id: set_machine_type
working-directory: /transformers
shell: bash
env:
input_machine_type: ${{ inputs.machine_type }}
run: |
echo "$input_machine_type"
echo "${{ inputs.machine_type }}"
if [ "$input_machine_type" = "aws-g5-4xlarge-cache" ]; then
if [ "${{ inputs.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "$input_machine_type" = "aws-g5-12xlarge-cache" ]; then
elif [ "${{ inputs.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type="$input_machine_type"
machine_type=${{ inputs.machine_type }}
fi
echo "$machine_type"
@ -142,21 +129,15 @@ jobs:
- name: Create report directory if it doesn't exist
shell: bash
env:
report_name_prefix: ${{ inputs.report_name_prefix }}
run: |
mkdir -p "/transformers/reports/${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports"
echo "dummy" > "/transformers/reports/${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports/dummy.txt"
ls -la "/transformers/reports/${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports"
mkdir -p /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports
echo "dummy" > /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports/dummy.txt
ls -la /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports
- name: Run all tests on GPU
working-directory: /transformers
env:
report_name_prefix: ${{ inputs.report_name_prefix }}
pytest_marker: ${{ inputs.pytest_marker }}
model: ${{ matrix.folders }}
run: |
script -q -c "PATCH_TESTING_METHODS_TO_COLLECT_OUTPUTS=yes _PATCHED_TESTING_METHODS_OUTPUT_DIR=/transformers/reports/${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports python3 -m pytest -rsfE -v -m '${pytest_marker}' --make-reports=${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports tests/${model}" test_outputs.txt
script -q -c "PATCH_TESTING_METHODS_TO_COLLECT_OUTPUTS=yes _PATCHED_TESTING_METHODS_OUTPUT_DIR=/transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports python3 -m pytest -rsfE -v --make-reports=${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports tests/${{ matrix.folders }}" test_outputs.txt
ls -la
# Extract the exit code from the output file
EXIT_CODE=$(tail -1 test_outputs.txt | grep -o 'COMMAND_EXIT_CODE="[0-9]*"' | cut -d'"' -f2)
@ -167,25 +148,19 @@ jobs:
# This step is only to show information on Github Actions log.
# Always mark this step as successful, even if the report directory or the file `failures_short.txt` in it doesn't exist
continue-on-error: true
env:
report_name_prefix: ${{ inputs.report_name_prefix }}
run: cat "/transformers/reports/${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports/failures_short.txt"
run: cat /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports/failures_short.txt
- name: Captured information
if: ${{ failure() }}
continue-on-error: true
env:
report_name_prefix: ${{ inputs.report_name_prefix }}
run: |
cat "/transformers/reports/${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports/captured_info.txt"
cat /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports/captured_info.txt
- name: Copy test_outputs.txt
if: ${{ always() }}
continue-on-error: true
env:
report_name_prefix: ${{ inputs.report_name_prefix }}
run: |
cp /transformers/test_outputs.txt "/transformers/reports/${machine_type}_${report_name_prefix}_${matrix_folders}_test_reports"
cp /transformers/test_outputs.txt /transformers/reports/${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports
- name: "Test suite reports artifacts: ${{ env.machine_type }}_${{ inputs.report_name_prefix }}_${{ env.matrix_folders }}_test_reports"
if: ${{ always() }}
@ -196,7 +171,7 @@ jobs:
collated_reports:
name: Collated Reports
if: ${{ always() && inputs.runner_type != '' }}
if: ${{ always() }}
needs: run_models_gpu
uses: huggingface/transformers/.github/workflows/collated-reports.yml@main
with:

View File

@ -98,7 +98,7 @@ jobs:
commit_sha: ${{ needs.get-pr-info.outputs.PR_HEAD_SHA }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
package: transformers
languages: ar de en es fr hi it ja ko pt zh
languages: ar de en es fr hi it ko pt tr zh ja te
update_run_status:
name: Update Check Run Status

View File

@ -1,4 +1,4 @@
name: PR slow CI - Suggestion
name: PR slow CI
on:
pull_request_target:
types: [opened, synchronize, reopened]
@ -23,28 +23,11 @@ jobs:
outputs:
jobs: ${{ steps.get_jobs.outputs.jobs_to_run }}
steps:
# This checkout to the main branch
- uses: actions/checkout@v4
with:
fetch-depth: "0"
# We need to use `${{ ... }}` here to avoid `Argument list too long` error when a PR changes a lot of files.
# (We could also try to use artifact approach, but it's more involved).
# `CodeQL` doesn't identify any security issue here. Also `PR_FILES` is from `get-pr-info.yml` by using an api
# `github.rest.pulls.listFiles`, which is fine.
- name: Write pr_files file
run: |
cat > pr_files.txt << 'EOF'
${{ needs.get-pr-info.outputs.PR_FILES }}
EOF
- name: Get repository content
id: repo_content
uses: actions/github-script@v6
with:
script: |
const fs = require('node:fs');
const { data: tests_dir } = await github.rest.repos.getContent({
owner: '${{ needs.get-pr-info.outputs.PR_HEAD_REPO_OWNER }}',
repo: '${{ needs.get-pr-info.outputs.PR_HEAD_REPO_NAME }}',
@ -66,10 +49,38 @@ jobs:
ref: '${{ needs.get-pr-info.outputs.PR_HEAD_SHA }}',
});
// Write to files instead of outputs
fs.writeFileSync('tests_dir.txt', JSON.stringify(tests_dir, null, 2));
fs.writeFileSync('tests_models_dir.txt', JSON.stringify(tests_models_dir, null, 2));
fs.writeFileSync('tests_quantization_dir.txt', JSON.stringify(tests_quantization_dir, null, 2));
core.setOutput('tests_dir', tests_dir);
core.setOutput('tests_models_dir', tests_models_dir);
core.setOutput('tests_quantization_dir', tests_quantization_dir);
# This checkout to the main branch
- uses: actions/checkout@v4
with:
fetch-depth: "0"
- name: Write pr_files file
run: |
cat > pr_files.txt << 'EOF'
${{ needs.get-pr-info.outputs.PR_FILES }}
EOF
- name: Write tests_dir file
run: |
cat > tests_dir.txt << 'EOF'
${{ steps.repo_content.outputs.tests_dir }}
EOF
- name: Write tests_models_dir file
run: |
cat > tests_models_dir.txt << 'EOF'
${{ steps.repo_content.outputs.tests_models_dir }}
EOF
- name: Write tests_quantization_dir file
run: |
cat > tests_quantization_dir.txt << 'EOF'
${{ steps.repo_content.outputs.tests_quantization_dir }}
EOF
- name: Run script to get jobs to run
id: get_jobs

View File

@ -149,9 +149,9 @@ jobs:
with:
job: run_models_gpu
slack_report_channel: "#transformers-ci-push"
docker: huggingface/transformers-all-latest-gpu:flash-attn
docker: huggingface/transformers-all-latest-gpu
ci_event: push
report_repo_id: hf-internal-testing/transformers_ci_push
commit_sha: ${{ github.sha }}
subdirs: ${{ needs.get_modified_models.outputs.matrix }}
models: ${{ needs.get_modified_models.outputs.matrix }}
secrets: inherit

View File

@ -23,34 +23,62 @@ env:
TF_FORCE_GPU_ALLOW_GROWTH: true
CUDA_VISIBLE_DEVICES: 0,1
jobs:
get-pr-number:
runs-on: ubuntu-22.04
name: Get PR number
# For security: only allow team members to run
if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
uses: ./.github/workflows/get-pr-number.yml
outputs:
PR_NUMBER: ${{ steps.set_pr_number.outputs.PR_NUMBER }}
steps:
- name: Get PR number
shell: bash
run: |
if [[ "${{ github.event.issue.number }}" != "" && "${{ github.event.issue.pull_request }}" != "" ]]; then
echo "PR_NUMBER=${{ github.event.issue.number }}" >> $GITHUB_ENV
else
echo "PR_NUMBER=" >> $GITHUB_ENV
fi
get-pr-info:
name: Get PR commit SHA
- name: Check PR number
shell: bash
run: |
echo "${{ env.PR_NUMBER }}"
- name: Set PR number
id: set_pr_number
run: echo "PR_NUMBER=${{ env.PR_NUMBER }}" >> "$GITHUB_OUTPUT"
get-sha:
runs-on: ubuntu-22.04
needs: get-pr-number
if: ${{ needs.get-pr-number.outputs.PR_NUMBER != ''}}
uses: ./.github/workflows/get-pr-info.yml
with:
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
check-timestamps:
name: Check timestamps (security check)
runs-on: ubuntu-22.04
needs: get-pr-info
outputs:
PR_HEAD_SHA: ${{ needs.get-pr-info.outputs.PR_HEAD_SHA }}
PR_MERGE_SHA: ${{ needs.get-pr-info.outputs.PR_MERGE_COMMIT_SHA }}
PR_HEAD_SHA: ${{ steps.get_sha.outputs.PR_HEAD_SHA }}
PR_MERGE_SHA: ${{ steps.get_sha.outputs.PR_MERGE_SHA }}
steps:
- name: Verify `merge_commit` timestamp is older than the issue comment timestamp
- uses: actions/checkout@v4
with:
fetch-depth: "0"
ref: "refs/pull/${{needs.get-pr-number.outputs.PR_NUMBER}}/merge"
- name: Get SHA (and verify timestamps against the issue comment date)
id: get_sha
env:
PR_NUMBER: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
COMMENT_DATE: ${{ github.event.comment.created_at }}
PR_MERGE_COMMIT_TIMESTAMP: ${{ needs.get-pr-info.outputs.PR_MERGE_COMMIT_TIMESTAMP }}
run: |
git fetch origin refs/pull/$PR_NUMBER/head:refs/remotes/pull/$PR_NUMBER/head
git checkout refs/remotes/pull/$PR_NUMBER/head
echo "PR_HEAD_SHA: $(git log -1 --format=%H)"
echo "PR_HEAD_SHA=$(git log -1 --format=%H)" >> "$GITHUB_OUTPUT"
git fetch origin refs/pull/$PR_NUMBER/merge:refs/remotes/pull/$PR_NUMBER/merge
git checkout refs/remotes/pull/$PR_NUMBER/merge
echo "PR_MERGE_SHA: $(git log -1 --format=%H)"
echo "PR_MERGE_SHA=$(git log -1 --format=%H)" >> "$GITHUB_OUTPUT"
PR_MERGE_COMMIT_TIMESTAMP=$(git log -1 --date=unix --format=%cd)
echo "PR_MERGE_COMMIT_TIMESTAMP: $PR_MERGE_COMMIT_TIMESTAMP"
COMMENT_TIMESTAMP=$(date -d "${COMMENT_DATE}" +"%s")
echo "COMMENT_DATE: $COMMENT_DATE"
echo "COMMENT_TIMESTAMP: $COMMENT_TIMESTAMP"
@ -59,10 +87,13 @@ jobs:
exit -1;
fi
# use a python script to handle this complex logic.
# use a python script to handle this complex logic
# case 1: `run-slow` (auto. infer with limited number of models, but in particular, new model)
# case 2: `run-slow model_1, model_2`
get-tests:
runs-on: ubuntu-22.04
needs: [get-pr-number, check-timestamps]
needs: [get-pr-number, get-sha]
if: ${{ needs.get-pr-number.outputs.PR_NUMBER != ''}}
outputs:
models: ${{ steps.models_to_run.outputs.models }}
quantizations: ${{ steps.models_to_run.outputs.quantizations }}
@ -70,11 +101,11 @@ jobs:
- uses: actions/checkout@v4
with:
fetch-depth: "0"
ref: "refs/pull/${{ needs.get-pr-number.outputs.PR_NUMBER }}/merge"
ref: "refs/pull/${{needs.get-pr-number.outputs.PR_NUMBER}}/merge"
- name: Verify merge commit SHA
env:
VERIFIED_PR_MERGE_SHA: ${{ needs.check-timestamps.outputs.PR_MERGE_SHA }}
VERIFIED_PR_MERGE_SHA: ${{ needs.get-sha.outputs.PR_MERGE_SHA }}
run: |
PR_MERGE_SHA=$(git log -1 --format=%H)
if [ $PR_MERGE_SHA != $VERIFIED_PR_MERGE_SHA ]; then
@ -95,33 +126,11 @@ jobs:
- name: Show models to test
id: models_to_run
run: |
echo "$models"
echo "models=$models" >> $GITHUB_OUTPUT
echo "$quantizations"
echo "quantizations=$quantizations" >> $GITHUB_OUTPUT
# Report back if we are not able to get the tests (for example, security check is failing)
report_error_earlier:
name: Report error earlier
if: ${{ always() && needs.get-pr-info.result == 'success' && needs.get-tests.result != 'success' }}
needs: [get-pr-number, get-pr-info, get-tests]
permissions:
pull-requests: write
runs-on: ubuntu-22.04
steps:
- name: Reply to the comment
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
github_repository: ${{ github.repository }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
run: |
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"repos/${github_repository}/issues/${pr_number}/comments" \
-f body="💔 This comment contains \`run-slow\`, but unknown error occurred and [the workflow run]($GITHUB_RUN_URL) aborted!"
echo "${{ env.models }}"
echo "models=${{ env.models }}" >> $GITHUB_ENV
echo "models=${{ env.models }}" >> $GITHUB_OUTPUT
echo "${{ env.quantizations }}"
echo "quantizations=${{ env.quantizations }}" >> $GITHUB_OUTPUT
reply_to_comment:
name: Reply to the comment
@ -134,20 +143,20 @@ jobs:
- name: Reply to the comment
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
BODY: '\n\nmodels: ${{ needs.get-tests.outputs.models }}\nquantizations: ${{ needs.get-tests.outputs.quantizations }}'
github_repository: ${{ github.repository }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
MODELS: ${{ needs.get-tests.outputs.models }}
BODY: "\n\nmodels: ${{ needs.get-tests.outputs.models }}\nquantizations: ${{ needs.get-tests.outputs.quantizations }}"
run: |
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"repos/${github_repository}/issues/${pr_number}/comments" \
-f body="This comment contains \`run-slow\`, running the specified jobs: $(echo -e "$BODY")"
repos/${{ github.repository }}/issues/${{ needs.get-pr-number.outputs.PR_NUMBER }}/comments \
-f "body=This comment contains run-slow, running the specified jobs: ${{ env.BODY }} ..."
create_run:
name: Create run
needs: [check-timestamps, reply_to_comment]
if: ${{ needs.get-tests.outputs.models != '[]' || needs.get-tests.outputs.quantizations != '[]' }}
needs: [get-sha, get-tests, reply_to_comment]
permissions:
statuses: write
runs-on: ubuntu-22.04
@ -159,196 +168,248 @@ jobs:
# Create a commit status (pending) for a run of this workflow. The status has to be updated later in `update_run_status`.
# See https://docs.github.com/en/rest/commits/statuses?apiVersion=2022-11-28#create-a-commit-status
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
github_repository: ${{ github.repository }}
pr_head_sha: ${{ needs.check-timestamps.outputs.PR_HEAD_SHA }}
run: |
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"repos/${github_repository}/statuses/${pr_head_sha}" \
repos/${{ github.repository }}/statuses/${{ needs.get-sha.outputs.PR_HEAD_SHA }} \
-f "target_url=$GITHUB_RUN_URL" -f "state=pending" -f "description=Slow CI job" -f "context=pytest/custom-tests"
model-ci:
name: Model CI
run_models_gpu:
name: Run all tests for the model
if: ${{ needs.get-tests.outputs.models != '[]' }}
uses: ./.github/workflows/self-scheduled.yml
needs: [get-pr-number, check-timestamps, get-tests, create_run]
with:
job: run_models_gpu
slack_report_channel: "#transformers-ci-pr"
docker: huggingface/transformers-all-latest-gpu
ci_event: PR Comment CI
report_repo_id: hf-internal-testing/transformers_pr_ci
commit_sha: ${{ needs.check-timestamps.outputs.PR_MERGE_SHA }}
subdirs: ${{ needs.get-tests.outputs.models }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
secrets: inherit
needs: [get-pr-number, get-sha, get-tests, create_run]
strategy:
fail-fast: false
matrix:
folders: ${{ fromJson(needs.get-tests.outputs.models) }}
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-all-latest-gpu
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Echo input and matrix info
shell: bash
run: |
echo "${{ matrix.folders }}"
quantization-ci:
name: Quantization CI
- name: Echo folder ${{ matrix.folders }}
shell: bash
# For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to
# set the artifact folder names (because the character `/` is not allowed).
run: |
echo "${{ matrix.folders }}"
matrix_folders=${{ matrix.folders }}
matrix_folders=${matrix_folders/'models/'/'models_'}
echo "$matrix_folders"
echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV
- name: Checkout to PR merge commit
working-directory: /transformers
run: |
git fetch origin refs/pull/${{ needs.get-pr-number.outputs.PR_NUMBER }}/merge:refs/remotes/pull/${{ needs.get-pr-number.outputs.PR_NUMBER }}/merge
git checkout refs/remotes/pull/${{ needs.get-pr-number.outputs.PR_NUMBER }}/merge
git log -1 --format=%H
- name: Verify merge commit SHA
env:
VERIFIED_PR_MERGE_SHA: ${{ needs.get-sha.outputs.PR_MERGE_SHA }}
working-directory: /transformers
run: |
PR_MERGE_SHA=$(git log -1 --format=%H)
if [ $PR_MERGE_SHA != $VERIFIED_PR_MERGE_SHA ]; then
echo "The merged commit SHA is not the same as the verified one! Security issue detected, abort the workflow!";
exit -1;
fi
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
- name: NVIDIA-SMI
run: |
nvidia-smi
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
run: |
echo "${{ matrix.machine_type }}"
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
echo "machine_type=$machine_type" >> $GITHUB_ENV
- name: Environment
working-directory: /transformers
run: |
python3 utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Run all tests on GPU
working-directory: /transformers
run: |
export CUDA_VISIBLE_DEVICES="$(python3 utils/set_cuda_devices_for_ci.py --test_folder ${{ matrix.folders }})"
echo $CUDA_VISIBLE_DEVICES
python3 -m pytest -v -rsfE --make-reports=${{ env.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /transformers/reports/${{ env.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports/failures_short.txt
- name: Make sure report directory exists
shell: bash
run: |
mkdir -p /transformers/reports/${{ env.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports
echo "hello" > /transformers/reports/${{ env.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports/hello.txt
echo "${{ env.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports"
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_models_gpu_${{ env.matrix_folders }}_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_models_gpu_${{ env.matrix_folders }}_test_reports
path: /transformers/reports/${{ env.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports
run_quantization_torch_gpu:
name: Run all tests for a quantization
if: ${{ needs.get-tests.outputs.quantizations != '[]' }}
uses: ./.github/workflows/self-scheduled.yml
needs: [get-pr-number, check-timestamps, get-tests, create_run]
with:
job: run_quantization_torch_gpu
slack_report_channel: "#transformers-ci-pr"
docker: huggingface/transformers-quantization-latest-gpu
ci_event: PR Comment CI
report_repo_id: hf-internal-testing/transformers_pr_ci
commit_sha: ${{ needs.check-timestamps.outputs.PR_MERGE_SHA }}
subdirs: ${{ needs.get-tests.outputs.quantizations }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
secrets: inherit
needs: [get-pr-number, get-sha, get-tests, create_run]
strategy:
fail-fast: false
matrix:
folders: ${{ fromJson(needs.get-tests.outputs.quantizations) }}
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-quantization-latest-gpu
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Echo folder ${{ matrix.folders }}
shell: bash
run: |
echo "${{ matrix.folders }}"
matrix_folders=${{ matrix.folders }}
matrix_folders=${matrix_folders/'quantization/'/'quantization_'}
echo "$matrix_folders"
echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV
report:
name: Check & Report
needs: [get-pr-number, check-timestamps, create_run, model-ci, quantization-ci]
- name: Checkout to PR merge commit
working-directory: /transformers
run: |
git fetch origin refs/pull/${{ needs.get-pr-number.outputs.PR_NUMBER }}/merge:refs/remotes/pull/${{ needs.get-pr-number.outputs.PR_NUMBER }}/merge
git checkout refs/remotes/pull/${{ needs.get-pr-number.outputs.PR_NUMBER }}/merge
git log -1 --format=%H
- name: Verify merge commit SHA
env:
VERIFIED_PR_MERGE_SHA: ${{ needs.get-sha.outputs.PR_MERGE_SHA }}
working-directory: /transformers
run: |
PR_MERGE_SHA=$(git log -1 --format=%H)
if [ $PR_MERGE_SHA != $VERIFIED_PR_MERGE_SHA ]; then
echo "The merged commit SHA is not the same as the verified one! Security issue detected, abort the workflow!";
exit -1;
fi
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
- name: NVIDIA-SMI
run: |
nvidia-smi
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
run: |
echo "${{ matrix.machine_type }}"
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
echo "machine_type=$machine_type" >> $GITHUB_ENV
- name: Environment
working-directory: /transformers
run: |
python3 utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Run quantization tests on GPU
working-directory: /transformers
run: |
python3 -m pytest -v --make-reports=${{ env.machine_type }}_run_quantization_torch_gpu_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /transformers/reports/${{ env.machine_type }}_run_quantization_torch_gpu_${{ matrix.folders }}_test_reports/failures_short.txt
- name: Make sure report directory exists
shell: bash
run: |
mkdir -p /transformers/reports/${{ env.machine_type }}_run_quantization_gpu_${{ matrix.folders }}_test_reports
echo "hello" > /transformers/reports/${{ env.machine_type }}_run_quantization_gpu_${{ matrix.folders }}_test_reports/hello.txt
echo "${{ env.machine_type }}_run_quantization_gpu_${{ matrix.folders }}_test_reports"
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_quantization_torch_gpu_${{ env.matrix_folders }}_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_quantization_torch_gpu_${{ env.matrix_folders }}_test_reports
path: /transformers/reports/${{ env.machine_type }}_run_quantization_torch_gpu_${{ matrix.folders }}_test_reports
update_run_status:
name: Update Check Run Status
needs: [get-sha, create_run, run_models_gpu, run_quantization_torch_gpu]
permissions:
pull-requests: write
statuses: write
if: ${{ always() && needs.create_run.result == 'success' }}
runs-on: ubuntu-22.04
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
STATUS_OK: ${{ contains(fromJSON('["skipped", "success"]'), needs.run_models_gpu.result) && contains(fromJSON('["skipped", "success"]'), needs.run_quantization_torch_gpu.result) }}
steps:
- name: Show reports from jobs
env:
MODEL_REPORT: ${{ needs.model-ci.outputs.report }}
QUANT_REPORT: ${{ needs.quantization-ci.outputs.report }}
- name: Get `run_models_gpu` job status
run: |
echo "$MODEL_REPORT"
echo "$QUANT_REPORT"
- name: Process and filter reports
env:
MODEL_REPORT: ${{ needs.model-ci.outputs.report }}
QUANT_REPORT: ${{ needs.quantization-ci.outputs.report }}
run: |
# Preprocess with Python
python3 << 'PYTHON_SCRIPT'
import json
import os
def filter_and_format_report(data):
"""
Filter out entries where commit is `None` (failing tests whose status is not certain) and format as text
"""
lines = []
for model, model_result in data.items():
model_lines = []
for device, failures in model_result.items():
# Keep only failures with a known commit (dicts whose 'commit' is not None)
surviving_failures = [
failure
for failure in failures
if isinstance(failure, dict) and failure.get('commit') is not None
]
# Add tests to model lines; link the model header to the first surviving failure's job
for idx, failure in enumerate(surviving_failures):
if idx == 0:
model_lines.append(f"- [{model}]({failure['job_link']}):")
model_lines.append(f"  {failure['test']}")
# Only add model section if it has tests
if len(model_lines) > 0:
lines.extend(model_lines)
lines.append("") # Empty line between models
return "\n".join(lines).strip()
# Load and filter reports
model_report_str = os.environ.get('MODEL_REPORT', '{}')
quant_report_str = os.environ.get('QUANT_REPORT', '{}')
model_report = json.loads(model_report_str) if model_report_str else {}
quant_report = json.loads(quant_report_str) if quant_report_str else {}
formatted_model = filter_and_format_report(model_report)
formatted_quant = filter_and_format_report(quant_report)
# Write to files
with open('model_ci.txt', 'w') as f:
f.write(formatted_model)
if formatted_model:
f.write('\n')
with open('quantization_ci.txt', 'w') as f:
f.write(formatted_quant)
if formatted_quant:
f.write('\n')
PYTHON_SCRIPT
- name: Post results as PR comment
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
github_repository: ${{ github.repository }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
model_ci_result: ${{ needs.model-ci.result }}
quantization_ci_result: ${{ needs.quantization-ci.result }}
run: |
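# Build the whole comment body in `comment_body.txt`, then post it to the PR with the GitHub API below.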
{
echo '## CI Results'
echo "[Workflow Run ⚙️]($GITHUB_RUN_URL)"
echo ''
# Check if both jobs were skipped or cancelled
if [[ "$model_ci_result" == "skipped" || "$model_ci_result" == "cancelled" ]] && \
[[ "$quantization_ci_result" == "skipped" || "$quantization_ci_result" == "cancelled" ]]; then
echo '⚠️ No tests reported (jobs were skipped or cancelled)!'
echo "STATUS=error" >> $GITHUB_ENV
# Check if either file has content
elif [ -s model_ci.txt ] || [ -s quantization_ci.txt ]; then
echo "STATUS=failure" >> $GITHUB_ENV
# Check if model_ci.txt has content
if [ -s model_ci.txt ]; then
echo '### Model CI Report'
echo ''
echo '#### ❌ Failed tests'
echo ''
cat model_ci.txt
echo ''
fi
# Check if quantization_ci.txt has content
if [ -s quantization_ci.txt ]; then
echo '### Quantization CI Report'
echo ''
echo '#### ❌ Failed tests'
echo ''
cat quantization_ci.txt
echo ''
fi
else
echo "STATUS=success" >> $GITHUB_ENV
echo '✅ No failing tests specific to this PR 🎉!'
fi
} > comment_body.txt
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"repos/${github_repository}/issues/${pr_number}/comments" \
-F body=@comment_body.txt
echo "${{ needs.run_models_gpu.result }}"
echo "${{ needs.run_quantization_torch_gpu.result }}"
echo $STATUS_OK
if [ "$STATUS_OK" = "true" ]; then
echo "STATUS=success" >> $GITHUB_ENV
else
echo "STATUS=failure" >> $GITHUB_ENV
fi
- name: Update PR commit statuses
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_RUN_URL: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
github_repository: ${{ github.repository }}
pr_head_sha: ${{ needs.check-timestamps.outputs.PR_HEAD_SHA }}
# The env. variable `STATUS` used here is set in the previous step
run: |
echo "${{ needs.run_models_gpu.result }}"
echo "${{ env.STATUS }}"
gh api \
--method POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"repos/${github_repository}/statuses/${pr_head_sha}" \
-f "target_url=$GITHUB_RUN_URL" -f "state=$STATUS" -f "description=Slow CI job" -f "context=pytest/custom-tests"
repos/${{ github.repository }}/statuses/${{ needs.get-sha.outputs.PR_HEAD_SHA }} \
-f "target_url=$GITHUB_RUN_URL" -f "state=${{ env.STATUS }}" -f "description=Slow CI job" -f "context=pytest/custom-tests"

View File

@ -51,7 +51,6 @@ jobs:
slack_report_channel: "#transformers-ci-past-future"
docker: huggingface/transformers-all-latest-torch-nightly-gpu
ci_event: Nightly CI
runner_type: "a10"
report_repo_id: hf-internal-testing/transformers_daily_ci_with_torch_nightly
commit_sha: ${{ github.event.workflow_run.head_sha || github.sha }}
secrets: inherit

View File

@ -0,0 +1,25 @@
name: Self-hosted runner (AMD mi210 CI caller)
on:
#workflow_run:
# workflows: ["Self-hosted runner (push-caller)"]
# branches: ["main"]
# types: [completed]
push:
branches:
- run_amd_push_ci_caller*
paths:
- "src/**"
- "tests/**"
- ".github/**"
- "templates/**"
- "utils/**"
jobs:
run_amd_ci:
name: AMD mi210
if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_push_ci_caller')))
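# Run either when invoked via `workflow_run` or on a direct push to a branch named run_amd_push_ci_caller*.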
uses: ./.github/workflows/self-push-amd.yml
with:
gpu_flavor: mi210
secrets: inherit

View File

@ -0,0 +1,25 @@
name: Self-hosted runner (AMD mi250 CI caller)
on:
#workflow_run:
# workflows: ["Self-hosted runner (push-caller)"]
# branches: ["main"]
# types: [completed]
push:
branches:
- run_amd_push_ci_caller*
paths:
- "src/**"
- "tests/**"
- ".github/**"
- "templates/**"
- "utils/**"
jobs:
run_amd_ci:
name: AMD mi250
if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_push_ci_caller')))
uses: ./.github/workflows/self-push-amd.yml
with:
gpu_flavor: mi250
secrets: inherit

334
.github/workflows/self-push-amd.yml vendored Normal file
View File

@ -0,0 +1,334 @@
name: Self-hosted runner AMD GPU (push)
on:
workflow_call:
inputs:
gpu_flavor:
required: true
type: string
env:
HF_HOME: /mnt/cache
TRANSFORMERS_IS_CI: yes
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
PYTEST_TIMEOUT: 60
TF_FORCE_GPU_ALLOW_GROWTH: true
HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
jobs:
check_runner_status:
name: Check Runner Status
runs-on: ubuntu-22.04
steps:
- name: Checkout transformers
uses: actions/checkout@v4
with:
fetch-depth: 2
- name: Check Runner Status
run: python utils/check_self_hosted_runner.py --target_runners amd-mi210-single-gpu-ci-runner-docker --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
check_runners:
name: Check Runners
needs: check_runner_status
strategy:
matrix:
machine_type: [single-gpu, multi-gpu]
runs-on: [self-hosted, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
container:
image: huggingface/transformers-pytorch-amd-gpu-push-ci # <--- We test only for PyTorch for now
options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: ROCM-SMI
run: |
rocm-smi
- name: ROCM-INFO
run: |
rocminfo | grep "Agent" -A 14
- name: Show ROCR environment
run: |
echo "ROCR: $ROCR_VISIBLE_DEVICES"
setup_gpu:
name: Setup
needs: check_runners
strategy:
matrix:
machine_type: [single-gpu, multi-gpu]
runs-on: [self-hosted, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
container:
image: huggingface/transformers-pytorch-amd-gpu-push-ci # <--- We test only for PyTorch for now
options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
test_map: ${{ steps.set-matrix.outputs.test_map }}
env:
# `CI_BRANCH_PUSH`: The branch name from the push event
# `CI_BRANCH_WORKFLOW_RUN`: The name of the branch on which this workflow is triggered by `workflow_run` event
# `CI_SHA_PUSH`: The commit SHA from the push event
# `CI_SHA_WORKFLOW_RUN`: The commit SHA that triggers this workflow by `workflow_run` event
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# `CI_BRANCH`: The non-empty branch name from the above two (one and only one of them is empty)
# `CI_SHA`: The non-empty commit SHA from the above two (one and only one of them is empty)
run: |
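# Strip the `refs/heads/` prefix so `CI_BRANCH_PUSH` holds a plain branch name, then export whichever of the push/workflow_run values is non-empty.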
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- name: Update clone using environment variables
working-directory: /transformers
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- name: Cleanup
working-directory: /transformers
run: |
rm -rf tests/__pycache__
rm -rf tests/models/__pycache__
rm -rf reports
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Fetch the tests to run
working-directory: /transformers
# TODO: add `git-python` to the docker images
run: |
pip install --upgrade git-python
python3 utils/tests_fetcher.py --diff_with_last_commit | tee test_preparation.txt
- name: Report fetched tests
uses: actions/upload-artifact@v4
with:
name: test_fetched
path: /transformers/test_preparation.txt
- id: set-matrix
name: Organize tests into models
working-directory: /transformers
# `keys` is used as the GitHub Actions matrix for jobs, i.e. `models/bert`, `tokenization`, `pipeline`, etc.
# `test_map` is used to get the actual identified test files under each key.
# If there are no tests to run (so no `test_map.json` file), create a dummy map (an empty matrix would fail).
run: |
if [ -f test_map.json ]; then
keys=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); d = list(test_map.keys()); print(d)')
test_map=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); print(test_map)')
else
keys=$(python3 -c 'keys = ["dummy"]; print(keys)')
test_map=$(python3 -c 'test_map = {"dummy": []}; print(test_map)')
fi
echo $keys
echo $test_map
echo "matrix=$keys" >> $GITHUB_OUTPUT
echo "test_map=$test_map" >> $GITHUB_OUTPUT
run_models_gpu:
name: Model tests
needs: setup_gpu
# `dummy` means there is no test to run
if: contains(fromJson(needs.setup_gpu.outputs.matrix), 'dummy') != true
strategy:
fail-fast: false
matrix:
folders: ${{ fromJson(needs.setup_gpu.outputs.matrix) }}
machine_type: [single-gpu, multi-gpu]
runs-on: [self-hosted, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
container:
image: huggingface/transformers-pytorch-amd-gpu-push-ci # <--- We test only for PyTorch for now
options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
env:
# For the meaning of these environment variables, see the job `Setup`
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- name: Update clone using environment variables
working-directory: /transformers
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
- name: Echo folder ${{ matrix.folders }}
shell: bash
# For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to
# set the artifact folder names (because the character `/` is not allowed).
run: |
echo "${{ matrix.folders }}"
echo "${{ fromJson(needs.setup_gpu.outputs.test_map)[matrix.folders] }}"
matrix_folders=${{ matrix.folders }}
matrix_folders=${matrix_folders/'models/'/'models_'}
echo "$matrix_folders"
echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV
- name: ROCM-SMI
run: |
rocm-smi
- name: ROCM-INFO
run: |
rocminfo | grep "Agent" -A 14
- name: Show ROCR environment
run: |
echo "ROCR: $ROCR_VISIBLE_DEVICES"
- name: Environment
working-directory: /transformers
run: |
python3 utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Run all non-slow selected tests on GPU
working-directory: /transformers
run: |
python3 -m pytest -n 2 --dist=loadfile -v --make-reports=${{ matrix.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports ${{ fromJson(needs.setup_gpu.outputs.test_map)[matrix.folders] }} -m "not not_device_test"
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /transformers/reports/${{ matrix.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports/failures_short.txt
- name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_models_gpu_${{ env.matrix_folders }}_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ matrix.machine_type }}_run_models_gpu_${{ env.matrix_folders }}_test_reports
path: /transformers/reports/${{ matrix.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports
send_results:
name: Send results to webhook
runs-on: ubuntu-22.04
if: always()
needs: [
check_runner_status,
check_runners,
setup_gpu,
run_models_gpu,
# run_tests_torch_cuda_extensions_single_gpu,
# run_tests_torch_cuda_extensions_multi_gpu
]
env:
# For the meaning of these environment variables, see the job `Setup`
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
- name: Preliminary job status
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
echo "Runner availability: ${{ needs.check_runner_status.result }}"
echo "Setup status: ${{ needs.setup_gpu.result }}"
echo "Runner status: ${{ needs.check_runners.result }}"
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- uses: actions/checkout@v4
# To avoid failure when multiple commits are merged into `main` in a short period of time.
# Checking out an old commit beyond the fetch depth fails with `fatal: reference is not a tree: ...`.
# (Only required for the `workflow_run` event, where we get the latest HEAD on `main` instead of the event commit.)
with:
fetch-depth: 20
- name: Update clone using environment variables
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- uses: actions/download-artifact@v4
- name: Send message to Slack
env:
CI_SLACK_BOT_TOKEN: ${{ secrets.CI_SLACK_BOT_TOKEN }}
CI_SLACK_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID }}
CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }}
CI_SLACK_CHANNEL_ID_AMD: ${{ secrets.CI_SLACK_CHANNEL_ID_AMD }}
CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }}
CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_AMD }}
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
CI_EVENT: Push CI (AMD) - ${{ inputs.gpu_flavor }}
CI_TITLE_PUSH: ${{ github.event.head_commit.message }}
CI_TITLE_WORKFLOW_RUN: ${{ github.event.workflow_run.head_commit.message }}
CI_SHA: ${{ env.CI_SHA }}
RUNNER_STATUS: ${{ needs.check_runner_status.result }}
RUNNER_ENV_STATUS: ${{ needs.check_runners.result }}
SETUP_STATUS: ${{ needs.setup_gpu.result }}
# We pass `needs.setup_gpu.outputs.matrix` as the argument. `notification_service.py` has to convert
# `models/bert` to `models_bert`, since the artifact names use `_` instead of `/`.
run: |
pip install huggingface_hub
pip install slack_sdk
pip show slack_sdk
python utils/notification_service.py "${{ needs.setup_gpu.outputs.matrix }}"

54
.github/workflows/self-push-caller.yml vendored Normal file
View File

@ -0,0 +1,54 @@
# Used to trigger self-push CI
name: Self-hosted runner (push-caller)
on:
push:
branches:
- main
paths:
- "src/**"
- "tests/**"
- ".github/**"
- "templates/**"
- "utils/**"
jobs:
check-for-setup:
runs-on: ubuntu-22.04
name: Check if setup was changed
outputs:
changed: ${{ steps.was_changed.outputs.changed }}
steps:
- uses: actions/checkout@v4
with:
fetch-depth: "2"
- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@1c8e6069583811afb28f97afeaf8e7da80c6be5c
- name: Was setup changed
id: was_changed
run: |
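# Flag the run when setup.py is among the changed files so the push-CI docker images get rebuilt below.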
for file in ${{ steps.changed-files.outputs.all_changed_files }}; do
if [ `basename "${file}"` = "setup.py" ]; then
echo "changed=1" >> $GITHUB_OUTPUT
fi
done
build-docker-containers:
needs: check-for-setup
if: (github.event_name == 'push') && (needs.check-for-setup.outputs.changed == '1')
uses: ./.github/workflows/build-docker-images.yml
with:
image_postfix: "-push-ci"
secrets: inherit
run_push_ci:
name: Trigger Push CI
runs-on: ubuntu-22.04
if: ${{ always() }}
needs: build-docker-containers
steps:
- name: Trigger push CI via workflow_run
run: echo "Trigger push CI via workflow_run"
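# The echo is a placeholder: the actual push CI is triggered by self-push.yml, which listens for this caller workflow's `workflow_run` completion event.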

652
.github/workflows/self-push.yml vendored Normal file
View File

@ -0,0 +1,652 @@
name: Self-hosted runner (push)
on:
workflow_run:
workflows: ["Self-hosted runner (push-caller)"]
branches: ["main"]
types: [completed]
push:
branches:
- ci_*
- ci-*
paths:
- "src/**"
- "tests/**"
- ".github/**"
- "templates/**"
- "utils/**"
repository_dispatch:
env:
HF_HOME: /mnt/cache
TRANSFORMERS_IS_CI: yes
OMP_NUM_THREADS: 8
MKL_NUM_THREADS: 8
PYTEST_TIMEOUT: 60
TF_FORCE_GPU_ALLOW_GROWTH: true
CUDA_VISIBLE_DEVICES: 0,1
jobs:
setup:
name: Setup
strategy:
matrix:
machine_type: [aws-g5-4xlarge-cache, aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-all-latest-gpu-push-ci
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
test_map: ${{ steps.set-matrix.outputs.test_map }}
env:
# `CI_BRANCH_PUSH`: The branch name from the push event
# `CI_BRANCH_WORKFLOW_RUN`: The name of the branch on which this workflow is triggered by `workflow_run` event
# `CI_SHA_PUSH`: The commit SHA from the push event
# `CI_SHA_WORKFLOW_RUN`: The commit SHA that triggers this workflow by `workflow_run` event
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# `CI_BRANCH`: The non-empty branch name from the above two (one and only one of them is empty)
# `CI_SHA`: The non-empty commit SHA from the above two (one and only one of them is empty)
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- name: Update clone using environment variables
working-directory: /transformers
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- name: Cleanup
working-directory: /transformers
run: |
rm -rf tests/__pycache__
rm -rf tests/models/__pycache__
rm -rf reports
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Fetch the tests to run
working-directory: /transformers
# TODO: add `git-python` to the docker images
run: |
pip install --upgrade git-python
python3 utils/tests_fetcher.py --diff_with_last_commit | tee test_preparation.txt
- name: Report fetched tests
uses: actions/upload-artifact@v4
with:
name: test_fetched
path: /transformers/test_preparation.txt
- id: set-matrix
name: Organize tests into models
working-directory: /transformers
# `keys` is used as the GitHub Actions matrix for jobs, i.e. `models/bert`, `tokenization`, `pipeline`, etc.
# `test_map` is used to get the actual identified test files under each key.
# If there are no tests to run (so no `test_map.json` file), create a dummy map (an empty matrix would fail).
run: |
if [ -f test_map.json ]; then
keys=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); d = list(test_map.keys()); print(d)')
test_map=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); print(test_map)')
else
keys=$(python3 -c 'keys = ["dummy"]; print(keys)')
test_map=$(python3 -c 'test_map = {"dummy": []}; print(test_map)')
fi
echo $keys
echo $test_map
echo "matrix=$keys" >> $GITHUB_OUTPUT
echo "test_map=$test_map" >> $GITHUB_OUTPUT
run_tests_single_gpu:
name: Model tests
needs: setup
# `dummy` means there is no test to run
if: contains(fromJson(needs.setup.outputs.matrix), 'dummy') != true
strategy:
fail-fast: false
matrix:
folders: ${{ fromJson(needs.setup.outputs.matrix) }}
machine_type: [aws-g5-4xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-all-latest-gpu-push-ci
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
env:
# For the meaning of these environment variables, see the job `Setup`
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
run: |
echo "${{ matrix.machine_type }}"
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
echo "machine_type=$machine_type" >> $GITHUB_ENV
- name: Update clone using environment variables
working-directory: /transformers
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
- name: Echo folder ${{ matrix.folders }}
shell: bash
# For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to
# set the artifact folder names (because the character `/` is not allowed).
run: |
echo "${{ matrix.folders }}"
echo "${{ fromJson(needs.setup.outputs.test_map)[matrix.folders] }}"
matrix_folders=${{ matrix.folders }}
matrix_folders=${matrix_folders/'models/'/'models_'}
echo "$matrix_folders"
echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV
- name: NVIDIA-SMI
run: |
nvidia-smi
- name: Environment
working-directory: /transformers
run: |
python3 utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Run all non-slow selected tests on GPU
working-directory: /transformers
run: |
python3 -m pytest -n 2 --dist=loadfile -v --make-reports=${{ env.machine_type }}_tests_gpu_${{ matrix.folders }} ${{ fromJson(needs.setup.outputs.test_map)[matrix.folders] }}
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /transformers/reports/${{ env.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports
path: /transformers/reports/${{ env.machine_type }}_tests_gpu_${{ matrix.folders }}
run_tests_multi_gpu:
name: Model tests
needs: setup
# `dummy` means there is no test to run
if: contains(fromJson(needs.setup.outputs.matrix), 'dummy') != true
strategy:
fail-fast: false
matrix:
folders: ${{ fromJson(needs.setup.outputs.matrix) }}
machine_type: [aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-all-latest-gpu-push-ci
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
env:
# For the meaning of these environment variables, see the job `Setup`
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
run: |
echo "${{ matrix.machine_type }}"
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
echo "machine_type=$machine_type" >> $GITHUB_ENV
- name: Update clone using environment variables
working-directory: /transformers
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
- name: Echo folder ${{ matrix.folders }}
shell: bash
# For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to
# set the artifact folder names (because the character `/` is not allowed).
run: |
echo "${{ matrix.folders }}"
echo "${{ fromJson(needs.setup.outputs.test_map)[matrix.folders] }}"
matrix_folders=${{ matrix.folders }}
matrix_folders=${matrix_folders/'models/'/'models_'}
echo "$matrix_folders"
echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV
- name: NVIDIA-SMI
run: |
nvidia-smi
- name: Environment
working-directory: /transformers
run: |
python3 utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Run all non-slow selected tests on GPU
env:
MKL_SERVICE_FORCE_INTEL: 1
working-directory: /transformers
run: |
python3 -m pytest -n 2 --dist=loadfile -v --make-reports=${{ env.machine_type }}_tests_gpu_${{ matrix.folders }} ${{ fromJson(needs.setup.outputs.test_map)[matrix.folders] }}
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /transformers/reports/${{ env.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports
path: /transformers/reports/${{ env.machine_type }}_tests_gpu_${{ matrix.folders }}
run_tests_torch_cuda_extensions_single_gpu:
name: Torch CUDA extension tests
needs: setup
if: contains(fromJson(needs.setup.outputs.matrix), 'deepspeed') || contains(fromJson(needs.setup.outputs.matrix), 'extended')
strategy:
fail-fast: false
matrix:
machine_type: [aws-g5-4xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
env:
# For the meaning of these environment variables, see the job `Setup`
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- name: Set `machine_type` for report and artifact names
working-directory: /workspace/transformers
shell: bash
run: |
echo "${{ matrix.machine_type }}"
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
echo "machine_type=$machine_type" >> $GITHUB_ENV
- name: Update clone using environment variables
working-directory: /workspace/transformers
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /workspace/transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
- name: Remove cached torch extensions
run: rm -rf /github/home/.cache/torch_extensions/
# To avoid unknown test failures
- name: Pre build DeepSpeed *again*
working-directory: /workspace
run: |
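# DS_BUILD_CPU_ADAM=1 and DS_BUILD_FUSED_ADAM=1 pre-compile the DeepSpeed Adam ops at install time instead of JIT-building them during the tests.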
python3 -m pip uninstall -y deepspeed
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
- name: NVIDIA-SMI
run: |
nvidia-smi
- name: Environment
working-directory: /workspace/transformers
run: |
python utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /workspace/transformers
run: pip freeze
- name: Run all non-slow selected tests on GPU
working-directory: /workspace/transformers
# TODO: Here we pass all tests in the 2 folders for simplicity. It's better to pass only the identified tests.
run: |
python -m pytest -n 1 --dist=loadfile -v --make-reports=${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports tests/deepspeed tests/extended
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /workspace/transformers/reports/${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports
path: /workspace/transformers/reports/${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports
run_tests_torch_cuda_extensions_multi_gpu:
name: Torch CUDA extension tests
needs: setup
if: contains(fromJson(needs.setup.outputs.matrix), 'deepspeed') || contains(fromJson(needs.setup.outputs.matrix), 'extended')
strategy:
fail-fast: false
matrix:
machine_type: [aws-g5-12xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
env:
# For the meaning of these environment variables, see the job `Setup`
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- name: Set `machine_type` for report and artifact names
working-directory: /workspace/transformers
shell: bash
run: |
echo "${{ matrix.machine_type }}"
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
echo "machine_type=$machine_type" >> $GITHUB_ENV
- name: Update clone using environment variables
working-directory: /workspace/transformers
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /workspace/transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .
- name: Remove cached torch extensions
run: rm -rf /github/home/.cache/torch_extensions/
# To avoid unknown test failures
- name: Pre build DeepSpeed *again*
working-directory: /workspace
run: |
python3 -m pip uninstall -y deepspeed
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
- name: NVIDIA-SMI
run: |
nvidia-smi
- name: Environment
working-directory: /workspace/transformers
run: |
python utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /workspace/transformers
run: pip freeze
- name: Run all non-slow selected tests on GPU
working-directory: /workspace/transformers
# TODO: Here we pass all tests in the 2 folders for simplicity. It's better to pass only the identified tests.
run: |
python -m pytest -n 1 --dist=loadfile -v --make-reports=${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports tests/deepspeed tests/extended
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /workspace/transformers/reports/${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports
path: /workspace/transformers/reports/${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports
send_results:
name: Send results to webhook
runs-on: ubuntu-22.04
if: always()
needs: [
setup,
run_tests_single_gpu,
run_tests_multi_gpu,
run_tests_torch_cuda_extensions_single_gpu,
run_tests_torch_cuda_extensions_multi_gpu
]
env:
# For the meaning of these environment variables, see the job `Setup`
CI_BRANCH_PUSH: ${{ github.event.ref }}
CI_BRANCH_WORKFLOW_RUN: ${{ github.event.workflow_run.head_branch }}
CI_SHA_PUSH: ${{ github.event.head_commit.id }}
CI_SHA_WORKFLOW_RUN: ${{ github.event.workflow_run.head_sha }}
steps:
- name: Preliminary job status
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
echo "Setup status: ${{ needs.setup.result }}"
# Necessary to get the correct branch name and commit SHA for `workflow_run` event
# We also take into account the `push` event (we might want to test some changes in a branch)
- name: Prepare custom environment variables
shell: bash
# For the meaning of these environment variables, see the job `Setup`
run: |
CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''}
echo $CI_BRANCH_PUSH
echo $CI_BRANCH_WORKFLOW_RUN
echo $CI_SHA_PUSH
echo $CI_SHA_WORKFLOW_RUN
[[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV
[[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV
- name: print environment variables
run: |
echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}"
echo "env.CI_SHA = ${{ env.CI_SHA }}"
- uses: actions/checkout@v4
# To avoid failure when multiple commits are merged into `main` in a short period of time.
# Checking out an old commit beyond the fetch depth fails with `fatal: reference is not a tree: ...`.
# (Only required for the `workflow_run` event, where we get the latest HEAD on `main` instead of the event commit.)
with:
fetch-depth: 20
- name: Update clone using environment variables
run: |
echo "original branch = $(git branch --show-current)"
git fetch && git checkout ${{ env.CI_BRANCH }}
echo "updated branch = $(git branch --show-current)"
git checkout ${{ env.CI_SHA }}
echo "log = $(git log -n 1)"
- uses: actions/download-artifact@v4
- name: Send message to Slack
env:
CI_SLACK_BOT_TOKEN: ${{ secrets.CI_SLACK_BOT_TOKEN }}
CI_SLACK_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID }}
CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }}
CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }}
CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID }}
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
CI_EVENT: push
CI_TITLE_PUSH: ${{ github.event.head_commit.message }}
CI_TITLE_WORKFLOW_RUN: ${{ github.event.workflow_run.head_commit.message }}
CI_SHA: ${{ env.CI_SHA }}
SETUP_STATUS: ${{ needs.setup.result }}
# We pass `needs.setup.outputs.matrix` as the argument. `notification_service.py` has to convert
# `models/bert` to `models_bert`, since the artifact names use `_` instead of `/`.
run: |
pip install huggingface_hub
pip install slack_sdk
pip show slack_sdk
python utils/notification_service.py "${{ needs.setup.outputs.matrix }}"

View File

@ -2,7 +2,7 @@ name: Self-hosted runner (AMD scheduled CI caller)
on:
schedule:
- cron: "17 5 * * *"
- cron: "17 2 * * *"
jobs:
run_scheduled_amd_ci:

View File

@ -21,7 +21,7 @@ jobs:
job: run_models_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: hfc-amd-mi355
docker: huggingface/transformers-pytorch-amd-gpu
docker: huggingface/testing-rocm7.0-preview
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: hf-transformers-bot/transformers-ci-dummy
secrets: inherit
@ -33,7 +33,7 @@ jobs:
job: run_pipelines_torch_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: hfc-amd-mi355
docker: huggingface/transformers-pytorch-amd-gpu
docker: huggingface/testing-rocm7.0-preview
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: hf-transformers-bot/transformers-ci-dummy
secrets: inherit
@ -45,7 +45,7 @@ jobs:
job: run_examples_gpu
slack_report_channel: "#amd-hf-ci"
runner_group: hfc-amd-mi355
docker: huggingface/transformers-pytorch-amd-gpu
docker: huggingface/testing-rocm7.0-preview
ci_event: Scheduled CI (AMD) - mi355
report_repo_id: hf-transformers-bot/transformers-ci-dummy
secrets: inherit

View File

@ -33,13 +33,10 @@ jobs:
runs-on: ubuntu-22.04
steps:
- name: Setup
env:
prev_workflow_run_id: ${{ inputs.prev_workflow_run_id || env.prev_workflow_run_id }}
other_workflow_run_id: ${{ inputs.other_workflow_run_id || env.other_workflow_run_id }}
run: |
mkdir "setup_values"
echo "$prev_workflow_run_id" > "setup_values/prev_workflow_run_id.txt"
echo "$other_workflow_run_id" > "setup_values/other_workflow_run_id.txt"
echo "${{ inputs.prev_workflow_run_id || env.prev_workflow_run_id }}" > "setup_values/prev_workflow_run_id.txt"
echo "${{ inputs.other_workflow_run_id || env.other_workflow_run_id }}" > "setup_values/other_workflow_run_id.txt"
- name: Upload artifacts
uses: actions/upload-artifact@v4
@ -66,7 +63,7 @@ jobs:
with:
job: run_pipelines_torch_gpu
slack_report_channel: "#transformers-ci-daily-pipeline-torch"
docker: huggingface/transformers-all-latest-gpu
docker: huggingface/transformers-pytorch-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
@ -121,15 +118,3 @@ jobs:
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit
kernels-ci:
name: Kernels CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_kernels_gpu
slack_report_channel: "#transformers-ci-daily-kernels"
docker: huggingface/transformers-all-latest-gpu
ci_event: Daily CI
report_repo_id: hf-internal-testing/transformers_daily_ci
commit_sha: ${{ github.sha }}
secrets: inherit

View File

@ -1,60 +0,0 @@
name: Nvidia CI - Flash Attn
on:
repository_dispatch:
schedule:
- cron: "17 2 * * *"
push:
branches:
- run_nvidia_ci_flash_attn*
workflow_dispatch:
inputs:
prev_workflow_run_id:
description: 'previous workflow run id to compare'
type: string
required: false
default: ""
other_workflow_run_id:
description: 'other workflow run id to compare'
type: string
required: false
default: ""
# Used for `push` to easily modify the target workflow runs to compare against
env:
prev_workflow_run_id: ""
other_workflow_run_id: ""
jobs:
setup:
name: Setup
runs-on: ubuntu-22.04
steps:
- name: Setup
run: |
mkdir "setup_values"
echo "${{ inputs.prev_workflow_run_id || env.prev_workflow_run_id }}" > "setup_values/prev_workflow_run_id.txt"
echo "${{ inputs.other_workflow_run_id || env.other_workflow_run_id }}" > "setup_values/other_workflow_run_id.txt"
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: setup_values
path: setup_values
model-ci:
name: Model CI
uses: ./.github/workflows/self-scheduled.yml
with:
job: run_models_gpu
slack_report_channel: "#transformers-ci-flash-attn"
docker: huggingface/transformers-all-latest-gpu:flash-attn
ci_event: Daily CI
runner_type: "a10"
report_repo_id: hf-internal-testing/transformers_flash_attn_ci
commit_sha: ${{ github.sha }}
pytest_marker: "flash_attn_test or flash_attn_3_test"
secrets: inherit

View File

@ -34,20 +34,10 @@ on:
runner_type:
required: false
type: string
subdirs:
models:
default: ""
required: false
type: string
pytest_marker:
required: false
type: string
pr_number:
required: false
type: string
outputs:
report:
description: "Content of the report of new failures"
value: ${{ jobs.check_new_failures.outputs.report }}
env:
HF_HOME: /mnt/cache
@ -60,6 +50,7 @@ env:
HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
TF_FORCE_GPU_ALLOW_GROWTH: true
CUDA_VISIBLE_DEVICES: 0,1
NUM_SLICES: 2
jobs:
setup:
@ -80,11 +71,8 @@ jobs:
steps:
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: |
git fetch origin $commit_sha
git fetch && git checkout $commit_sha
git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Cleanup
working-directory: /transformers
@ -101,17 +89,11 @@ jobs:
if: contains(fromJSON('["run_models_gpu", "run_trainer_and_fsdp_gpu"]'), inputs.job)
name: Identify models to test
working-directory: /transformers/tests
env:
job: ${{ inputs.job }}
subdirs: ${{ inputs.subdirs }}
NUM_SLICES: 2
run: |
if [ "$job" = "run_models_gpu" ]; then
python3 ../utils/split_model_tests.py --subdirs "$subdirs" --num_splits "$NUM_SLICES" > folder_slices.txt
echo "folder_slices=$(cat folder_slices.txt)" >> $GITHUB_OUTPUT
python3 -c "import ast; folder_slices = ast.literal_eval(open('folder_slices.txt').read()); open('slice_ids.txt', 'w').write(str(list(range(len(folder_slices)))))"
echo "slice_ids=$(cat slice_ids.txt)" >> $GITHUB_OUTPUT
elif [ "$job" = "run_trainer_and_fsdp_gpu" ]; then
if [ "${{ inputs.job }}" = "run_models_gpu" ]; then
echo "folder_slices=$(python3 ../utils/split_model_tests.py --models '${{ inputs.models }}' --num_splits ${{ env.NUM_SLICES }})" >> $GITHUB_OUTPUT
echo "slice_ids=$(python3 -c 'd = list(range(${{ env.NUM_SLICES }})); print(d)')" >> $GITHUB_OUTPUT
elif [ "${{ inputs.job }}" = "run_trainer_and_fsdp_gpu" ]; then
echo "folder_slices=[['trainer'], ['fsdp']]" >> $GITHUB_OUTPUT
echo "slice_ids=[0, 1]" >> $GITHUB_OUTPUT
fi
@ -120,10 +102,8 @@ jobs:
if: ${{ inputs.job == 'run_quantization_torch_gpu' }}
name: Identify quantization method to test
working-directory: /transformers/tests
env:
subdirs: ${{ inputs.subdirs || 'None' }}
run: |
echo "quantization_matrix=$(python3 -c 'import ast; import os; tests = os.getcwd(); quantization_tests = os.listdir(os.path.join(tests, "quantization")); subdirs = ast.literal_eval(os.environ["subdirs"]); quantization_tests = [x.removeprefix("quantization/") for x in subdirs] if subdirs is not None else quantization_tests; d = sorted(list(filter(os.path.isdir, [f"quantization/{x}" for x in quantization_tests]))); print(d)')" >> $GITHUB_OUTPUT
echo "quantization_matrix=$(python3 -c 'import os; tests = os.getcwd(); quantization_tests = os.listdir(os.path.join(tests, "quantization")); d = sorted(list(filter(os.path.isdir, [f"quantization/{x}" for x in quantization_tests]))) ; print(d)')" >> $GITHUB_OUTPUT
- name: NVIDIA-SMI
run: |
@ -147,7 +127,6 @@ jobs:
commit_sha: ${{ inputs.commit_sha || github.sha }}
runner_type: ${{ inputs.runner_type }}
report_repo_id: ${{ inputs.report_repo_id }}
pytest_marker: ${{ inputs.pytest_marker }}
secrets: inherit
run_trainer_and_fsdp_gpu:
@ -181,14 +160,12 @@ jobs:
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: huggingface/transformers-all-latest-gpu
image: huggingface/transformers-pytorch-gpu
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: git fetch && git checkout "$commit_sha"
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -210,17 +187,15 @@ jobs:
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
env:
matrix_machine_type: ${{ matrix.machine_type }}
run: |
echo "$matrix_machine_type"
echo "${{ matrix.machine_type }}"
if [ "$matrix_machine_type" = "aws-g5-4xlarge-cache" ]; then
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "$matrix_machine_type" = "aws-g5-12xlarge-cache" ]; then
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type="$matrix_machine_type"
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
@ -229,12 +204,12 @@ jobs:
- name: Run all pipeline tests on GPU
working-directory: /transformers
run: |
python3 -m pytest -n 1 -v --dist=loadfile --make-reports="${machine_type}_run_pipelines_torch_gpu_test_reports" tests/pipelines
python3 -m pytest -n 1 -v --dist=loadfile --make-reports=${{ env.machine_type }}_run_pipelines_torch_gpu_test_reports tests/pipelines
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat "/transformers/reports/${machine_type}_run_pipelines_torch_gpu_test_reports/failures_short.txt"
run: cat /transformers/reports/${{ env.machine_type }}_run_pipelines_torch_gpu_test_reports/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_pipelines_torch_gpu_test_reports"
if: ${{ always() }}
@ -258,9 +233,7 @@ jobs:
steps:
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: git fetch && git checkout "$commit_sha"
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -282,17 +255,15 @@ jobs:
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
env:
matrix_machine_type: ${{ matrix.machine_type }}
run: |
echo "$matrix_machine_type"
echo "${{ matrix.machine_type }}"
if [ "$matrix_machine_type" = "aws-g5-4xlarge-cache" ]; then
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "$matrix_machine_type" = "aws-g5-12xlarge-cache" ]; then
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type="$matrix_machine_type"
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
@ -302,12 +273,12 @@ jobs:
working-directory: /transformers
run: |
pip install -r examples/pytorch/_tests_requirements.txt
python3 -m pytest -v --make-reports="${machine_type}_run_examples_gpu_test_reports" examples/pytorch
python3 -m pytest -v --make-reports=${{ env.machine_type }}_run_examples_gpu_test_reports examples/pytorch
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat "/transformers/reports/${machine_type}_run_examples_gpu_test_reports/failures_short.txt"
run: cat /transformers/reports/${{ env.machine_type }}_run_examples_gpu_test_reports/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_examples_gpu_test_reports"
if: ${{ always() }}
@ -331,9 +302,7 @@ jobs:
steps:
- name: Update clone
working-directory: ${{ inputs.working-directory-prefix }}/transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: git fetch && git checkout "$commit_sha"
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: ${{ inputs.working-directory-prefix }}/transformers
@ -355,7 +324,7 @@ jobs:
working-directory: ${{ inputs.working-directory-prefix }}/
run: |
python3 -m pip uninstall -y deepspeed
DS_DISABLE_NINJA=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --no-build-isolation --config-settings="--build-option=build_ext" --config-settings="--build-option=-j8" --no-cache -v --disable-pip-version-check
DS_DISABLE_NINJA=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
# To avoid unknown test failures
- name: Pre build DeepSpeed *again* (for nightly & Past CI)
@ -365,7 +334,7 @@ jobs:
python3 -m pip uninstall -y deepspeed
rm -rf DeepSpeed
git clone https://github.com/deepspeedai/DeepSpeed && cd DeepSpeed && rm -rf build
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install . --no-build-isolation --config-settings="--build-option=build_ext" --config-settings="--build-option=-j8" --no-cache -v --disable-pip-version-check
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
- name: NVIDIA-SMI
run: |
@ -383,17 +352,15 @@ jobs:
- name: Set `machine_type` for report and artifact names
working-directory: ${{ inputs.working-directory-prefix }}/transformers
shell: bash
env:
matrix_machine_type: ${{ matrix.machine_type }}
run: |
echo "$matrix_machine_type"
echo "${{ matrix.machine_type }}"
if [ "$matrix_machine_type" = "aws-g5-4xlarge-cache" ]; then
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "$matrix_machine_type" = "aws-g5-12xlarge-cache" ]; then
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type="$matrix_machine_type"
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
@ -402,14 +369,12 @@ jobs:
- name: Run all tests on GPU
working-directory: ${{ inputs.working-directory-prefix }}/transformers
run: |
python3 -m pytest -v --make-reports="${machine_type}_run_torch_cuda_extensions_gpu_test_reports" tests/deepspeed tests/extended
python3 -m pytest -v --make-reports=${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports tests/deepspeed tests/extended
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
env:
working_directory_prefix: ${{ inputs.working-directory-prefix }}
run: cat "${working_directory_prefix}/transformers/reports/${machine_type}_run_torch_cuda_extensions_gpu_test_reports/failures_short.txt"
run: cat ${{ inputs.working-directory-prefix }}/transformers/reports/${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_torch_cuda_extensions_gpu_test_reports"
if: ${{ always() }}
@ -436,19 +401,16 @@ jobs:
steps:
- name: Echo folder ${{ matrix.folders }}
shell: bash
env:
matrix_folders_raw: ${{ matrix.folders }}
run: |
echo "$matrix_folders_raw"
matrix_folders="${matrix_folders_raw/'quantization/'/'quantization_'}"
echo "${{ matrix.folders }}"
matrix_folders=${{ matrix.folders }}
matrix_folders=${matrix_folders/'quantization/'/'quantization_'}
echo "$matrix_folders"
echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: git fetch && git checkout "$commit_sha"
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
@ -470,17 +432,15 @@ jobs:
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
env:
matrix_machine_type: ${{ matrix.machine_type }}
run: |
echo "$matrix_machine_type"
echo "${{ matrix.machine_type }}"
if [ "$matrix_machine_type" = "aws-g5-4xlarge-cache" ]; then
if [ "${{ matrix.machine_type }}" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "$matrix_machine_type" = "aws-g5-12xlarge-cache" ]; then
elif [ "${{ matrix.machine_type }}" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type="$matrix_machine_type"
machine_type=${{ matrix.machine_type }}
fi
echo "$machine_type"
@ -488,96 +448,20 @@ jobs:
- name: Run quantization tests on GPU
working-directory: /transformers
env:
folders: ${{ matrix.folders }}
run: |
python3 -m pytest -v --make-reports="${machine_type}_run_quantization_torch_gpu_${matrix_folders}_test_reports" tests/${folders}
python3 -m pytest -v --make-reports=${{ env.machine_type }}_run_quantization_torch_gpu_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat "/transformers/reports/${machine_type}_run_quantization_torch_gpu_${matrix_folders}_test_reports/failures_short.txt"
run: cat /transformers/reports/${{ env.machine_type }}_run_quantization_torch_gpu_${{ matrix.folders }}_test_reports/failures_short.txt
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_quantization_torch_gpu_${{ env.matrix_folders }}_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_quantization_torch_gpu_${{ env.matrix_folders }}_test_reports
path: /transformers/reports/${{ env.machine_type }}_run_quantization_torch_gpu_${{ env.matrix_folders }}_test_reports
run_kernels_gpu:
if: ${{ inputs.job == 'run_kernels_gpu' }}
name: Kernel tests
strategy:
fail-fast: false
matrix:
machine_type: [aws-g5-4xlarge-cache]
runs-on:
group: '${{ matrix.machine_type }}'
container:
image: ${{ inputs.docker }}
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ inputs.commit_sha || github.sha }}
run: git fetch && git checkout "$commit_sha"
- name: Reinstall transformers in edit mode
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .[testing]
- name: Install kernels
working-directory: /transformers
run: python3 -m pip install -U kernels
- name: NVIDIA-SMI
run: nvidia-smi
- name: Environment
working-directory: /transformers
run: python3 utils/print_env.py
- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze
- name: Set `machine_type` for report and artifact names
working-directory: /transformers
shell: bash
env:
matrix_machine_type: ${{ matrix.machine_type }}
run: |
echo "$matrix_machine_type"
if [ "$matrix_machine_type" = "aws-g5-4xlarge-cache" ]; then
machine_type=single-gpu
elif [ "$matrix_machine_type" = "aws-g5-12xlarge-cache" ]; then
machine_type=multi-gpu
else
machine_type="$matrix_machine_type"
fi
echo "$machine_type"
echo "machine_type=$machine_type" >> $GITHUB_ENV
- name: Run kernel tests on GPU
working-directory: /transformers
run: |
python3 -m pytest -v --make-reports="${machine_type}_run_kernels_gpu_test_reports" tests/kernels/test_kernels.py
- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat "/transformers/reports/${machine_type}_run_kernels_gpu_test_reports/failures_short.txt"
- name: "Test suite reports artifacts: ${{ env.machine_type }}_run_kernels_gpu_test_reports"
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: ${{ env.machine_type }}_run_kernels_gpu_test_reports
path: /transformers/reports/${{ env.machine_type }}_run_kernels_gpu_test_reports
path: /transformers/reports/${{ env.machine_type }}_run_quantization_torch_gpu_${{ matrix.folders }}_test_reports
run_extract_warnings:
# Let's only do this for the job `run_models_gpu` to simplify the (already complex) logic.
@ -586,10 +470,11 @@ jobs:
runs-on: ubuntu-22.04
needs: [setup, run_models_gpu]
steps:
# Checkout in order to run `utils/extract_warnings.py`. Avoid **explicit** checkout (i.e. don't specify `ref`) for
# security reasons.
- name: Checkout transformers
uses: actions/checkout@v4
with:
fetch-depth: 2
ref: ${{ inputs.commit_sha || github.sha }}
- name: Install transformers
run: pip install transformers
@ -609,12 +494,9 @@ jobs:
working-directory: warnings_in_ci
- name: Extract warnings in CI artifacts
env:
github_run_id: ${{ github.run_id }}
access_token: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
run: |
python3 utils/extract_warnings.py --workflow_run_id "$github_run_id" --output_dir warnings_in_ci --token "$access_token" --from_gh
echo "$(python3 -c 'import os; import json; fp = open("warnings_in_ci/selected_warnings.json"); d = json.load(fp); d = "\n".join(d); print(d)')"
python3 utils/extract_warnings.py --workflow_run_id ${{ github.run_id }} --output_dir warnings_in_ci --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} --from_gh
echo "$(python3 -c 'import os; import json; fp = open("warnings_in_ci/selected_warnings.json"); d = json.load(fp); d = "\n".join(d) ;print(d)')"
- name: Upload artifact
if: ${{ always() }}
@ -633,7 +515,6 @@ jobs:
run_examples_gpu,
run_torch_cuda_extensions_gpu,
run_quantization_torch_gpu,
run_kernels_gpu,
run_extract_warnings
]
if: always() && !cancelled()
@ -653,17 +534,16 @@ jobs:
secrets: inherit
check_new_failures:
if: ${{ always() && needs.send_results.result == 'success' }}
if: ${{ always() && inputs.ci_event == 'Daily CI' && needs.send_results.result == 'success' }}
name: Check new failures
needs: send_results
uses: ./.github/workflows/check_failed_tests.yml
with:
docker: ${{ inputs.docker }}
commit_sha: ${{ inputs.commit_sha || github.sha }}
start_sha: ${{ inputs.commit_sha || github.sha }}
job: ${{ inputs.job }}
slack_report_channel: ${{ inputs.slack_report_channel }}
ci_event: ${{ inputs.ci_event }}
report_repo_id: ${{ inputs.report_repo_id }}
pr_number: ${{ inputs.pr_number }}
secrets: inherit
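Most hunks in this workflow file trade inline `${{ ... }}` interpolation inside `run:` blocks for values passed through `env:`. Assuming the usual motivation (GitHub Actions expressions are expanded as plain text before the shell ever runs, so passing them as environment variables keeps untrusted values out of the script body), here is a toy Python model of the two styles; the payload string is invented for illustration:

```python
# Toy model of the two styles seen in the hunks above (illustration only).
import os
import subprocess

untrusted = 'main"; echo pwned; "'  # invented value standing in for e.g. an input or commit ref

# Inline-interpolation style: the expression is pasted into the script as text,
# so a crafted value becomes shell code.
inline_script = f'git checkout "{untrusted}"'
print(inline_script)  # git checkout "main"; echo pwned; ""

# env-var style: the value travels as data and the shell only ever sees "$commit_sha".
subprocess.run(
    ["bash", "-c", 'echo checking out "$commit_sha"'],
    env={**os.environ, "commit_sha": untrusted},
    check=True,
)
```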

View File

@ -41,16 +41,13 @@ jobs:
- name: Preliminary job status
shell: bash
# For the meaning of these environment variables, see the job `Setup`
env:
setup_status: ${{ inputs.setup_status }}
run: |
echo "Setup status: $setup_status"
echo "Setup status: ${{ inputs.setup_status }}"
- uses: actions/checkout@v4
with:
fetch-depth: 2
# Security: checkout to the `main` branch for untrusted triggers (issue_comment, pull_request_target), otherwise use the specified ref
ref: ${{ (github.event_name == 'issue_comment' || github.event_name == 'pull_request_target') && 'main' || (inputs.commit_sha || github.sha) }}
ref: ${{ inputs.commit_sha || github.sha }}
- uses: actions/download-artifact@v4
@ -84,8 +81,6 @@ jobs:
CI_TEST_JOB: ${{ inputs.job }}
SETUP_STATUS: ${{ inputs.setup_status }}
REPORT_REPO_ID: ${{ inputs.report_repo_id }}
quantization_matrix: ${{ inputs.quantization_matrix }}
folder_slices: ${{ inputs.folder_slices }}
# We pass `needs.setup.outputs.matrix` as the argument. A processing in `notification_service.py` to change
# `models/bert` to `models_bert` is required, as the artifact names use `_` instead of `/`.
# For a job that doesn't depend on (i.e. `needs`) `setup`, the value for `inputs.folder_slices` would be an
@ -94,10 +89,10 @@ jobs:
pip install huggingface_hub
pip install slack_sdk
pip show slack_sdk
if [ "$quantization_matrix" != "" ]; then
python utils/notification_service.py "$quantization_matrix"
if [ "${{ inputs.quantization_matrix }}" != "" ]; then
python utils/notification_service.py "${{ inputs.quantization_matrix }}"
else
python utils/notification_service.py "$folder_slices"
python utils/notification_service.py "${{ inputs.folder_slices }}"
fi
# Upload complete failure tables, as they might be big and only truncated versions could be sent to Slack.
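The comment in the Slack-report step above notes that `notification_service.py` has to turn folder names like `models/bert` into `models_bert` because artifact names use `_` instead of `/`. A minimal sketch of that normalization (hypothetical helper name, not the actual script):

```python
# Hypothetical helper illustrating the folder -> artifact-name normalization
# described in the workflow comment above; folder names are examples.
def folder_to_artifact_name(folder: str) -> str:
    return folder.replace("/", "_")


assert folder_to_artifact_name("models/bert") == "models_bert"
assert folder_to_artifact_name("quantization/bnb") == "quantization_bnb"
```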

View File

@ -4,7 +4,7 @@ on:
workflow_dispatch:
inputs:
runner_type:
description: 'Type of runner to test (a10)'
description: 'Type of runner to test (a10 or t4)'
required: true
docker_image:
description: 'Name of the Docker image'
@ -36,10 +36,14 @@ jobs:
NUM_GPUS: ${{ github.event.inputs.num_gpus }}
RUNNER_TYPE: ${{ github.event.inputs.runner_type }}
run: |
if [[ "$NUM_GPUS" == "single" && "$RUNNER_TYPE" == "a10" ]]; then
echo "RUNNER=aws-g5-4xlarge-cache-ssh" >> $GITHUB_ENV
if [[ "$NUM_GPUS" == "single" && "$RUNNER_TYPE" == "t4" ]]; then
echo "RUNNER=aws-g4dn-4xlarge-cache" >> $GITHUB_ENV
elif [[ "$NUM_GPUS" == "multi" && "$RUNNER_TYPE" == "t4" ]]; then
echo "RUNNER=aws-g4dn-12xlarge-cache" >> $GITHUB_ENV
elif [[ "$NUM_GPUS" == "single" && "$RUNNER_TYPE" == "a10" ]]; then
echo "RUNNER=aws-g5-4xlarge-cache" >> $GITHUB_ENV
elif [[ "$NUM_GPUS" == "multi" && "$RUNNER_TYPE" == "a10" ]]; then
echo "RUNNER=aws-g5-12xlarge-cache-ssh" >> $GITHUB_ENV
echo "RUNNER=aws-g5-12xlarge-cache" >> $GITHUB_ENV
else
echo "RUNNER=" >> $GITHUB_ENV
fi
@ -47,8 +51,8 @@ jobs:
- name: Set runner to use
id: set_runner
run: |
echo "$RUNNER"
echo "RUNNER=$RUNNER" >> $GITHUB_OUTPUT
echo ${{ env.RUNNER }}
echo "RUNNER=${{ env.RUNNER }}" >> $GITHUB_OUTPUT
ssh_runner:
name: "SSH"
@ -57,13 +61,13 @@ jobs:
group: ${{ needs.get_runner.outputs.RUNNER }}
container:
image: ${{ github.event.inputs.docker_image }}
options: --gpus all --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Update clone
working-directory: /transformers
env:
commit_sha: ${{ github.sha }}
run: |
git fetch && git checkout "$commit_sha"
git fetch && git checkout ${{ github.sha }}
- name: Cleanup
working-directory: /transformers
@ -95,17 +99,14 @@ jobs:
- name: Store Slack infos
# Because SSH can be enabled dynamically when the workflow fails, we store the Slack info here so it can be retrieved during the wait-for-ssh step
shell: bash
env:
user_slack_id: ${{ secrets[format('{0}_{1}', env.github_actor, 'SLACK_ID')] }}
default_slack_channel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
run: |
echo "$github_actor"
if [ "$user_slack_id" != "" ]; then
echo "SLACKCHANNEL=$user_slack_id" >> $GITHUB_ENV
echo "${{ env.github_actor }}"
if [ "${{ secrets[format('{0}_{1}', env.github_actor, 'SLACK_ID')] }}" != "" ]; then
echo "SLACKCHANNEL=${{ secrets[format('{0}_{1}', env.github_actor, 'SLACK_ID')] }}" >> $GITHUB_ENV
else
echo "SLACKCHANNEL=$default_slack_channel" >> $GITHUB_ENV
echo "SLACKCHANNEL=${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}" >> $GITHUB_ENV
fi
- name: Tailscale # In order to be able to SSH when a test fails
uses: huggingface/tailscale-action@main
with:

4
.gitignore vendored
View File

@ -98,7 +98,6 @@ celerybeat-schedule
# Environments
.env
.venv
.venv*
env/
venv/
ENV/
@ -172,6 +171,3 @@ tags
# modular conversion
*.modular_backup
# Cursor IDE files
.cursor/

View File

@ -14,7 +14,7 @@ This AGENTS.md file provides guidance for code agents working with this codebase
- PRs should be as brief as possible. Bugfix PRs in particular can often be only one or two lines long, and do not need large comments, docstrings or new functions in this case. Aim to minimize the size of the diff.
- When writing tests, they should be added to an existing file. The only exception is for PRs to add a new model, when a new test directory should be created for that model.
- Code style is enforced in the CI. You can install the style tools with `pip install -e ".[quality]"`. You can then run `make fixup` to apply style and consistency fixes to your code.
- Code style is enforced in the CI. You can install the style tools with `pip install -e .[quality]`. You can then run `make fixup` to apply style and consistency fixes to your code.
## Copying and inheritance
@ -36,4 +36,4 @@ After making changes, you should usually run `make fixup` to ensure any copies a
the model you made the changes in and any other models that were updated by `make fixup`. Tests can be run with `pytest tests/models/[name]/test_modeling_[name].py`
If your changes affect code in other classes like tokenizers or processors, you should run those tests instead, like `test_processing_[name].py` or `test_tokenization_[name].py`.
In order to run tests, you may need to install dependencies. You can do this with `pip install -e ".[testing]"`. You will probably also need to `pip install torch accelerate` if your environment does not already have them.
In order to run tests, you may need to install dependencies. You can do this with `pip install -e .[testing]`. You will probably also need to `pip install torch accelerate` if your environment does not already have them.

View File

@ -112,126 +112,7 @@ New models are constantly released and if you want to implement a new model, ple
If you are willing to contribute the model yourself, let us know so we can help you add it to 🤗 Transformers!
We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/modular_transformers).
### Vision-Language Model Contribution Checklist
If you're contributing a **vision-language model** (or any multimodal model that processes images/videos), please follow this checklist. Maintainers will use this to review your PR, and completing these steps will significantly increase the likelihood of your PR being merged quickly.
**Required checklist for all vision-language model contributions:**
**1. Implement a modular file**
All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:
- Use the CLI, [`transformers add-new-model-like`](https://github.com/huggingface/transformers/blob/main/src/transformers/cli/add_new_model_like.py) to generate a modular skeleton and get started
- All code should be in the modular file if possible. Modeling must be in it, it's better if configuration is in it as well. [Modular guide](./modular_transformers#implementing-a-modular-file) shows a quick way to set up a modular file.
- Reuse existing patterns from similar models as much as possible
- You can make the model compatible with inference engines such as vLLM or SGLang, and enable zero-effort integration. See specific requirements for model implementation in ["Transformers modeling backend"](./transformers_as_backend#multimodal-models)
To verify your modular file is correct, run:
```bash
python utils/modular_model_converter.py <model_name>
```
This will generate the separate files (`modeling_*.py`, `configuration_*.py`, etc.) from your modular file. The CI will enforce that these generated files match your modular file.
**2. Add a fast image processor (for image models)**
If your model processes images, implement a fast image processor that uses `torch` and `torchvision` instead of PIL/numpy for better inference performance:
- See the detailed guide in [#36978](https://github.com/huggingface/transformers/issues/36978)
- Fast processors inherit from `BaseImageProcessorFast`
- Examples: `LlavaOnevisionImageProcessorFast`, `Idefics2ImageProcessorFast`
**3. Create a weight conversion script**
Add a `convert_<model_name>_to_hf.py` script that converts the original model weights to the HuggingFace format:
- Script should handle checkpoint loading, key mapping, and saving in HF format
- Include usage examples and documentation in the script
- Examples: [`convert_llava_onevision_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py), [`convert_idefics2_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics2/convert_idefics2_weights_to_hf.py)
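A hypothetical skeleton of the key-mapping step such a script performs (the checkpoint keys below are invented for illustration; real scripts like the examples above handle many more keys, then load the remapped weights into the HF model class and call `save_pretrained`):

```python
# Hypothetical sketch of the checkpoint-loading and key-mapping step of a conversion script.
import torch

KEY_MAPPING = {
    "backbone.embed.weight": "model.embed_tokens.weight",  # invented original -> HF key pair
    "backbone.norm.weight": "model.norm.weight",
}


def remap_state_dict(original_ckpt: str) -> dict[str, torch.Tensor]:
    state_dict = torch.load(original_ckpt, map_location="cpu")
    return {KEY_MAPPING.get(key, key): value for key, value in state_dict.items()}
```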
**4. Add integration tests with exact output matching**
At minimum, add an `IntegrationTest` class that tests end-to-end generation (processing and modelling) with **exact** output matching:
- For generative models: test that generated text matches expected output exactly
- For non-generative models: test that output logits match expected values
- Tests should use real checkpoints (load in 4-bit or half precision if the checkpoint is too big to fit in our CI runners) and real inputs
- Example pattern:
```python
class MyModelIntegrationTest(unittest.TestCase):
@slow
def test_model_integration(self):
model = MyModelForConditionalGeneration.from_pretrained("org/model-name")
processor = AutoProcessor.from_pretrained("org/model-name")
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
EXPECTED_TEXT = "exact expected output"
self.assertEqual(processor.decode(output[0]), EXPECTED_TEXT)
```
See `tests/models/llava_onevision/test_modeling_llava_onevision.py` for complete examples.
**5. Update documentation**
Add or update model documentation:
- Create `docs/source/en/model_doc/<model_name>.md` with usage examples (if the CLI hasn't already generated it)
- Include model description, paper link, and basic usage with `Pipeline` and `AutoModel`
- Add the model to the appropriate TOC files
**6. Look for reusable patterns**
The library has 400+ models with many established patterns:
- Search for similar models (e.g., other vision-language models)
- Reuse attention mechanisms, layer implementations, and processing patterns
- Check models like LLaVA, Idefics2, Fuyu for vision-language patterns
- Use provided decorators like (`auto_docstring`, `can_return_tuple`, `check_model_inputs` and `_can_record_outputs`) where relevant.
- Don't reinvent the wheel
**7. Run quality checks and read the output**
Before submitting your PR, install quality dependencies and run the full check suite:
```bash
pip install -e ".[quality]"
make fixup
```
**Important**: Take time to read the output of `make fixup`. It will:
- Lint and format your code automatically
- Run consistency checks (imports, docstrings, etc.)
- Show any remaining issues that need manual fixes
All checks must pass before your PR can be merged.
**If this checklist is complete, your PR has a very high likelihood of being merged!** Following these steps makes the maintainers' work much easier and will reduce the number of review iterations, getting your important work out there faster.
#### Copy-pastable checklist for maintainers
Here's a condensed version maintainers can copy into PRs:
```markdown
## Multimodal Model Addition Checklist
Please ensure your PR completes all following items. See the [full checklist](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#vision-language-model-contribution-checklist) for details.
- [ ] **Modular file**: `modular_<model_name>.py` implemented and verified with `python utils/modular_model_converter.py <model_name>`
- [ ] **Fast image processor**: Implemented using `BaseImageProcessorFast` (see [#36978](https://github.com/huggingface/transformers/issues/36978))
- [ ] **Conversion script**: `convert_<model_name>_to_hf.py` added with usage examples
- [ ] **Integration tests**: End-to-end tests with exact output matching (text or logits)
- [ ] **Documentation**: Model docs added/updated in `docs/source/en/model_doc/`
- [ ] **Pattern reuse**: Verified against similar models (LLaVA, Idefics2, etc.)
- [ ] **Quality checks**: `make fixup` passes with no errors
```
We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/add_new_model).
## Do you want to add documentation?

View File

@ -45,7 +45,6 @@ repo-consistency:
python utils/check_modular_conversion.py
python utils/check_dummies.py
python utils/check_repo.py
python utils/check_init_weights_data.py
python utils/check_inits.py
python utils/check_pipeline_typing.py
python utils/check_config_docstrings.py

View File

@ -64,8 +64,8 @@ limitations under the License.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>
Transformers acts as the model-definition framework for state-of-the-art machine learning with text, computer
vision, audio, video, and multimodal models, for both inference and training.
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal model, for both inference and training.
It centralizes the model definition so that this definition is agreed upon across the ecosystem. `transformers` is the
pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training

View File

@ -9,12 +9,6 @@ In this list, we showcase incredibly impactful and novel projects that have push
adding other projects to the list. If you believe a project should be here and it's not, then please, open a PR
to add it.
## [◉ Universal Intelligence](https://github.com/blueraai/universal-intelligence)
[Universal Intelligence](https://github.com/blueraai/universal-intelligence) aims to standardize models, tools, and agents —transforming them into simple, composable, portable, interoperable, framework-agnostic, hardware-agnostic interfaces (through auto-negotiation and resource sharing); for fast and accessible development of AI applications.
Keywords: Protocol, Open-source, LLMs, Large Language Models, Agents, Low-code
## [gpt4all](https://github.com/nomic-ai/gpt4all)
[gpt4all](https://github.com/nomic-ai/gpt4all) is an ecosystem of open-source chatbots trained on massive collections of clean assistant data including code, stories and dialogue. It offers open-source, large language models such as LLaMA and GPT-J trained in an assistant-style.

View File

@ -16,6 +16,7 @@ import sys
from logging import Logger
from threading import Event, Thread
from time import perf_counter, sleep
from typing import Optional
# Add the parent directory to Python path to import benchmarks_entrypoint
@ -41,7 +42,7 @@ except ImportError:
GenerationConfig = None
StaticCache = None
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "1"
# Only set torch precision if torch is available
@ -144,7 +145,7 @@ def run_benchmark(
q = torch.empty_like(probs_sort).exponential_(1)
return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
def logits_to_probs(logits, temperature: float = 1.0, top_k: int | None = None):
def logits_to_probs(logits, temperature: float = 1.0, top_k: Optional[int] = None):
logits = logits / max(temperature, 1e-5)
if top_k is not None:
@ -154,7 +155,7 @@ def run_benchmark(
probs = torch.nn.functional.softmax(logits, dim=-1)
return probs
def sample(logits, temperature: float = 1.0, top_k: int | None = None):
def sample(logits, temperature: float = 1.0, top_k: Optional[int] = None):
probs = logits_to_probs(logits[0, -1], temperature, top_k)
idx_next = multinomial_sample_one_no_sync(probs)
return idx_next, probs
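The `multinomial_sample_one_no_sync` body shown above (divide the probabilities by Exponential(1) noise, take the argmax) is the exponential-race form of the Gumbel-max trick: it draws an index with probability proportional to `probs` without the device synchronization that `torch.multinomial` triggers. A standalone sanity check of that property, with made-up probabilities:

```python
# Standalone check of the sync-free sampling trick used above (illustrative values).
import torch

torch.manual_seed(0)
probs = torch.tensor([0.1, 0.2, 0.7])

counts = torch.zeros_like(probs)
for _ in range(20_000):
    q = torch.empty_like(probs).exponential_(1)
    counts[torch.argmax(probs / q)] += 1

print(counts / counts.sum())  # approaches tensor([0.1, 0.2, 0.7])
```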

View File

@ -1,5 +1,6 @@
gpustat==1.1.1
psutil==6.0.0
psycopg2==2.9.9
hf_xet
pandas>=1.5.0
torch>=2.4.0
hf_transfer
pandas>=1.5.0

View File

@ -1,10 +1,7 @@
import hashlib
import itertools
import json
import logging
from typing import Any
from transformers.utils.import_utils import is_flash_attn_2_available
from typing import Any, Optional
KERNELIZATION_AVAILABLE = False
@ -21,38 +18,26 @@ logger = logging.getLogger(__name__)
class BenchmarkConfig:
"""Configuration for a single benchmark scenario."""
all_attn_implementations = [
("flash_attention_2", None),
("eager", None),
("sdpa", "math"),
("sdpa", "flash_attention"),
("flex_attention", None),
]
all_compiled_modes = [None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"]
def __init__(
self,
warmup_iterations: int = 5,
measurement_iterations: int = 20,
gpu_monitoring: bool = True, # NOTE: you may want to disable this at times as we have observed it could heavily slow down benchmarks on AMD
continuous_batching: bool = False,
gpu_monitoring: bool = False, # False by default because it slows down the benchmark by a lot
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
attn_implementation: str = "eager",
sdpa_backend: str | None = None,
compile_mode: str | None = None,
compile_options: dict[str, Any] | None = None,
sdpa_backend: Optional[str] = None,
compile_mode: Optional[str] = None,
compile_options: Optional[dict[str, Any]] = None,
kernelize: bool = False,
name: str | None = None,
name: Optional[str] = None,
skip_validity_check: bool = False,
) -> None:
# Benchmark parameters
self.warmup_iterations = warmup_iterations
self.measurement_iterations = measurement_iterations
self.gpu_monitoring = gpu_monitoring
self.continuous_batching = continuous_batching
# Input parameters
self.batch_size = batch_size
self.sequence_length = sequence_length
@ -74,35 +59,12 @@ class BenchmarkConfig:
def check_validity(self, skip_validity_check: bool = False) -> None:
if skip_validity_check:
return
# Check FA is installed
if self.attn_implementation == "flash_attention_2" and not is_flash_attn_2_available():
logger.warning(
"Flash attention does not support compile mode. Defaulting to SDPA w/ flash attention backend."
)
self.attn_implementation = "sdpa"
self.sdpa_backend = "flash_attention"
# Flash attention does not support compile mode, so we turn it off # FIXME: it would be better to support it
is_fa = self.attn_implementation == "flash_attention_2"
is_fa |= self.attn_implementation == "sdpa" and self.sdpa_backend == "flash_attention"
if is_fa:
logger.warning("Flash attention does not support compile mode. Turning off compile mode.")
self.compile_mode = None
# Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
if self.attn_implementation == "sdpa" and self.sdpa_backend is None:
default_backend = "flash_attention" # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
self.sdpa_backend = default_backend
if self.continuous_batching:
if self.attn_implementation == "flex_attention":
logger.error(
"disabling continuous batching because of invalid configuration: flex attention is not supported"
)
self.continuous_batching = False
elif self.attn_implementation == "sdpa" and self.sdpa_backend is not None:
logger.warning(
"when continuous batching is enabled, sdpa_backend must be None because of the attention mask, setting it to None"
)
self.sdpa_backend = "math"
@property
def hash(self) -> str:
@ -118,7 +80,6 @@ class BenchmarkConfig:
attn_code += f"_{self.sdpa_backend}" if self.attn_implementation == "sdpa" else ""
compile_str = f"compiled_{self.compile_mode}" if self.compile_mode is not None else "uncompiled"
kernelize_str = "kernelized" if self.kernelize else "unkernelized"
continuous_batching_str = "cb" if self.continuous_batching else "generate"
sep = "-"
else:
iter_str = f"{self.warmup_iterations} warmup, {self.measurement_iterations} iterations"
@ -128,11 +89,8 @@ class BenchmarkConfig:
attn_code += f" with {self.sdpa_backend} backend" if self.attn_implementation == "sdpa" else ""
compile_str = "compiled" if self.compile_mode is not None else "not compiled"
kernelize_str = "kernelized" if self.kernelize else "not kernelized"
continuous_batching_str = "continuous batching" if self.continuous_batching else "regular generate"
sep = ", "
return sep.join(
[iter_str, gpu_monitor_str, dimensions_str, attn_code, compile_str, kernelize_str, continuous_batching_str]
)
return sep.join([iter_str, gpu_monitor_str, dimensions_str, attn_code, compile_str, kernelize_str])
def to_dict(self) -> dict[str, Any]:
return {
@ -140,14 +98,13 @@ class BenchmarkConfig:
"warmup_iterations": self.warmup_iterations,
"measurement_iterations": self.measurement_iterations,
"gpu_monitoring": self.gpu_monitoring,
"continuous_batching": self.continuous_batching,
"batch_size": self.batch_size,
"sequence_length": self.sequence_length,
"num_tokens_to_generate": self.num_tokens_to_generate,
"attn_implementation": self.attn_implementation,
"sdpa_backend": self.sdpa_backend,
"compile_mode": self.compile_mode,
"compile_options": self.compile_options | {}, # to avoid inplace modification of the original dict
"compile_options": self.compile_options,
"kernelize": self.kernelize,
}
@ -157,7 +114,6 @@ class BenchmarkConfig:
warmup_iterations=data.get("warmup_iterations", 5),
measurement_iterations=data.get("measurement_iterations", 20),
gpu_monitoring=data.get("gpu_monitoring", False),
continuous_batching=data.get("continuous_batching", False),
batch_size=data.get("batch_size", 1),
sequence_length=data.get("sequence_length", 128),
num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
@ -171,72 +127,92 @@ class BenchmarkConfig:
)
def adapt_configs(
configs: list[BenchmarkConfig],
warmup_iterations: int | list[int] = 5,
measurement_iterations: int | list[int] = 20,
batch_size: int | list[int] = 1,
sequence_length: int | list[int] = 128,
num_tokens_to_generate: int | list[int] = 128,
gpu_monitoring: bool | list[bool] = True,
def cross_generate_configs(
attn_impl_and_sdpa_backend: list[tuple[str, Optional[str]]],
compiled_mode: list[Optional[str]],
kernelized: list[bool],
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = False, # this slows down the benchmark by a lot so we disable it by default
) -> list[BenchmarkConfig]:
parameters = (
x if isinstance(x, list) else [x]
for x in [
warmup_iterations,
measurement_iterations,
batch_size,
sequence_length,
num_tokens_to_generate,
gpu_monitoring,
]
)
iterator = itertools.product(*parameters)
adapted_configs = []
for warmup_iters, measurement_iters, bs, seqlen, ntok, monitor in iterator:
for config in configs:
config = config.to_dict()
config["warmup_iterations"] = warmup_iters
config["measurement_iterations"] = measurement_iters
config["batch_size"] = bs
config["sequence_length"] = seqlen
config["num_tokens_to_generate"] = ntok
config["gpu_monitoring"] = monitor
adapted_configs.append(BenchmarkConfig.from_dict(config))
return adapted_configs
def get_config_by_level(level: int) -> list[BenchmarkConfig]:
# Create kwargs common to all configs
kwargs = {
"warmup_iterations": warmup_iterations,
"measurement_iterations": measurement_iterations,
"batch_size": batch_size,
"sequence_length": sequence_length,
"num_tokens_to_generate": num_tokens_to_generate,
"gpu_monitoring": gpu_monitoring,
}
# Cross-generate all combinations of attn_implementation, compiled_mode, and kernelized
configs = []
# Early return if level is greater than 3: we generate all combinations of configs, maybe even w/ all compile modes
if level >= 3:
for attn_implementation, sdpa_backend in BenchmarkConfig.all_attn_implementations:
# Usually there is not much to gain by compiling with other modes, but we allow it for level 4
compile_modes = BenchmarkConfig.all_compiled_modes if level >= 4 else [None, "default"]
for cm in compile_modes:
for kernelize_on in {False, KERNELIZATION_AVAILABLE}:
for cb_on in [False, True]:
configs.append(
BenchmarkConfig(
attn_implementation=attn_implementation,
sdpa_backend=sdpa_backend,
compile_mode=cm,
kernelize=kernelize_on,
continuous_batching=cb_on,
)
)
return configs
# Otherwise, we add the configs for the given level
if level >= 0:
configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default"))
if level >= 1:
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2"))
configs.append(BenchmarkConfig(attn_implementation="eager", compile_mode="default"))
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", continuous_batching=True))
if level >= 2:
configs.append(BenchmarkConfig(attn_implementation="sdpa", compile_mode="default"))
configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", kernelize=True))
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", kernelize=True))
configs.append(BenchmarkConfig(attn_implementation="paged|sdpa", continuous_batching=True))
for attn_implementation, sdpa_backend in list(dict.fromkeys(attn_impl_and_sdpa_backend)):
for cm in list(dict.fromkeys(compiled_mode)):
for kernelize_on in list(dict.fromkeys(kernelized)):
config = BenchmarkConfig(
attn_implementation=attn_implementation,
sdpa_backend=sdpa_backend,
compile_mode=cm,
kernelize=kernelize_on,
**kwargs,
)
configs.append(config)
return configs
def generate_all_configs(
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = False,
) -> list[BenchmarkConfig]:
all_attn_implementations = [
("flash_attention_2", None),
("eager", None),
("sdpa", "math"),
("sdpa", "flash_attention"),
("flex_attention", None),
]
return cross_generate_configs(
attn_impl_and_sdpa_backend=all_attn_implementations,
compiled_mode=[None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"],
kernelized=[False, KERNELIZATION_AVAILABLE],
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
batch_size=batch_size,
sequence_length=sequence_length,
num_tokens_to_generate=num_tokens_to_generate,
gpu_monitoring=gpu_monitoring,
)
def generate_default_configs(
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = False,
) -> list[BenchmarkConfig]:
all_attn_implementations = [
("flash_attention_2", None),
("eager", None),
("sdpa", "math"),
("sdpa", "flash_attention"), # note: this one can fail with compile because of attn mask
]
return cross_generate_configs(
attn_impl_and_sdpa_backend=all_attn_implementations,
compiled_mode=[None, "max-autotune"],
kernelized=[False, KERNELIZATION_AVAILABLE],
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
batch_size=batch_size,
sequence_length=sequence_length,
num_tokens_to_generate=num_tokens_to_generate,
gpu_monitoring=gpu_monitoring,
)
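Whichever side of this diff you read, the file builds its benchmark scenarios by crossing independent dimensions: attention implementation, compile mode, kernelization, and (on one side) input shape and continuous batching. A toy sketch of that cross-product with plain dicts, to make the combinatorics concrete (names and values here are illustrative):

```python
# Toy cross-product of benchmark dimensions, mirroring the structure above (illustrative only).
import itertools

attn_impls = ["eager", "sdpa", "flex_attention"]
compile_modes = [None, "default"]
kernelized = [False, True]

scenarios = [
    {"attn_implementation": attn, "compile_mode": mode, "kernelize": kern}
    for attn, mode, kern in itertools.product(attn_impls, compile_modes, kernelized)
]
print(len(scenarios))  # 3 * 2 * 2 = 12 combinations
```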

View File

@ -4,16 +4,13 @@ import logging
import os
import pathlib
import re
import tempfile
import time
from contextlib import nullcontext
from datetime import datetime
from queue import Queue
from typing import Any
from typing import Any, Optional
import torch
from datasets import Dataset
from huggingface_hub import HfApi
from tqdm import trange
from transformers import (
@ -53,8 +50,6 @@ DEFAULT_PROMPT = "\n".join([
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
]) # fmt: skip
PUSH_TO_HUB_TOKEN = os.getenv("PUSH_TO_HUB_TOKEN", None)
def compact_json_numeric_arrays(data: dict):
# Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
@ -79,7 +74,7 @@ def get_git_revision() -> str:
return git_hash.readline().strip()
def get_sdpa_backend(backend_name: str | None) -> torch.nn.attention.SDPBackend | None:
def get_sdpa_backend(backend_name: Optional[str]) -> Optional[torch.nn.attention.SDPBackend]:
"""Get the SDPA backend enum from string name."""
if backend_name is None:
return None
@ -117,25 +112,23 @@ def flush_memory():
# Clear CUDA cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
gc.collect()
class BenchmarkStreamer(BaseStreamer):
def __init__(self, **kwargs) -> None:
self.timeout = kwargs.pop("timeout", 10)
self.timestamps = []
self.text_queue = Queue()
self.stop_signal = None
def put(self, value):
"""Receives tokens and logs the timestamp of the generation."""
self.timestamps.append(time.perf_counter())
self.text_queue.put(value)
def end(self):
self.timestamps.append(time.perf_counter())
self.text_queue.put(self.stop_signal)
def __iter__(self):
return self
@ -152,34 +145,25 @@ class BenchmarkRunner:
"""Main benchmark runner that coordinates benchmark execution."""
def __init__(
self,
logger: logging.Logger,
output_dir: str | None = None,
branch_name: str | None = None,
commit_id: str | None = None,
commit_message: str | None = None,
self, logger: logging.Logger, output_dir: str = "benchmark_results", commit_id: Optional[str] = None
) -> None:
# Those stay constant for the whole run
self.logger = logger
if output_dir is None:
output_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), "benchmark_results")
self.output_dir = output_dir
self.branch_name = branch_name
self.commit_id = get_git_revision() if commit_id is None else commit_id
self.commit_message = commit_message
os.makedirs(self.output_dir, exist_ok=True)
self.profile_dir = None
# Attributes that are reset for each model
self._setup_for = ""
# Attributes that are reset for each run
self.model: GenerationMixin | None = None
self.model: Optional[GenerationMixin] = None
def cleanup(self) -> None:
del self.model
self.model = None
flush_memory()
def setup_benchmark(self, model_id: str, config: BenchmarkConfig) -> None:
def setup_one_run(self, model_id: str, config: BenchmarkConfig) -> None:
# Some attributes only need to be set once per model
if self._setup_for != model_id:
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
@ -216,13 +200,10 @@ class BenchmarkRunner:
self.model = self.model.eval().to(config.device)
# Kernelize the model if needed
if config.kernelize and kernelize is not None and Mode is not None:
if config.kernelize:
self.model = kernelize(self.model, mode=Mode.INFERENCE)
def run_benchmark(
self, model_id: str, config: BenchmarkConfig, num_tokens_to_profile: int = 0
) -> dict[str, Any] | None:
"""Run a single benchmark with the given model ID and config."""
def run_one_benchmark(self, model_id: str, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> None:
sdpa_ctx = nullcontext()
if config.attn_implementation == "sdpa":
sdpa_backend = get_sdpa_backend(config.sdpa_backend)
@ -232,9 +213,8 @@ class BenchmarkRunner:
self.logger.info(f"Running benchmark scenario: {config.name}")
# Quick validation: try one measurement first to see if this scenario works
generate_fn = self.time_generate_batch if config.continuous_batching else self.time_generate
flush_memory()
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = generate_fn(
e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=1, gpu_monitor=None
)
if e2e_latency < 0:
@ -244,18 +224,18 @@ class BenchmarkRunner:
# Warmup runs
self.logger.info(f"Warming up with {config.warmup_iterations} iterations...")
for _ in trange(config.warmup_iterations):
_ = generate_fn(max_new_tokens=config.num_tokens_to_generate)
_ = self.time_generate(max_new_tokens=config.num_tokens_to_generate)
self.logger.info("Warmup over.")
# Measurement runs
result = BenchmarkResult()
self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
for _ in trange(config.measurement_iterations):
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = generate_fn(
e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=config.num_tokens_to_generate,
gpu_monitor=(GPUMonitor(logger=self.logger) if config.gpu_monitoring else None),
)
result.accumulate(e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics)
result.accumulate(e2e_latency, token_generation_times, decoded_output, gpu_metrics)
self.logger.info("Benchmarking done. Cleaning up.")
# Profile if needed
@ -263,73 +243,16 @@ class BenchmarkRunner:
self.profile_generate(num_tokens_to_profile, config.name)
return {
"metadata": BenchmarkMetadata(
model_id=model_id,
branch_name=self.branch_name,
commit_id=self.commit_id,
commit_message=self.commit_message,
),
"metadata": BenchmarkMetadata(model_id=model_id, commit_id=self.commit_id),
"measurements": result,
"config": config,
}
# TODO: refactor `generate_batch` to handle streaming so we can use it here
def time_generate_batch(
self,
max_new_tokens: int,
gpu_monitor: GPUMonitor | None = None,
) -> tuple[float, list[float], str, GPURawMetrics | None]:
if gpu_monitor is not None:
gpu_monitor.start()
config = GenerationConfig(
max_new_tokens=max_new_tokens,
eos_token_id=self.tokenizer.eos_token_id,
pad_token_id=self.tokenizer.pad_token_id,
do_sample=True,
)
manager = self.model.init_continuous_batching(config)
manager.start()
try:
first_req_results = []
timestamps = []
wall_time_0 = time.perf_counter()
inputs = self.inputs["input_ids"].tolist()
manager.add_requests(inputs, max_new_tokens=max_new_tokens, streaming=True)
first_req_id = None
num_requests = len(inputs)
finished_requests = 0
while finished_requests < num_requests:
# NOTE: I don't like having the extra if stmt here, but hopefully won't degrade perf too much
result = manager.get_result()
if result:
timestamps.append(time.perf_counter() - wall_time_0)
if result.is_finished():
finished_requests += 1
if first_req_id is None:
first_req_id = result.request_id
if result.request_id == first_req_id:
first_req_results.append(result)
else:
if not manager.is_running():
raise RuntimeError("Generation thread exited unexpectedly")
wall_time_1 = time.perf_counter()
gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None
decoded_output = self.tokenizer.decode(
[res.generated_tokens[0] for res in first_req_results], skip_special_tokens=True
)
shape_and_decoded_output = f"{(1, len(first_req_results))} | {decoded_output}"
e2e_latency = wall_time_1 - wall_time_0
return e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics
except Exception as e:
raise e
finally:
manager.stop()
def time_generate(
self,
max_new_tokens: int,
gpu_monitor: GPUMonitor | None = None,
) -> tuple[float, list[float], str, GPURawMetrics | None]:
gpu_monitor: Optional[GPUMonitor] = None,
) -> tuple[float, list[float], str, Optional[GPURawMetrics]]:
"""Time the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
# Prepare gpu monitoring if needed
if gpu_monitor is not None:
@ -354,11 +277,10 @@ class BenchmarkRunner:
raise RuntimeError(f"Generated {new_tokens} tokens, expected {max_new_tokens}")
# Decode outputs
decoded_output = self.tokenizer.decode(outputs[0, input_tokens:], skip_special_tokens=True)
shape_and_decoded_output = f"{tuple(outputs.shape)} | {decoded_output}"
# Compute intermediate quantities
e2e_latency = wall_time_1 - wall_time_0
token_generation_times = [t - wall_time_0 for t in streamer.timestamps[1:]]
return e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics
return e2e_latency, token_generation_times, decoded_output, gpu_metrics
def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
"""Profile the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
@ -382,28 +304,33 @@ class BenchmarkRunner:
benchmark_configs: list[BenchmarkConfig],
num_tokens_to_profile: int = 0,
pretty_print_summary: bool = True,
) -> tuple[str, dict[str, Any]]:
"""Run multiple benchmarks for the given model ID and list of benchmark configs."""
) -> dict[str, Any]:
all_results = {}
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
start_time = time.perf_counter()
n_configs = len(benchmark_configs)
for i, config in enumerate(benchmark_configs):
# Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
if config.attn_implementation == "sdpa" and config.sdpa_backend is None:
default_backend = "flash_attention" # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
self.logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
config.sdpa_backend = default_backend
# Skip if already run
if config.hash in all_results:
self.logger.info(f"Skipping duplicate config {config.name} for model {model_id} ({i + 1}/{n_configs})")
continue
# Otherwise, run the benchmark
self.setup_benchmark(model_id, config)
self.setup_one_run(model_id, config)
self.logger.info(
f"Running benchmark of model {model_id} with scenario: {config.name} ({i + 1}/{n_configs})"
)
# Launch benchmark in a try/except block to avoid stopping the whole run if one benchmark fails
try:
results = self.run_benchmark(model_id, config, num_tokens_to_profile)
results = self.run_one_benchmark(model_id, config, num_tokens_to_profile)
if results is not None:
all_results[config.hash] = results
@ -413,30 +340,24 @@ class BenchmarkRunner:
self.cleanup()
self.save_results(model_id, all_results, timestamp=timestamp)
if len(all_results) < 1:
raise RuntimeError("No benchmark was run successfully")
if pretty_print_summary:
print()
print("=" * 100)
print(f"Finished benchmarks in {time.perf_counter() - start_time:.2f} seconds")
print(f"Total number of benchmarks: {len(all_results)}")
print("First run metadata:")
first_key = list(all_results.keys())[0]
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
for result in all_results.values():
if len(all_results) > 0:
print("First run metadata:")
first_key = list(all_results.keys())[0]
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
for value in all_results.values():
print("=" * 100)
print(f"Config: {result['config'].infer_name(compact=False)}\n")
result["measurements"].pprint(
batch_size=result["config"].batch_size,
num_generated_tokens=result["config"].num_tokens_to_generate,
tabs=1,
)
print(f"Config: {value['config'].infer_name(compact=False)}\n")
value["measurements"].pprint(tabs=1)
print("=" * 100)
return (timestamp, all_results)
return all_results
def save_results(self, model_name: str, results: dict, timestamp: str = "") -> str:
"""Save benchmark results to JSON file."""
@ -465,43 +386,3 @@ class BenchmarkRunner:
self.logger.info(f"Results saved to {filepath}")
return filepath
def push_results_to_hub(self, dataset_id: str, results: dict[Any, Any], timestamp: str) -> None:
if PUSH_TO_HUB_TOKEN is None:
raise ValueError(
"PUSH_TO_HUB_TOKEN is not set, cannot push results to the Hub. When setting dataset_id, please also set the PUSH_TO_HUB_TOKEN environment variable."
)
n_results = len(results)
self.logger.info(f"Pushing {n_results} results to: {dataset_id}")
rows = []
for cfg_hash, entry in results.items():
row = {
"benchmark_config_hash": cfg_hash,
"config": entry["config"].to_dict(),
"measurements": entry["measurements"].to_dict(),
"metadata": entry["metadata"].to_dict(),
}
rows.append(row)
ds = Dataset.from_list(rows)
with tempfile.TemporaryDirectory() as tmp:
jsonl_path = os.path.join(tmp, "data.jsonl")
with open(jsonl_path, "w") as f:
json_lines = []
for ex in ds:
json_lines.append(json.dumps(ex, ensure_ascii=False))
f.write("\n".join(json_lines))
api = HfApi()
# NOTE: we expect the repository to already exist
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") if not timestamp else timestamp
file_name = f"benchmark_run_{timestamp}.jsonl"
api.upload_file(
path_or_fileobj=jsonl_path,
path_in_repo=file_name,
repo_id=dataset_id,
repo_type="dataset",
token=PUSH_TO_HUB_TOKEN,
)
self.logger.info(f"Successfully uploaded results to: {dataset_id}")
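For orientation, a minimal driver for the runner defined in this file, modeled on the entrypoint shown later in this compare; the exact constructor keywords differ between the two sides of the diff, so treat the call below as a sketch (the model id and output directory are placeholders):

```python
# Minimal driver sketch for BenchmarkRunner (placeholder model id and paths).
import logging

from framework.benchmark_config import BenchmarkConfig
from framework.benchmark_runner import BenchmarkRunner

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("benchmark")

configs = [BenchmarkConfig(attn_implementation="eager", batch_size=1, sequence_length=128)]
runner = BenchmarkRunner(logger, output_dir="benchmark_results")
runner.run_benchmarks("openai-community/gpt2", configs, num_tokens_to_profile=0)
```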

View File

@ -1,6 +1,6 @@
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any
from datetime import datetime
from typing import Any, Optional, Union
import numpy as np
@ -36,17 +36,16 @@ def add_unit_to_duration(stats: dict[str, float]) -> dict[str, str]:
return stats
def equalize_lengths_and_collate(stats: dict[str, dict[str, str]]) -> dict[str, str]:
"""Note: This operation is destructive as it will update values in place before returning a new correctly formatted dict"""
def equalize_lengths_and_collate(stats: list[dict[str, str]]) -> list[str]:
keys = ["avg", "std", "min", "med", "max", "p95"]
for key in keys:
max_length = max(len(stat[key]) for stat in stats.values())
for stat in stats.values():
max_length = max(len(stat[key]) for stat in stats)
for stat in stats:
stat[key] = stat[key].ljust(max_length, " ")
return {name: " ".join([f"{key}={stat[key]}" for key in keys]) for name, stat in stats.items()}
return [" ".join([f"{key}={stat[key]}" for key in keys]) for stat in stats]
def pretty_print_dict(data: dict[str, str], tabs: int = 0) -> None:
def pretty_print_dict(data: dict[str, Any], tabs: int = 0) -> None:
max_key_length = max([len(key) for key in data.keys()])
for key, value in data.items():
tabs_str = " " * tabs
@ -60,26 +59,19 @@ class BenchmarkMetadata:
model_id: str
timestamp: str
branch_name: str
commit_id: str
commit_message: str
hardware_info: HardwareInfo
def __init__(self, model_id: str, commit_id: str, branch_name: str = "main", commit_message: str = "") -> None:
def __init__(self, model_id: str, commit_id: str):
self.model_id = model_id
self.timestamp = datetime.now(timezone.utc).isoformat()
self.branch_name = branch_name
self.timestamp = datetime.utcnow().isoformat()
self.commit_id = commit_id
self.commit_message = commit_message
self.hardware_info = HardwareInfo()
def to_dict(self) -> dict[str, Any]:
return {
"model_id": self.model_id,
"timestamp": self.timestamp,
"branch_name": self.branch_name,
"commit_id": self.commit_id,
"commit_message": self.commit_message,
"hardware_info": self.hardware_info.to_dict(),
}
@ -90,22 +82,22 @@ class BenchmarkResult:
def __init__(self) -> None:
self.e2e_latency = []
self.token_generation_times = [] # time at which each token was generated (relative to start of the generation)
self.shape_and_decoded_outputs = []
self.decoded_outputs = []
self.gpu_metrics = []
def accumulate(
self,
e2e_latency: float,
token_generation_times: list[float],
shape_and_decoded_output: str,
gpu_metrics: GPURawMetrics | None,
decoded_output: str,
gpu_metrics: Optional[GPURawMetrics],
) -> None:
self.e2e_latency.append(e2e_latency)
self.token_generation_times.append(token_generation_times)
self.shape_and_decoded_outputs.append(shape_and_decoded_output)
self.decoded_outputs.append(decoded_output)
self.gpu_metrics.append(gpu_metrics)
def to_dict(self) -> dict[str, None | int | float]:
def to_dict(self) -> dict[str, Union[None, int, float]]:
# Save GPU metrics as None if it contains only None values
if all(gm is None for gm in self.gpu_metrics):
gpu_metrics = None
@ -114,12 +106,12 @@ class BenchmarkResult:
return {
"e2e_latency": self.e2e_latency,
"token_generation_times": self.token_generation_times,
"shape_and_decoded_outputs": self.shape_and_decoded_outputs,
"decoded_outputs": self.decoded_outputs,
"gpu_metrics": gpu_metrics,
}
@classmethod
def from_dict(cls, data: dict[str, None | int | float]) -> "BenchmarkResult":
def from_dict(cls, data: dict[str, Union[None, int, float]]) -> "BenchmarkResult":
# Handle GPU metrics, which is saved as None if it contains only None values
if data["gpu_metrics"] is None:
gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]
@ -131,7 +123,7 @@ class BenchmarkResult:
new_instance.accumulate(
e2e_latency=data["e2e_latency"][i],
token_generation_times=data["token_generation_times"][i],
shape_and_decoded_output=data["shape_and_decoded_outputs"][i],
decoded_output=data["decoded_output"][i],
gpu_metrics=gpu_metrics[i],
)
return new_instance
@ -142,19 +134,19 @@ class BenchmarkResult:
def get_measured_itl(self) -> list[float]:
return [(dt[-1] - dt[0]) / (len(dt) - 1) for dt in self.token_generation_times if len(dt) > 1]
def get_throughput(self, total_generated_tokens: int) -> list[float]:
return [total_generated_tokens / e2e_latency for e2e_latency in self.e2e_latency]
def pprint(self, batch_size: int = 0, num_generated_tokens: int = 0, tabs: int = 0) -> None:
measurements = {
"E2E Latency": add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
"Time to First Token": add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
}
itl_values = self.get_measured_itl()
if len(itl_values) > 0:
measurements["Inter-Token Latency"] = add_unit_to_duration(compute_basic_statistics(itl_values))
if batch_size > 0:
throughput_stats = compute_basic_statistics(self.get_throughput(batch_size * num_generated_tokens))
measurements["Throughput"] = {key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()}
dict_to_pprint = equalize_lengths_and_collate(measurements)
pretty_print_dict(dict_to_pprint, tabs=tabs)
def pprint(self, tabs: int = 0) -> None:
collated_stats = equalize_lengths_and_collate(
[
add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
]
)
pretty_print_dict(
{
"E2E Latency": collated_stats[0],
"Time to First Token": collated_stats[1],
"Inter-Token Latency": collated_stats[2],
},
tabs=tabs,
)
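To make the statistics above concrete, here is a tiny worked example with invented timestamps for one 5-token generation, taking TTFT as the first entry of `token_generation_times` and using the inter-token-latency and throughput formulas from this file:

```python
# Worked example with invented numbers: one run that produced 5 tokens.
token_times = [0.20, 0.25, 0.31, 0.36, 0.42]  # seconds since generation started
e2e_latency = 0.45                            # seconds for the whole generate() call

ttft = token_times[0]                                               # 0.20 s to the first token
itl = (token_times[-1] - token_times[0]) / (len(token_times) - 1)   # (0.42 - 0.20) / 4 = 0.055 s
throughput = len(token_times) / e2e_latency                         # 5 / 0.45 ~ 11.1 tok/s

print(f"TTFT={ttft:.3f}s  ITL={itl:.3f}s  throughput={throughput:.1f} tok/s")
```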

View File

@ -7,6 +7,7 @@ import time
from dataclasses import dataclass
from enum import Enum
from logging import Logger
from typing import Optional, Union
import gpustat
import psutil
@ -41,7 +42,7 @@ class HardwareInfo:
self.cpu_count = psutil.cpu_count()
self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))
def to_dict(self) -> dict[str, None | int | float | str]:
def to_dict(self) -> dict[str, Union[None, int, float, str]]:
return {
"gpu_name": self.gpu_name,
"gpu_memory_total_gb": self.gpu_memory_total_gb,
@ -108,7 +109,7 @@ class GPURawMetrics:
timestamp_0: float # in seconds
monitoring_status: GPUMonitoringStatus
def to_dict(self) -> dict[str, None | int | float | str]:
def to_dict(self) -> dict[str, Union[None, int, float, str]]:
return {
"utilization": self.utilization,
"memory_used": self.memory_used,
@ -122,7 +123,7 @@ class GPURawMetrics:
class GPUMonitor:
"""Monitor GPU utilization during benchmark execution."""
def __init__(self, sample_interval_sec: float = 0.1, logger: Logger | None = None):
def __init__(self, sample_interval_sec: float = 0.1, logger: Optional[Logger] = None):
self.sample_interval_sec = sample_interval_sec
self.logger = logger if logger is not None else logging.getLogger(__name__)
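The runner above drives this monitor with a `start` / `stop_and_collect` pair around the measured call. A short usage sketch; the import path is a guess based on the file layout, and the sleep stands in for the benchmarked generation:

```python
# Usage sketch for GPUMonitor; the module path is assumed, the workload is a stand-in.
import logging
import time

from framework.benchmark_monitoring import GPUMonitor  # assumed module name

monitor = GPUMonitor(sample_interval_sec=0.1, logger=logging.getLogger("gpu"))
monitor.start()
time.sleep(1.0)  # placeholder for the timed model.generate() call
metrics = monitor.stop_and_collect()
print(metrics.to_dict())
```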

View File

@ -2,5 +2,6 @@ numpy>=1.21.0
psutil>=5.8.0
gpustat>=1.0.0
torch>=2.0.0
transformers>=4.30.0
datasets>=2.10.0
huggingface_hub>=0.16.0
huggingface_hub>=0.16.0

View File

@ -20,50 +20,31 @@ in the ./benches directory, organizing outputs into model-specific subfolders.
import argparse
import logging
import random
import sys
import uuid
from framework.benchmark_config import adapt_configs, get_config_by_level
from framework.benchmark_config import BenchmarkConfig, generate_all_configs
from framework.benchmark_runner import BenchmarkRunner
if __name__ == "__main__":
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", type=str, default=None, help="Output dir for benchmark results")
parser.add_argument("--output-dir", type=str, default="benchmark_results", help="Output dir for benchmark results")
parser.add_argument("--log-level", type=str, choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO")
parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
parser.add_argument("--warmup", "-w", type=int, default=3, help="Number of warmup iterations")
parser.add_argument("--iterations", "-i", type=int, default=10, help="Number of measurement iterations")
parser.add_argument("--warmup", type=int, default=5, help="Number of warmup iterations")
parser.add_argument("--iterations", type=int, default=20, help="Number of measurement iterations")
parser.add_argument("--batch-size", "-b", type=int, nargs="+", help="Batch size")
parser.add_argument("--sequence-length", "-s", type=int, nargs="+", help="Sequence length")
parser.add_argument("--num-tokens-to-generate", "-n", type=int, nargs="+", help="Number of tokens to generate")
parser.add_argument(
"--level",
type=int,
default=1,
help="Level of coverage for the benchmark. 0: only the main config, 1: a few important configs, 2: a config for"
" each attn implementation an option, 3: cross-generate all combinations of configs, 4: cross-generate all"
" combinations of configs w/ all compile modes",
)
parser.add_argument("--num-tokens-to-profile", "-p", type=int, default=0, help="Number of tokens to profile")
parser.add_argument("--branch-name", type=str, help="Git branch name")
parser.add_argument("--commit-id", type=str, help="Git commit ID (if not provided, will auto-detect from git)")
parser.add_argument("--commit-message", type=str, help="Git commit message")
parser.add_argument(
"--no-gpu-monitoring", action="store_true", help="Disables GPU monitoring during benchmark runs"
)
parser.add_argument(
"--push-result-to-dataset",
type=str,
default=None,
help="Name of the dataset to push results to. If not provided, results are not pushed to the Hub.",
)
args = parser.parse_args()
# Setup logging
@ -80,34 +61,51 @@ if __name__ == "__main__":
logger.info(f"Benchmark run UUID: {benchmark_run_uuid}")
logger.info(f"Output directory: {args.output_dir}")
# We cannot compute ITL if we don't have at least two measurements
if any(n <= 1 for n in args.num_tokens_to_generate):
raise ValueError("--num_tokens_to_generate arguments should be larger than 1")
# Error out if one of the arguments is not provided
if len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 0:
raise ValueError(
"At least one of the arguments --batch-size, --sequence-length, or --num-tokens-to-generate is required"
)
# Get the configs for the given coverage level
configs = get_config_by_level(args.level)
# Adapt the configs to the given arguments
configs = adapt_configs(
configs,
args.warmup,
args.iterations,
args.batch_size,
args.sequence_length,
args.num_tokens_to_generate,
not args.no_gpu_monitoring,
)
# If there is only one (batch_size, sequence_length, num_tokens_to_generate), we benchmark across configs
elif len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 1:
benchmark_configs = generate_all_configs(
warmup_iterations=args.warmup,
measurement_iterations=args.iterations,
batch_size=args.batch_size[0],
sequence_length=args.sequence_length[0],
num_tokens_to_generate=args.num_tokens_to_generate[0],
)
random.shuffle(benchmark_configs)
runner = BenchmarkRunner(logger, args.output_dir, args.branch_name, args.commit_id, args.commit_message)
timestamp, results = runner.run_benchmarks(
args.model_id, configs, args.num_tokens_to_profile, pretty_print_summary=True
)
# Otherwise, we benchmark across all combinations of dimensions
else:
kwargs = {
"warmup_iterations": args.warmup,
"measurement_iterations": args.iterations,
"gpu_monitoring": False,
"batch_size": args.batch_size[0],
"sequence_length": args.sequence_length[0],
"num_tokens_to_generate": args.num_tokens_to_generate[0],
"attn_implementation": "flex_attention",
"sdpa_backend": None,
"compile_mode": "default",
"kernelize": False,
}
benchmark_configs = []
for num_tokens_to_generate in args.num_tokens_to_generate:
for sequence_length in args.sequence_length:
for batch_size in args.batch_size:
kwargs["batch_size"] = batch_size
kwargs["sequence_length"] = sequence_length
kwargs["num_tokens_to_generate"] = num_tokens_to_generate
benchmark_configs.append(BenchmarkConfig(**kwargs))
dataset_id = args.push_result_to_dataset
if dataset_id is not None and len(results) > 0:
runner.push_results_to_hub(dataset_id, results, timestamp)
runner = BenchmarkRunner(logger, args.output_dir, args.commit_id)
results = runner.run_benchmarks(
args.model_id,
benchmark_configs[:3],
args.num_tokens_to_profile,
pretty_print_summary=True,
)
# runner.save_results(args.model_id, results)

View File

@ -58,6 +58,7 @@ NOT_DEVICE_TESTS = {
"test_model_get_set_embeddings",
"test_model_main_input_name",
"test_correct_missing_keys",
"test_tie_model_weights",
"test_can_use_safetensors",
"test_load_save_without_tied_weights",
"test_tied_weights_keys",
@ -87,8 +88,6 @@ def pytest_configure(config):
config.addinivalue_line("markers", "not_device_test: mark the tests always running on cpu")
config.addinivalue_line("markers", "torch_compile_test: mark test which tests torch compile functionality")
config.addinivalue_line("markers", "torch_export_test: mark test which tests torch export functionality")
config.addinivalue_line("markers", "flash_attn_test: mark test which tests flash attention functionality")
config.addinivalue_line("markers", "flash_attn_3_test: mark test which tests flash attention 3 functionality")
os.environ["DISABLE_SAFETENSORS_CONVERSION"] = "true"

View File

@ -5,7 +5,7 @@ ARG REF=main
RUN apt-get update && apt-get install -y time git g++ pkg-config make git-lfs
ENV UV_PYTHON=/usr/local/bin/python
RUN pip install uv && uv pip install --no-cache-dir -U pip setuptools GitPython
RUN uv pip install --no-cache-dir --upgrade 'torch' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir --upgrade 'torch<2.9' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir pypi-kenlm
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[quality,testing,torch-speech,vision]"
RUN git lfs install

View File

@ -17,7 +17,7 @@ RUN make install -j 10
WORKDIR /
RUN uv pip install --no-cache --upgrade 'torch' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache --upgrade 'torch<2.9' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir --no-deps accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[ja,testing,sentencepiece,spacy,ftfy,rjieba]" unidic unidic-lite
# spacy is not used so not tested. Causes failures. TODO fix later

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git-lfs ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]" seqeval albumentations jiwer

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git libgl1 g++ tesseract-ocr git-lfs curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir --no-deps timm accelerate
RUN uv pip install -U --no-cache-dir pytesseract python-Levenshtein opencv-python nltk
# RUN uv pip install --no-cache-dir natten==0.15.1+torch210cpu -f https://shi-labs.com/natten/wheels

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git pkg-config openssh-client git ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]"

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git-lfs ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing,tiktoken,num2words,video]"

View File

@ -9,15 +9,10 @@ SHELL ["sh", "-lc"]
# The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant
# to be used as arguments for docker build (so far).
ARG PYTORCH='2.9.0'
ARG PYTORCH='2.8.0'
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu126'
# This needs to be compatible with the above `PYTORCH`.
ARG TORCHCODEC='0.8.0'
ARG FLASH_ATTN='false'
RUN apt update
RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg git-lfs
RUN git lfs install
@ -26,48 +21,14 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip
ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev]
# 1. Put several commands in a single `RUN` to avoid image/layer exporting issue. Could be revised in the future.
# 2. For `torchcodec`, use `cpu` as we don't have `libnvcuvid.so` on the host runner. See https://github.com/meta-pytorch/torchcodec/issues/912
# **Important**: We need to specify `torchcodec` version if the torch version is not the latest stable one.
# 3. `set -e` means "exit immediately if any command fails".
RUN set -e; \
# Determine torch version
if [ ${#PYTORCH} -gt 0 ] && [ "$PYTORCH" != "pre" ]; then \
VERSION="torch==${PYTORCH}.*"; \
TORCHCODEC_VERSION="torchcodec==${TORCHCODEC}.*"; \
else \
VERSION="torch"; \
TORCHCODEC_VERSION="torchcodec"; \
fi; \
\
# Log the version being installed
echo "Installing torch version: $VERSION"; \
\
# Install PyTorch packages
if [ "$PYTORCH" != "pre" ]; then \
python3 -m pip install --no-cache-dir -U \
$VERSION \
torchvision \
torchaudio \
--extra-index-url https://download.pytorch.org/whl/$CUDA; \
# We need to specify the version if the torch version is not the latest stable one.
python3 -m pip install --no-cache-dir -U \
$TORCHCODEC_VERSION --extra-index-url https://download.pytorch.org/whl/cpu; \
else \
python3 -m pip install --no-cache-dir -U --pre \
torch \
torchvision \
torchaudio \
--extra-index-url https://download.pytorch.org/whl/nightly/$CUDA; \
python3 -m pip install --no-cache-dir -U --pre \
torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/cpu; \
fi
# 2. Regarding `torch` part, We might need to specify proper versions for `torchvision` and `torchaudio`.
# Currently, let's not bother to specify their versions explicitly (so installed with their latest release versions).
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
RUN python3 -m pip install --no-cache-dir -U timm
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir --no-build-isolation git+https://github.com/facebookresearch/detectron2.git || echo "Don't install detectron2 with nightly torch"
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir git+https://github.com/facebookresearch/detectron2.git || echo "Don't install detectron2 with nightly torch"
RUN python3 -m pip install --no-cache-dir pytesseract
@ -92,7 +53,7 @@ RUN python3 -m pip install --no-cache-dir bitsandbytes
RUN python3 -m pip install --no-cache-dir quanto
# After using A10 as CI runner, let's run FA2 tests
RUN [ "$FLASH_ATTN" != "false" ] && python3 -m pip uninstall -y ninja && python3 -m pip install --no-cache-dir ninja && python3 -m pip install flash-attn --no-cache-dir --no-build-isolation || echo "Don't install FA2 with nightly torch"
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip uninstall -y ninja && python3 -m pip install --no-cache-dir ninja && python3 -m pip install flash-attn --no-cache-dir --no-build-isolation || echo "Don't install FA2 with nightly torch"
# TODO (ydshieh): check this again
# `quanto` will install `ninja` which leads to many `CUDA error: an illegal memory access ...` in some model tests

View File

@ -10,7 +10,7 @@ RUN apt-get -y update && apt-get install -y libsndfile1-dev && apt install -y te
# Torch needs to be installed before deepspeed
RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed]
RUN python3 -m pip install --no-cache-dir --no-build-isolation torchvision git+https://github.com/facebookresearch/detectron2.git pytesseract
RUN python3 -m pip install --no-cache-dir torchvision git+https://github.com/facebookresearch/detectron2.git pytesseract
RUN python3 -m pip install -U "itsdangerous<2.1.0"
# Test if the image could successfully build the doc. before publishing the image

View File

@ -1,4 +1,4 @@
FROM rocm/pytorch:rocm7.1_ubuntu22.04_py3.10_pytorch_release_2.8.0
FROM rocm/pytorch:rocm6.4.1_ubuntu24.04_py3.12_pytorch_release_2.7.1
LABEL maintainer="Hugging Face"
ARG DEBIAN_FRONTEND=noninteractive
@ -10,8 +10,8 @@ RUN apt update && \
RUN git lfs install
RUN python3 -m pip install --no-cache-dir --upgrade pip numpy importlib-metadata setuptools wheel ninja pytesseract "itsdangerous<2.1.0"
RUN python3 -m pip install --no-cache-dir --no-build-isolation git+https://github.com/facebookresearch/detectron2.git
RUN python3 -m pip install --no-cache-dir --upgrade pip numpy
RUN python3 -m pip install --no-cache-dir --upgrade importlib-metadata setuptools ninja git+https://github.com/facebookresearch/detectron2.git pytesseract "itsdangerous<2.1.0"
ARG REF=main
WORKDIR /
@ -39,7 +39,6 @@ RUN python3 -m pip install --no-cache-dir "torchcodec==0.5"
# Install flash attention from source. Tested with commit 6387433156558135a998d5568a9d74c1778666d8
RUN git clone https://github.com/ROCm/flash-attention/ -b tridao && \
cd flash-attention && \
GPU_ARCHS="gfx942" python setup.py install
# GPU_ARCHS builds for MI300, MI325 but not MI355: we would need to add `;gfx950` but it takes too long to build.
GPU_ARCHS="gfx942" python setup.py install
RUN python3 -m pip install --no-cache-dir einops

View File

@ -29,7 +29,7 @@ RUN python3 -m pip uninstall -y apex torch torchvision torchaudio
RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM --no-cache-dir
# Pre-build DeepSpeed, so it's be ready for testing (to avoid timeout)
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --no-build-isolation --config-settings="--build-option=build_ext" --config-settings="--build-option=-j8" --no-cache-dir -v --disable-pip-version-check 2>&1
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache-dir -v --disable-pip-version-check 2>&1
ARG REF=main
WORKDIR /

View File

@ -21,7 +21,7 @@ RUN python3 -m pip install --no-cache-dir './transformers[deepspeed-testing]' 'p
# Install latest release PyTorch
# (PyTorch must be installed before pre-compiling any DeepSpeed c++/cuda ops.)
# (https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops)
RUN python3 -m pip uninstall -y torch torchvision torchaudio torchcodec && python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN python3 -m pip uninstall -y torch torchvision torchaudio && python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
@ -43,7 +43,7 @@ RUN python3 -m pip uninstall -y deepspeed
# This has to be run (again) inside the GPU VMs running the tests.
# The installation works here, but some tests fail, if we don't pre-build deepspeed again in the VMs running the tests.
# TODO: Find out why test fail.
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --no-build-isolation --config-settings="--build-option=build_ext" --config-settings="--build-option=-j8" --no-cache -v --disable-pip-version-check 2>&1
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
# `kernels` may give different outputs (within 1e-5 range) even with the same model (weights) and the same inputs
RUN python3 -m pip uninstall -y kernels

View File

@ -3,10 +3,11 @@ LABEL maintainer="Hugging Face"
SHELL ["/bin/bash", "-c"]
ARG PYTHON_VER=3.12
ARG PYTHON_VER=3.11
ENV TORCH_DEVICE_BACKEND_AUTOLOAD=0
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get remove -y python3.10 && apt-get autoremove -y
RUN apt-get update && \
apt-get install -y software-properties-common && \
add-apt-repository -y ppa:deadsnakes/ppa && \
@ -22,6 +23,7 @@ RUN apt-get update && \
apt-utils \
build-essential \
ca-certificates \
clinfo \
curl \
git \
git-lfs \
@ -33,6 +35,7 @@ RUN apt-get update && \
rsync \
sudo \
libnl-genl-3-200 \
xpu-smi \
unzip \
ffmpeg \
tesseract-ocr \
@ -42,47 +45,34 @@ RUN apt-get update && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
apt-get install -y \
linux-headers-$(uname -r) linux-modules-extra-$(uname -r) \
linux-headers-$(uname -r) \
linux-modules-extra-$(uname -r) \
flex bison \
intel-fw-gpu intel-i915-dkms xpu-smi intel-ocloc clinfo\
intel-fw-gpu intel-i915-dkms xpu-smi \
intel-opencl-icd libze-intel-gpu1 libze1 \
intel-media-va-driver-non-free libmfx-gen1 libvpl2 \
libegl-mesa0 libegl1 libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo intel-ocloc \
libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Use virtual env because Ubuntu-24 does not allowed pip on original python
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"
ENV VIRTUAL_ENV="/opt/venv"
ENV UV_PYTHON_INSTALL_DIR=/opt/uv/python
RUN uv venv --python ${PYTHON_VER} --seed ${VIRTUAL_ENV}
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN pip install --upgrade pip
RUN pip install triton==3.3.0
RUN pip install --upgrade pip wheel
RUN pip install triton==3.4.0
RUN pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/xpu --no-cache-dir
RUN pip install torch==2.8.0+xpu torchvision==0.23.0+xpu torchaudio==2.8.0+xpu --index-url https://download.pytorch.org/whl/xpu --no-cache-dir
RUN pip install evaluate torchdata pyctcdecode pytesseract decord galore-torch fire scipy scikit-learn sentencepiece sacremoses nltk rouge_score librosa soundfile g2p_en mpi4py requests_mock
RUN pip install pretty_midi essentia resampy Levenshtein av sacrebleu phonemizer invisible_watermark schedulefree
RUN pip install gguf hqq compressed_tensors gptqmodel mergekit autoawq deepspeed torchao onnx
RUN pip install hf_transfer huggingface-hub hf-doc-builder datasets optimum-quanto timm transformers accelerate optimum peft
RUN pip install torchcodec torchdata --no-cache-dir
RUN pip install evaluate pyctcdecode pytesseract decord galore-torch fire scipy scikit-learn sentencepiece sacremoses nltk rouge_score librosa soundfile g2p_en mpi4py requests_mock
RUN pip install pretty_midi essentia resampy Levenshtein av sacrebleu phonemizer invisible_watermark schedulefree setuptools
RUN pip install gptqmodel --no-build-isolation
RUN pip install gguf hqq compressed_tensors autoawq deepspeed torchao onnx auto_round
RUN pip install hf_transfer huggingface-hub hf-doc-builder datasets optimum-quanto timm transformers accelerate optimum peft diffusers trl kernels
# install liger-kernel
RUN pip install git+https://github.com/linkedin/Liger-Kernel.git --extra-index-url https://download.pytorch.org/whl/test/xpu
# install mergekit
RUN pip install --break-system-packages git+https://github.com/arcee-ai/mergekit.git@v0.1.3
# install bitsandbytes
RUN pip install git+https://github.com/bitsandbytes-foundation/bitsandbytes.git

View File

@ -24,7 +24,7 @@ RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch';
RUN echo torch=$VERSION
# `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
# Currently, let's just use their latest releases (when `torch` is installed with a release version)
RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
@ -50,7 +50,7 @@ RUN python3 -m pip install --no-cache-dir hqq
RUN python3 -m pip install --no-cache-dir gguf
# Add autoawq for quantization testing
RUN python3 -m pip install --no-cache-dir --no-build-isolation autoawq[kernels]
RUN python3 -m pip install --no-cache-dir autoawq[kernels]
# Add quanto for quantization testing
RUN python3 -m pip install --no-cache-dir optimum-quanto
@ -81,7 +81,7 @@ RUN python3 -m pip uninstall -y flash-attn
RUN cd transformers && python3 setup.py develop
# Add fp-quant for quantization testing
RUN python3 -m pip install --no-cache-dir "fp-quant>=0.3.2"
RUN python3 -m pip install --no-cache-dir "fp-quant>=0.2.0"
# Low usage or incompatible lib, will enable later on

View File

@ -24,7 +24,7 @@ pip install -e ".[dev]"
```
> [!NOTE]
> This command might fail for some OS that are missing dependencies. Check step 4 in [Create a Pull Request](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request) to work around it.
> This command might fail for some OS that are missing dependencies. Check step 4 in [Create a Pull Request](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request) to workaround it.
Then you need to install our special tool that builds the documentation:
@ -38,7 +38,7 @@ pip install git+https://github.com/huggingface/doc-builder
## Building the documentation
Once you have set up the `doc-builder` and additional packages, you can generate the documentation by
Once you have setup the `doc-builder` and additional packages, you can generate the documentation by
typing the following command:
```bash
@ -295,11 +295,12 @@ Here's an example of a tuple return, comprising several objects:
Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like
the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) in which to place these files and reference
them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
If an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate them to this dataset.
If an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate your images
to this dataset.
## Styling the docstring
We have an automatic script running with the `make style` command that will make sure that:
We have an automatic script running with the `make style` comment that will make sure that:
- the docstrings fully take advantage of the line width
- all code examples are formatted using black, like the code of the Transformers library

View File

@ -123,6 +123,8 @@
title: تشغيل التدريب على Amazon SageMaker
- local: serialization
title: التصدير إلى ONNX
- local: torchscript
title: التصدير إلى TorchScript
- local: notebooks
title: دفاتر الملاحظات مع الأمثلة
- local: community
@ -258,6 +260,8 @@
# title: النماذج
# - local: main_classes/text_generation
# title: توليد النصوص
# - local: main_classes/onnx
# title: ONNX
# - local: main_classes/optimizer_schedules
# title: التحسين
# - local: main_classes/output

View File

@ -32,7 +32,7 @@
لتصدير نموذج 🤗 Transformers إلى ONNX، قم أولاً بتثبيت اعتماد إضافي:
```bash
pip install optimum-onnx
pip install optimum[exporters]
```
للاطلاع على جميع المعامﻻت المتاحة، يرجى الرجوع إلى [وثائق 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli)، أو عرض المساعدة في سطر الأوامر:
@ -111,3 +111,60 @@ optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_s
### تصدير نموذج لهندسة غير مدعومة
إذا كنت ترغب في المساهمة من خلال إضافة دعم لنموذج لا يُمكن تصديره حاليًا، فيجب عليك أولاً التحقق مما إذا كان مدعومًا في [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview)، وإذا لم يكن مدعومًا، [فيمكنك المساهمة في 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) مُباشرةً.
### تصدير نموذج باستخدام `transformers.onnx`
<Tip warning={true}>
لم يعد يتم دعم `transformers.onnx` يُرجى تصدير النماذج باستخدام 🤗 Optimum كما هو موضح أعلاه. سيتم إزالة هذا القسم في الإصدارات القادمة.
</Tip>
لتصدير نموذج 🤗 Transformers إلى ONNX باستخدام `transformers.onnx`، ثبّت التبعيات الإضافية:
```bash
pip install transformers[onnx]
```
استخدم حزمة `transformers.onnx` كنموذج Python لتصدير نقطة حفظ باستخدام تكوين جاهز:
```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```
يُصدّر هذا رسمًا بيانيًا ONNX لنقطة الحفظ المُحددة بواسطة وسيطة `--model`. مرر أي نقطة حفظ على 🤗 Hub أو نقطة حفظ مُخزنة محليًا.
يُمكن بعد ذلك تشغيل ملف `model.onnx` الناتج على أحد المُسرعات العديدة التي تدعم معيار ONNX. على سبيل المثال، قم بتحميل وتشغيل النموذج باستخدام ONNX Runtime كما يلي:
```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # يتوقع ONNX Runtime مصفوفات NumPy كمدخلات
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```
يُمكن الحصول على أسماء المخرجات المطلوبة (مثل `["last_hidden_state"]`) من خلال إلقاء نظرة على تكوين ONNX لكل نموذج. على سبيل المثال، بالنسبة لـ DistilBERT، لدينا:
```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```
العمليات مُتطابقة لنقاط الحفظ TensorFlow على Hub. على سبيل المثال، صدّر نقطة حفظ TensorFlow خالصة كما يلي:
```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```
لتصدير نموذج مُخزن محليًا، احفظ أوزان النموذج ومجزىء اللغوى في نفس الدليل (على سبيل المثال `local-pt-checkpoint`)، ثم قم بتصديره إلى ONNX عن طريق توجيه وسيط `--model` لحزمة `transformers.onnx` إلى الدليل المطلوب:
```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```

View File

@ -0,0 +1,154 @@
# التصدير إلى TorchScript
<Tip>
هذه هي بداية تجاربنا مع TorchScript ولا زلنا نستكشف قدراته مع نماذج المدخلات المتغيرة الحجم. إنه مجال اهتمامنا وسنعمق تحليلنا في الإصدارات القادمة، مع المزيد من الأمثلة البرمجية، وتنفيذ أكثر مرونة، ومقاييس مقارنة بين الأكواد القائمة على Python مع أكواد TorchScript المُجمّعة.
</Tip>
وفقًا لـ [وثائق TorchScript](https://pytorch.org/docs/stable/jit.html):
> TorchScript هي طريقة لإنشاء نماذج قابلة للتسلسل والتحسين من تعليمات PyTorch البرمجية.
هناك وحدتان من PyTorch، [JIT and TRACE](https://pytorch.org/docs/stable/jit.html)، تتيحان للمطورين تصدير نماذجهم لإعادة استخدامها في برامج أخرى مثل برامج C++ المُحسّنة للأداء.
نقدم واجهة تتيح لك تصدير نماذج 🤗 Transformers إلى TorchScript بحيث يمكن إعادة استخدامها في بيئة مختلفة عن برامج Python القائمة إلى PyTorch. هنا نشرح كيفية تصدير نماذجنا واستخدامها باستخدام TorchScript.
يتطلب تصدير نموذج أمرين:
- تهيئة مثيل للنموذج باستخدام علامة `torchscript`
- تمرير مُدخلات وهمية (dummy inputs) خلال النموذج
تنطوي هذه الضرورات على عدة أمور يجب على المطورين توخي الحذر بشأنها كما هو مفصل أدناه.
## علامة TorchScript والأوزان المرتبطة
علامة `torchscript` ضرورية لأن معظم نماذج اللغة 🤗 Transformers لها أوزان مرتبطة بين طبقة `Embedding` وطبقة `Decoding`. لا يسمح لك TorchScript بتصدير النماذج ذات الأوزان المرتبطة، لذلك من الضروري فصل الأوزان ونسخها مسبقًا.
النماذج المُهيأة باستخدام علامة `torchscript` لها طبقة `Embedding` وطبقة`Decoding` منفصلتين، مما يعني أنه لا ينبغي تدريبها لاحقًا. سيؤدي التدريب إلى عدم تزامن الطبقتين، مما يؤدي إلى نتائج غير متوقعة.
هذا لا ينطبق على النماذج التي لا تحتوي على رأس نموذج اللغة، حيث لا تملك أوزانًا مرتبطة. يمكن تصدير هذه النماذج بأمان دون علامة `torchscript`.
## المدخلات الوهمية والأطوال القياسية
تُستخدم المُدخلات الوهمية لتمرير أمامي خلال النموذج. أثناء انتشار قيم المُدخلات عبر الطبقات، يتتبع PyTorch العمليات المختلفة التي يتم تنفيذها على كل مصفوفة(tensor). ثم يتم استخدام هذه العمليات المُسجلة بعد ذلك لإنشاء *أثر* النموذج.
يتم إنشاء التتبع بالنسبة لأبعاد المُدخلات. وبالتالي، فهو مُقيّد بأبعاد المُدخلات الوهمية، ولن يعمل لأي طول تسلسل أو حجم دفعة مختلف. عند المحاولة بحجم مختلف، يتم رفع الخطأ التالي:
```
`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`
```
نوصي بتتبع النموذج باستخدام حجم مُدخلات وهمية لا يقل عن أكبر مُدخل سيتم تقديمه للنموذج أثناء الاستدلال. يمكن أن تساعد الحشوة(padding) في ملء القيم المفقودة. ومع ذلك، نظرًا لتتبع النموذج بحجم مُدخل أكبر، ستكون أبعاد المصفوفة ستكون كبيرة أيضًا، مما يؤدي عنه المزيد من الحسابات.
انتبه إلى إجمالي عدد العمليات المُنفذة على كل مُدخل وتابع الأداء عن كثب عند تصدير نماذج متغيرة طول التسلسل.
## استخدام TorchScript في Python
يوضح هذا القسم كيفية حفظ النماذج وتحميلها، بالإضافة إلى كيفية استخدام التتبع للاستدلال.
### حفظ نموذج
لتصدير `BertModel` باستخدام TorchScript، قم بتهيئة ـ `BertModel` من فئة `BertConfig` ثم احفظه على القرص تحت اسم الملف `traced_bert.pt`:
```python
from transformers import BertModel, BertTokenizer, BertConfig
import torch
enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = "[MASK]"
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]
# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(
vocab_size_or_config_json_file=32000,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
torchscript=True,
)
# Instantiating the model
model = BertModel(config)
# The model needs to be in evaluation mode
model.eval()
# If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True)
# Creating the trace
traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
torch.jit.save(traced_model, "traced_bert.pt")
```
### تحميل نموذج
يمكنك الآن تحميل `BertModel` المُحفظ سابقًا، `traced_bert.pt`، من القرص واستخدامه على `dummy_input` المُهيأ سابقًا:
```python
loaded_model = torch.jit.load("traced_bert.pt")
loaded_model.eval()
all_encoder_layers, pooled_output = loaded_model(*dummy_input)
```
### استخدام نموذج مُتتبع للاستدلال
استخدم النموذج المُتتبع للاستدلال باستخدام أسلوب `__call__` الخاص به:
```python
traced_model(tokens_tensor, segments_tensors)
```
## نشر نماذج Hugging Face TorchScript على AWS باستخدام Neuron SDK
قدمت AWS عائلة [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) من اﻷجهزة لخفض التكلفة وأداء التعلم الآلي عالي الأداء في البيئة السحابية. تعمل أجهزة Inf1 بواسطة شريحة Inferentia من AWS، وهي مُسرّع أجهزة مُخصص، متخصص في أعباء عمل الاستدلال للتعلم العميق. [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#) هي SDK لـ Inferentia التي تدعم تتبع نماذج المحولات وتحسينها للنشر على Inf1. توفر Neuron SDK ما يلي:
1. واجهة برمجة تطبيقات سهلة الاستخدام مع تغيير سطر واحد من التعليمات البرمجية لتتبع نموذج TorchScript وتحسينه للاستدلال في البيئة السحابية.
2. تحسينات الأداء الجاهزة للاستخدام [تحسين التكلفة والأداء](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/>).
3. دعم نماذج Hugging Face المحولات المبنية باستخدام إما [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html) أو [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html).
### الآثار المترتبة
تعمل نماذج المحولات المستندة إلى بنية [BERT (تمثيلات الترميز ثنائية الاتجاه من المحولات)](https://huggingface.co/docs/transformers/main/model_doc/bert) أو متغيراتها مثل [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert) و [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta) بشكل أفضل على Inf1 للمهام غير التوليدية مثل الإجابة على الأسئلة الاستخراجية، وتصنيف التسلسلات، وتصنيف الرموز (tokens). ومع ذلك، يمكن تكييف مهام توليد النصوص للعمل على Inf1 وفقًا لهذا [برنامج تعليمي AWS Neuron MarianMT](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html). يمكن العثور على مزيد من المعلومات حول النماذج التي يمكن تحويلها جاهزة على Inferentia في قسم [ملاءمة بنية النموذج](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia) من وثائق Neuron.
### التبعيات (Dependencies)
يتطلب استخدام AWS Neuron لتحويل النماذج [بيئة SDK Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide) والتي تأتي مسبقًا على [AMI للتعلم العميق من AWS](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html).
### تحويل نموذج لـ AWS Neuron
قم بتحويل نموذج لـ AWS NEURON باستخدام نفس التعليمات البرمجية من [استخدام TorchScript في Python](torchscript#using-torchscript-in-python) لتتبع `BertModel`. قم باستيراد امتداد إطار عمل `torch.neuron` للوصول إلى مكونات Neuron SDK من خلال واجهة برمجة تطبيقات Python:
```python
from transformers import BertModel, BertTokenizer, BertConfig
import torch
import torch.neuron
```
كل ما عليك فعله هو تعديل السطر التالي:
```diff
- torch.jit.trace(model, [tokens_tensor, segments_tensors])
+ torch.neuron.trace(model, [token_tensor, segments_tensors])
```
يتيح ذلك لـ Neuron SDK تتبع النموذج وتحسينه لمثيلات Inf1.
لمعرفة المزيد حول ميزات AWS Neuron SDK والأدوات ودروس البرامج التعليمية والتحديثات الأخيرة، يرجى الاطلاع على [وثائق AWS NeuronSDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html).

View File

@ -508,16 +508,16 @@ BERT `_init_weights` Methode:
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.zero_()
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.zero_()
module.weight.fill_(1.0)
module.bias.data.zero_()
module.weight.data.fill_(1.0)
```
Sie können weitere benutzerdefinierte Schemata verwenden, wenn Sie eine spezielle Initialisierung für einige Module benötigen. Zum Beispiel in
@ -533,9 +533,9 @@ def _init_weights(self, module):
module.project_hid._is_hf_initialized = True
module.project_q._is_hf_initialized = True
elif isinstance(module, nn.Linear):
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.zero_()
module.bias.data.zero_()
```
Das Flag `_is_hf_initialized` wird intern verwendet, um sicherzustellen, dass wir ein Submodul nur einmal initialisieren. Wenn Sie es auf

View File

@ -88,8 +88,6 @@
title: Tool use
- local: chat_templating_writing
title: Writing a chat template
- local: chat_response_parsing
title: Response parsing
title: Chat with models
- sections:
- local: serving
@ -118,9 +116,7 @@
- local: tools
title: Tools
- local: transformers_as_backend
title: Transformers as modeling backend
- local: continuous_batching
title: Continuous Batching
title: Inference server backends
title: Inference
- isExpanded: false
sections:
@ -231,6 +227,8 @@
title: ONNX
- local: executorch
title: ExecuTorch
- local: torchscript
title: TorchScript
title: Export to production
- isExpanded: false
sections:
@ -1008,8 +1006,6 @@
title: AltCLIP
- local: model_doc/aria
title: Aria
- local: model_doc/audioflamingo3
title: AudioFlamingo3
- local: model_doc/aya_vision
title: AyaVision
- local: model_doc/blip
@ -1259,8 +1255,6 @@
title: Importing Utilities
- local: internal/time_series_utils
title: Utilities for Time Series
- local: internal/rope_utils
title: Rotary Embeddings Utilities
title: Internal helpers
- sections:
- local: reference/environment_variables

View File

@ -55,7 +55,6 @@ deepspeed --num_gpus 2 trainer-program.py ...
</hfoptions>
## Order of accelerators
To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
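On CUDA hardware that means `CUDA_VISIBLE_DEVICES`; other accelerators have their own equivalents. As a minimal sketch (assuming CUDA), the same selection works from inside a script as long as it happens before any CUDA initialization:

```python
import os

# Expose only accelerators 0 and 2, in that order, before torch touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"

import torch

print(torch.cuda.device_count())  # 2: the visible devices are re-indexed as cuda:0 and cuda:1
```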

View File

@ -314,16 +314,16 @@ Random initialization occurs in the `_init_weights` method of `BrandNewLlamaPreT
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.zero_()
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.zero_()
module.weight.fill_(1.0)
module.bias.data.zero_()
module.weight.data.fill_(1.0)
```
The initialization scheme can look different if you need to adapt it to your model. For example, [`Wav2Vec2ForPreTraining`] initializes [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) in its last two linear layers.
@ -339,9 +339,9 @@ def _init_weights(self, module):
module.project_hid._is_hf_initialized = True
module.project_q._is_hf_initialized = True
elif isinstance(module, nn.Linear):
module.weight.normal_(mean=0.0, std=self.config.initializer_range)
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.zero_()
module.bias.data.zero_()
```
### Convert checkpoints to Transformers

View File

@ -95,12 +95,9 @@ print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))
The chat model called the `get_current_temperature` tool with the correct parameters from the docstring. It inferred France as the location based on Paris, and that it should use Celsius for the units of temperature.
A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history. For
models that support [response parsing](./chat_response_parsing), the response parsing will be handled automatically, and you can just use
[`~PreTrainedTokenizer.parse_response`] to extract the tool call. For other models, you'll need to manually translate the output
string into a tool call dict.
A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history.
Regardless of the approach you use, the tool call should go in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
Hold the call in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
> [!WARNING]
> Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
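As a minimal sketch (reusing the `get_current_temperature` example from above; the argument values are illustrative), appending the call and the tool result to the history looks like this:

```python
# The model requested a call to get_current_temperature; record it as a dict, not a JSON string.
messages.append(
    {
        "role": "assistant",
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_current_temperature",
                    "arguments": {"location": "Paris, France", "unit": "celsius"},
                },
            }
        ],
    }
)

# Run the tool yourself, then append its output so the model can use it in the next turn.
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22"})
```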

View File

@ -1,233 +0,0 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Response Parsing
It is increasingly common for chat models to generate structured outputs, rather than just a single reply string.
The most common uses for structured outputs are [tool calling](./chat_extras) and [reasoning models](https://huggingface.co/reasoning-course).
Tool calling models can output tool calls, containing the name of the tool to call and any arguments to be passed to it,
while reasoning models often output reasoning steps as a "chain of thought". Some recent models even use both of these,
and may output reasoning and/or one or more tool calls before their final answer.
Models with structured outputs pose a challenge for chat templating, because the output needs to be parsed before it
can be appended to the chat. For a concrete example, let's say we ask [GPT-OSS](https://huggingface.co/openai/gpt-oss-120b)
what the weather is like, and it thinks and decides to call a tool. Here's what the raw model output might look like:
```txt
<|start|>analysis<|message|>The user asks: "What is the weather like in SF?" We need to get the location of the user? The user explicitly asks about SF (San Francisco).
So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data.
So we should call get_current_weather with location "San Francisco, CA". Let's do that.
We will call function get_current_weather.<|end|><|start|>commentary to=functions.get_current_weather<|channel|>commentary <|constrain|>json<|message|>{"location":"San Francisco, CA"}<|call|>
```
But if you want to append this to a chat, you'll need to format it as a chat message dict, like this:
```json
{
"role": "assistant",
"thinking": "The user asks: \"What is the weather like in SF?\" We need to get the location of the user? The user explicitly asks about SF (San Francisco). So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data. So we should call get_current_weather with location \"San Francisco, CA\". Let's do that.",
"tool_calls": [
{
"name": "get_current_weather",
"arguments": {
"location": "San Francisco, CA"
}
}
]
}
```
Chat **templates** give us a way to turn messages into formatted input for a model, but we need something else to
parse model output back into a standard message dict. This is what chat **parsing** is for.
## The [`~PreTrainedTokenizerBase.parse_response`] method
Parsing a chat response on a model that supports it is straightforward. Simply take the raw, decoded output from
[`~generation.GenerationMixin.generate`], and pass it to the tokenizer's [`~PreTrainedTokenizerBase.parse_response`] method:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")
messages = [
{
"role": "user",
"content": "Hey! Can you summarize the end of the Cold War as briefly as possible? Like, comically briefly. It should really leave out almost most of the relevant information."
}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=1024)[0, input_ids.shape[1]:]
out_text = tokenizer.decode(outputs)
parsed = tokenizer.parse_response(out_text)
print(parsed.keys())
```
And you should get:
```text
dict_keys(['thinking', 'content'])
```
And that's all you need to start using response parsing! `parse_response` should return a complete message dict that is ready to be appended to the chat history.
When the tokenizer does not support response parsing, `parse_response` will throw an error. We hope to add support
to more tokenizers over time.
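Because the parsed output is already a complete message dict, continuing the conversation is just a matter of appending it to the history before the next turn. A minimal continuation of the example above:

```python
messages.append(parsed)
messages.append({"role": "user", "content": "Can you make it even shorter?"})

# Re-apply the chat template with the updated history for the next generation step.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
).to(model.device)
```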
## Developers: Understanding a simple response schema
Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents
the structure of the output message dict. The schema is augmented with additional fields that indicate how the
output message string should be parsed into the expected format. Let's take a look at the schema for a SmolLM response,
excluding tool calls for now:
```python
{
"x-regex": "(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)",
"type": "object",
"properties": {
"role": {"const": "assistant"},
"content": {"type": "string"},
"thinking": {"type": "string"}
}
}
```
We can see that the schema describes a JSON "object" (a `dict`, in other words) with three keys: `role`, `content`, and `thinking`.
Because all assistant responses have the role "assistant", the `role` key is a `const`(ant). The other two keys are strings, extracted
from the named groups in the regex in the `x-regex` field.
Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need
to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like
chat templates, this schema will be saved with the processor, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to
save and share the schema.
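As a rough sketch (the checkpoint name is a placeholder; the schema is the SmolLM-style one shown above), wiring this up for your own tokenizer looks like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-org/my-chat-model")  # placeholder checkpoint

# parse_response() consults this property, much like apply_chat_template() consults chat_template.
tokenizer.response_schema = {
    "x-regex": r"(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)",
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        "content": {"type": "string"},
        "thinking": {"type": "string"},
    },
}

# The schema is saved alongside the tokenizer, so it can be shared with save_pretrained() or push_to_hub().
tokenizer.save_pretrained("my-chat-model-with-schema")
```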
## Developers: Complex schemas
Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser
internals. For this, we'll use the `GPT-OSS` schema. GPT-OSS emits both tool calls and thinking blocks, and it uses
an unusual format where model responses are tagged with one of three "channels": `commentary` for things like
tool calls, `analysis` for chain of thought blocks, and `final` for messages intended to be sent to the user.
A full message where the model calls a tool named `get_current_weather` might look like this, with some extra linebreaks added for clarity:
```text
<|channel|>analysis<|message|>
The user asks: "What is the weather like in SF?" So we need to get the current weather in San Francisco, CA.
We need to call get_current_weather function. So we should call get_current_weather with location "San Francisco, CA".
<|end|>
<|start|>assistant<|channel|>commentary
to=functions.get_current_weather <|constrain|>json<|message|>
{
"location": "San Francisco, CA"
}
<|call|>
```
Parsing proceeds recursively; the output of a regex (or other parser) at one level becomes the input to the nodes below it.
In other words, don't feel like you have to parse the entire output in one enormous regex! Instead, start with the schema,
and then add regexes to extract the relevant chunks as you go. Here's a schema that will parse it, with some
explanatory comments:
```python
{
"type": "object",
"properties": {
"role": {"const": "assistant"},
# "content" and "thinking" are both similar to the previous example, and just extract a single string
# However, rather than using a single regex with named groups to extract both, we use a regex in each subkey.
# When an object node has no parser/regex, the entire input string is passed to all of its children, so
# parsing can either be done with named groups at the object level, or with separate regexes at the property level.
"content": {"type": "string", "x-regex": r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|$)"},
"thinking": {"type": "string", "x-regex": r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>"},
"tool_calls": {
# "x-regex-iterator" uses re.findall to find multiple possible manages, and returns them as an
# array/list. You don't need to worry about array handling, though - each item in the array will be
# parsed by the `items` schema, so just write the schema for a single item.
"x-regex-iterator": r"<\|channel\|>commentary (to=functions\..*?<\|message\|>.*?)(?:<\|call\|>|$)",
"type": "array",
"items": {
"type": "object",
"properties": {
# A const property is a fixed value, and the input has no effect on it.
"type": {"const": "function"},
# Here, we wrap the entire tool call dict in a `{"function": ...}` block. The input string is passed through to it unchanged.
"function": {
"type": "object",
"properties": {
"name": {"type": "string", "x-regex": r"^to=functions\.(\w+)"},
"arguments": {
"type": "object",
"x-regex": "<\|message\|>(.*)",
# The "x-parser" field indicates that the extracted string should be parsed as JSON.
# The output is then passed to the schema nodes below and recursive parsing continues.
"x-parser": "json",
"additionalProperties": {"type": "any"},
},
},
},
},
},
},
},
}
```
## Developers: Understanding the parser logic
The parser follows a few simple rules:
1. Each level of the schema receives input from the level above, applies any regex or parser it has, and then passes the output to its children.
2. The root level receives the entire decoded model output string as input.
3. If a node has structured content after parsing (for example, if the regex has named groups and returns a dict, or if the parser returns a dict or list),
then that structured content is mapped to the node's children, and each child node receives its corresponding value as input.
4. If an `object` (dict) node has unstructured (string) output, then the entire string is passed to all of its children. This allows child nodes
to handle parsing individually rather than requiring a single parent regex to extract all keys at once.
5. If an `array` (list) node has unstructured (string) output, then this throws an error.
There is a small set of allowable `x-` keys that indicate how parsing should be done at each node:
- `x-regex`: A regex string to apply to the input. If the regex has named groups, the output is a dict of group names to values. Named groups should only be used in `object` nodes.
Otherwise, the regex must have exactly one unnamed capturing group, and the output is the value of that group as a string.
- `x-regex-iterator`: A regex string to apply to the input using `re.findall()`. The output is a list of all matches.
This should only be used in `array` nodes, and the regex must have exactly one unnamed capturing group. The output is distributed to
the node's `items` schema.
- `x-parser`: Calls a built-in parser to apply to the input. Currently, the only supported parser is `json`, which parses the input string as JSON.
The output is passed to the child nodes for further parsing. Note that the `json` parser can return deeply nested output - in this case, the output
will be progressively unwrapped as it is passed through child nodes. The child nodes do not need additional `x-parser` or `x-regex` fields in this case,
but their structure must match the structure of the parsed JSON.
- `x-parser-args`: Only allowed in conjunction with `x-parser`. This is a dict of additional arguments that control parsing. Right now, the only supported
argument is `transform`, which specifies a `jmespath` transformation to apply to the output. This is useful when the JSON parser returns a structure
that needs to be modified to match the schema.
- `x-regex-key-value`: This is rarely necessary, but it can be useful when parsing key-value pairs in non-JSON format where the names of the keys are not known
in advance, such as when a model emits XML tool calls with arbitrary argument names. The regex must have exactly two named capturing groups,
`key` and `value`, and the output is a dict mapping keys to values. This should only be used in `object` nodes.
In general, multiple regexes/parsers cannot be combined at the same level. The exception is that `x-regex`, returning a single string, can be combined with the other parsers. In this case,
`x-regex` is applied first, and then the output is passed to the other parser, either `x-regex-iterator`, `x-parser`, or `x-regex-key-value`.
Putting these ideas together, you can see that the input flows through the schema, being parsed at each level and then distributed to child nodes. Each level
only needs to extract the input content that is relevant for that part of the schema, and can then let its child nodes handle the rest. Internally, this is handled
with a parser function that receives input, applies any regexes/parsers at the current level, then maps the result to its child nodes before recursively calling itself on each of them.
Recursion terminates when it reaches leaf nodes, usually primitive types like `string` or `number`, which simply return the input they receive.
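Conceptually, that recursion fits in a few lines. The sketch below is not the transformers implementation; it only illustrates the control flow described above, with an invented `parse_node` helper, and it omits the `json`/`jmespath` parsers and error handling:
```py
import re

def parse_node(schema: dict, value):
    # apply x-regex first: named groups -> dict, single unnamed group -> string
    if isinstance(value, str) and "x-regex" in schema:
        m = re.search(schema["x-regex"], value, re.DOTALL)
        value = m.groupdict() or m.group(1)
    # a second parser may consume the string extracted by x-regex
    if isinstance(value, str) and "x-regex-iterator" in schema:
        value = re.findall(schema["x-regex-iterator"], value, re.DOTALL)
    # distribute structured output to children and recurse
    if schema.get("type") == "object" and "properties" in schema:
        if isinstance(value, dict):
            return {key: parse_node(child, value[key]) for key, child in schema["properties"].items()}
        # unstructured string: every child receives the full string
        return {key: parse_node(child, value) for key, child in schema["properties"].items()}
    if schema.get("type") == "array":
        if isinstance(value, str):
            raise ValueError("array nodes require structured (list) input")
        return [parse_node(schema["items"], item) for item in value]
    return value  # leaf node: return the input unchanged
```
Applied to the hypothetical schema above, this returns `{'name': 'Ada', 'scores': ['81', '92', '75']}`; a real implementation would also coerce leaf values to their declared types.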

View File

@ -6,13 +6,13 @@ rendered properly in your Markdown viewer.
This page regroups resources around 🤗 Transformers developed by the community.
## Community resources
## Community resources:
| Resource | Description | Author |
|:----------|:-------------|------:|
| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learned/revised using [Anki](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |
## Community notebooks
## Community notebooks:
| Notebook | Description | Author | |
|:----------|:-------------|:-------------|------:|

View File

@ -1,194 +0,0 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Continuous Batching
Continuous Batching (CB) is an advanced technique to optimize the inference of transformer models by dynamically grouping multiple requests into batches. This approach maximizes GPU utilization and throughput, specifically for workloads with many variable-length inputs.
We are particularly interested in having Continuous Batching in transformers for the following use cases:
- Evaluation of models on large datasets with variable-length inputs
- Generating outputs for multiple sequences for GRPO policies
CB is what makes inference engines like vLLM or SGLang efficient. That being said, transformers does not aim to be a production-ready inference engine, but a complete framework for model development. For this reason, CB is available in `transformers serve`.
If you are not familiar with some of the core concepts CB is built upon, we invite you to read the associated blog post: [Continuous Batching: Efficient Inference for Large Language Models](https://huggingface.co/blog/continuous-batching). _broken link for now_
## API Reference
## Usage Examples
The main way to use CB in transformers is via the `generate_batch` method.
Unlike `generate`, CB takes already tokenized inputs, known as input IDs. Each sequence of input IDs is represented as a list of integers, in Python: `list[int]`.
For a more detailed example, please refer to: [examples/continuous_batching](./path/to/example)
### `generate_batch` example
We have created a `ContinuousMixin` that is inherited by the `GenerationMixin` so that all autoregressive text models support CB.
This adds the `generate_batch` method to all models that inherit from `GenerationMixin`.
You can use it as follows:
```py
import datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3-4B-Instruct-2507",
attn_implementation="spda_paged",
device_map="cuda", # if you need cuda
dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side="left")
# prepare a batch of inputs
dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test")
dataset = dataset.select(range(100))  # keep a small sample for the example
tokenized_datasets = dataset.map(lambda x: tokenizer(x["question"]), batched=True)
simple_batch_inputs = [item["input_ids"] for item in tokenized_datasets]
generation_config = GenerationConfig(
max_new_tokens=32,
use_cuda_graph=False, # Not supported for simple version
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
do_sample=False,
max_batch_tokens=512, # max number of tokens in a batch, this is just a default value you should tune based on your hardware
)
batch_outputs = model.generate_batch(
inputs=simple_batch_inputs,
generation_config=generation_config,
)
for request_id, output in batch_outputs.items():
generated_text = tokenizer.decode(output.generated_tokens, skip_special_tokens=True)
print(f"Request {request_id} output: {generated_text}")
```
### `ContinuousBatchingManager` example
If you want more control over how requests are scheduled with CB, you can use the `ContinuousBatchingManager` class directly.
This is what we use in `transformers serve` because requests arrive asynchronously and we can leverage the asynchronous nature of the CB process to make things more efficient.
Under the hood, the `ContinuousBatchingManager` creates a background thread that pulls requests from a Python `queue.Queue` and batches them into each forward pass.
Note that the manager is thread-safe!
```py
import datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from transformers.generation.continuous_batching import RequestStatus
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3-4B-Instruct-2507",
attn_implementation="spda_paged",
device_map="cuda", # if you need cuda
dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side="left")
# prepare a batch of inputs
dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test")
dataset = dataset.select(range(100))  # keep a small sample for the example
tokenized_datasets = dataset.map(lambda x: tokenizer(x["question"]), batched=True)
simple_batch_inputs = [item["input_ids"] for item in tokenized_datasets]
# build a generation config for the manager
generation_config = GenerationConfig(
max_new_tokens=32,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
do_sample=False,
)
# initialize the manager, a method available thanks to the `ContinuousMixin`
manager = model.init_continuous_batching(generation_config=generation_config)
# start the background thread
manager.start()
# this is for demonstration purposes only, in practice this is most useful to do concurrently
for i, input in enumerate(simple_batch_inputs):
request_id = manager.add_request(input_ids=input, request_id=f"request_{i}") # if you do not specify a request_id, one will be generated for you
# Can be done in another thread
for id, request in manager.get_result():
generated_text = tokenizer.decode(request.generated_tokens, skip_special_tokens=True)
print(f"Request {id} output: {generated_text}")
# you can also get results for a specific request id
result = manager.get_result(request_id="request_5") # this is blocking and will wait for the result to be ready
# or get results for a request that is streaming
manager.add_request(
input_ids=input,
request_id="streaming_request",
stream=True,
)
for chunk in manager.request_id_iter(request_id="streaming_request"):
generated_text = tokenizer.decode(chunk.generated_tokens, skip_special_tokens=True)
print(generated_text)
# FIXME: stop iteration in `request_id_iter` when finished instead of doing it externally
if chunk.status == RequestStatus.FINISHED:
break
# stop the background thread before exiting the process
manager.stop()
```
## Supported & Unsupported Features
### Supported Features
- Dynamic scheduling of variable-length requests
- Chunked prefill
- Paged Attention Cache
- Sliding window attention
- Chat templates
### Unsupported Features
At the moment, the following features are not supported with CB. We plan to add support for the following:
- Prefix caching
- Beam search
- Tool calling
The others are unplanned, but depending on community requests we might consider adding them:
- MTP (multi token prediction)
- Medusa
## Performance Considerations
## Integration with Serving
You can use CB in `transformers serve` by passing the `--continuous-batching` flag when starting the server.
## Monitoring
We have added `opentelemetry` support to Continuous Batching to help you monitor its performance in production. To enable it, you need to install the `opentelemetry` extra when installing `transformers`:
```sh
# this installs `opentelemetry-api`, `opentelemetry-sdk` and `opentelemetry-exporter-otlp`
pip install transformers[open-telemetry]
```
This will enable traces and metrics collection in CB. You will then have to set up a backend to collect and visualize the traces and metrics; a minimal exporter setup is sketched below.
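As a sketch of the exporter side, the standard OpenTelemetry SDK lets you register a global tracer provider that ships spans to an OTLP collector. The endpoint and service name below are placeholders for your own collector setup, and CB is assumed to pick up the globally registered provider:
```py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# send spans to a local OTLP collector; adjust the endpoint to your deployment
provider = TracerProvider(resource=Resource.create({"service.name": "transformers-serve"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")))
trace.set_tracer_provider(provider)
```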

View File

@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) runs PyTorch models on mobile and edge devices. Export your Transformers models to the ExecuTorch format with [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) with the command below.
```bash
```
optimum-cli export executorch \
--model "HuggingFaceTB/SmolLM2-135M-Instruct" \
--task "text-generation" \
@ -29,5 +29,4 @@ optimum-cli export executorch \
--qembedding 8w \
--output_dir="hf_smollm2"
```
Run `optimum-cli export executorch --help` to see all export options. For detailed export instructions, check the [README](optimum/exporters/executorch/README.md).

View File

@ -37,6 +37,7 @@ def model_init(trial):
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
token=True if model_args.use_auth_token else None,
)
```

View File

@ -320,7 +320,7 @@ df.sort_values(by=['skipped_proportion'], ascending=False)
You can focus on a specific test method using `--test_method_name`:
```bash
python utils/scan_skipped_tests.py --test_method_name test_inputs_embeds --output_dir path/to/output
$ python utils/scan_skipped_tests.py --test_method_name test_inputs_embeds --output_dir path/to/output
```
- `--test_method_name`: Name of the test method to scan (e.g., `test_inputs_embeds`).

View File

@ -1,83 +0,0 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Utilities for Rotary Embedding
This page explains how the Rotary Embedding is computed and applied in Transformers and what types of RoPE are supported.
## Overview
Rotary Position Embeddings are a technique used to inject positional information into attention mechanisms without relying on explicit position encodings.
Instead of adding position vectors to token embeddings, RoPE rotates query and key vectors in the complex plane according to their positions, enabling relative positional awareness and better extrapolation to unseen sequence lengths.
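In a simplified view, each pair of query/key features at position `m` is rotated by an angle that grows linearly with the position, with per-pair frequencies derived from the base `rope_theta` and the head dimension `d`:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad
\theta_i = \texttt{rope\_theta}^{-2i/d}
$$

The scaled variants in the table below change either the frequencies or the effective position to stretch the usable context window.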
The Transformers library provides a flexible and extensible implementation of various RoPE types defined in [`~modeling_rope_utils.ROPE_VALIDATION_FUNCTIONS`], including both the default and scaled variants:
| Rope Type | Description |
|------------|-------------|
| `"default"` | Standard rotary embedding as in LLaMA. |
| `"linear"` | Linear-scaled RoPE which allows longer context windows. |
| `"dynamic"` | NTK-aware scaling computed by rescaling frequency base (`θ`) for longer context. |
| `"yarn"` | YaRN scaling variant providing smoother extrapolation and stability. |
| `"longrope"` | [LongRoPE](https://github.com/microsoft/LongRoPE) scaling as in Phi-2 model series. |
| `"llama3"` | RoPE scaling as in Llama3.1. |
## Configuration in Model Configs
To enable and customize rotary embeddings, add a `rope_parameters` field to your model's configuration file (`config.json`). This field controls the RoPE behavior across model layers. Note that each RoPE variant defines its own set of expected keys, and missing keys will raise an error. See the example below, which creates a Llama config with default RoPE parameters:
```python
from transformers import LlamaConfig
config = LlamaConfig()
config.rope_parameters = {
"rope_type": "default", # type of RoPE to use
"rope_theta": 10000.0 # base frequency parameter
}
# If we want to apply a scaled RoPE type, we need to pass extra parameters
config.rope_parameters = {
"rope_type": "linear",
"rope_theta": 10000.0,
"factor": 8.0 # scale factor for context extension
}
```
## Per-Layer-Type RoPE Configuration
Some models such as Gemma-3 use different layer types with different attention mechanisms, i.e. "full attention" in some blocks and "sliding-window attention" in others. Transformers supports specifying distinct RoPE parameters per layer type for these models. In this case, `rope_parameters` should be a nested dictionary, where top-level keys correspond to `config.layer_types` and values are per-type RoPE parameters. During model initialization, each decoder layer will automatically look up the matching RoPE configuration based on its declared layer type.
```python
from transformers import Gemma3Config
config = Gemma3Config()
config.rope_parameters = {
"full_attention": {
"rope_type": "dynamic",
"rope_theta": 1000000.0,
"factor": 8.0,
"original_max_position_embeddings": 8096,
},
"sliding_attention": {
"rope_type": "default",
"rope_theta": 10000.0,
}
}
```
## Utilities
[[autodoc]] RopeParameters
- __call__

View File

@ -1,3 +1,3 @@
# Overview
Kernels in transformers optimize model performance with custom layers pulled from the Hub at very low effort.
Kernels in transformers optimize model performance with custom layers pulled from the Hub at very low effort.

View File

@ -208,7 +208,7 @@ Some models have a unique way of storing past kv pairs or states that is not com
Mamba models, such as [Mamba](./model_doc/mamba), require a specific cache because the model doesn't have an attention mechanism or kv states. Thus, they are not compatible with the above [`Cache`] classes.
## Iterative generation
# Iterative generation
A cache can also work in iterative generation settings where there is back-and-forth interaction with a model (chatbots). Like regular generation, iterative generation with a cache allows a model to efficiently handle ongoing conversations without recomputing the entire context at each step.
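For example, a chat loop can keep one cache object alive across turns so earlier turns are never recomputed. The snippet below is a sketch rather than a drop-in recipe; it assumes a recent version where a [`DynamicCache`] can be passed to `generate` through `past_key_values`, and the checkpoint is chosen only for illustration.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", dtype="auto")

past_key_values = DynamicCache()  # reused across turns
messages = []
for user_input in ["What is the tallest mountain on Earth?", "How high is it?"]:
    messages.append({"role": "user", "content": user_input})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, past_key_values=past_key_values, max_new_tokens=64)
    reply = tokenizer.decode(outputs[0, inputs.shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})
    print(reply)
```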

View File

@ -393,9 +393,3 @@ model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1", quantization_config=quant_config, device_map="auto"
)
```
## Continuous Batching
When serving LLMs for inference, you may have multiple requests arriving at different times. Continuous Batching (CB) is a technique that groups incoming requests into batches to maximize GPU utilization and throughput.
See the [Continuous Batching](./continuous_batching) guide for more details on how to use CB in transformers.

View File

@ -67,6 +67,6 @@ Examples of use can be found in the [example scripts](../examples) or [example n
[[autodoc]] data.data_collator.DataCollatorWithFlattening
## DataCollatorForMultipleChoice
# DataCollatorForMultipleChoice
[[autodoc]] data.data_collator.DataCollatorForMultipleChoice

View File

@ -267,7 +267,6 @@ about how many forward passes your inputs are actually going to trigger, you can
independently of the inputs. The caveats from the previous section still apply.
## Pipeline FP16 inference
Models can be run in FP16, which can be significantly faster on GPU while saving memory. Most models won't suffer noticeable performance loss from this. The larger the model, the less likely it is to.
To enable FP16 inference, you can simply pass `dtype=torch.float16` or `dtype='float16'` to the pipeline constructor. Note that this only works for models with a PyTorch backend. Your inputs will be converted to FP16 internally.
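A minimal sketch of that pattern, with a checkpoint chosen only for illustration:
```py
import torch
from transformers import pipeline

# the model weights are loaded in torch.float16 on the PyTorch backend
pipe = pipeline(task="text-generation", model="Qwen/Qwen2.5-0.5B-Instruct", dtype=torch.float16)
pipe("Plants generate energy through a process known as ")
```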
@ -335,7 +334,6 @@ Pipelines available for audio tasks include the following.
Pipelines available for computer vision tasks include the following.
### DepthEstimationPipeline
[[autodoc]] DepthEstimationPipeline
- __call__
- all

View File

@ -43,7 +43,6 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
[[autodoc]] AwqConfig
## EetqConfig
[[autodoc]] EetqConfig
## GPTQConfig

View File

@ -50,14 +50,14 @@ several advanced alignment methods which can be used to map between the original
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
## Multimodal Tokenizer
# Multimodal Tokenizer
Apart from that each tokenizer can be a "multimodal" tokenizer which means that the tokenizer will hold all relevant special tokens
as part of tokenizer attributes for easier access. For example, if the tokenizer is loaded from a vision-language model like LLaVA, you will
be able to access `tokenizer.image_token_id` to obtain the special image token used as a placeholder.
To enable extra special tokens for any type of tokenizer, you have to add the following lines and save the tokenizer. Extra special tokens do not
have to be modality related and can be anything that the model often needs access to. In the below code, tokenizer at `output_dir` will have direct access
have to be modality related and can ne anything that the model often needs access to. In the below code, tokenizer at `output_dir` will have direct access
to three more special tokens.
```python

View File

@ -23,7 +23,6 @@ The video processor extends the functionality of image processors by allowing Vi
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video-related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM; the processor will try to load video-related configurations from a file named `preprocessing_config.json`.
### Usage Example
Here's an example of how to load a video processor with [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:
```python

View File

@ -13,66 +13,51 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2024-11-21 and added to Hugging Face Transformers on 2025-07-08.*
*This model was released on 2024-11-21 and added to Hugging Face Transformers on 2025-07-08 and contributed by [yaswanthgali](https://huggingface.co/yaswanthgali).*
# AIMv2
## Overview
[AIMv2](https://huggingface.co/papers/2411.14402) presents a novel method for pre-training large-scale vision encoders in a multimodal setting, combining images and text. The model, characterized by a straightforward pre-training process and scalability, pairs a vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. AIMV2 excels in both multimodal evaluations and vision benchmarks such as localization, grounding, and classification. Notably, the AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk and outperforms state-of-the-art contrastive models like CLIP and SigLIP in multimodal image understanding across various settings.
The AIMv2 model was proposed in [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://huggingface.co/papers/2411.14402) by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
<hfoptions id="usage">
<hfoption id="Pipeline">
The abstract from the paper is the following:
```py
import torch
from transformers import pipeline
*We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.*
This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
The original code can be found [here](https://github.com/apple/ml-aim).
## Usage Example
Here is an example of Image Feature Extraction using specific checkpoints on resized images and native resolution images:
```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-native")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-native")
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
pipeline = pipeline(task="zero-shot-classification", model="apple/aimv2-large-patch14-native", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```
Here is an example of a checkpoint performing zero-shot classification:
</hfoption>
<hfoption id="AutoModel">
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["Picture of a dog.", "Picture of a cat.", "Picture of a horse."]
processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", dtype="auto")
inputs = processor(
images=image,
text=text,
add_special_tokens=True,
truncation=True,
padding=True,
return_tensors="pt",
)
inputs = processor(images=image, text=text, add_special_tokens=True, truncation=True, padding=True, return_tensors="pt",)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
pred_idx = torch.argmax(probs, dim=-1).item()
predicted_label = text[pred_idx]
print(f"Predicted label: {predicted_label}")
```
</hfoption>
</hfoptions>
## Aimv2Config
[[autodoc]] Aimv2Config
@ -99,3 +84,4 @@ probs = outputs.logits_per_image.softmax(dim=-1)
[[autodoc]] Aimv2TextModel
- forward

View File

@ -13,32 +13,17 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2019-09-26 and added to Hugging Face Transformers on 2020-11-16.*
*This model was released on 2019-09-26 and added to Hugging Face Transformers on 2020-11-16 and contributed by [lysandre](https://huggingface.co/lysandre).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white" >
<img alt="SDPA" src= "https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white" >
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# ALBERT
[ALBERT](https://huggingface.co/papers/1909.11942) is designed to address memory limitations of scaling and training of [BERT](./bert). It adds two parameter reduction techniques. The first, factorized embedding parametrization, splits the larger vocabulary embedding matrix into two smaller matrices so you can grow the hidden size without adding a lot more parameters. The second, cross-layer parameter sharing, allows layer to share parameters which keeps the number of learnable parameters lower.
ALBERT was created to address problems like -- GPU/TPU memory limitations, longer training times, and unexpected model degradation in BERT. ALBERT uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:
- **Factorized embedding parameterization:** The large vocabulary embedding matrix is decomposed into two smaller matrices, reducing memory consumption.
- **Cross-layer parameter sharing:** Instead of learning separate parameters for each transformer layer, ALBERT shares parameters across layers, further reducing the number of learnable weights.
ALBERT uses absolute position embeddings (like BERT), so padding is applied on the right. The embedding size is 128 while BERT uses 768. ALBERT can process a maximum of 512 tokens at a time.
You can find all the original ALBERT checkpoints under the [ALBERT community](https://huggingface.co/albert) organization.
> [!TIP]
> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different language tasks.
The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
[ALBERT](https://huggingface.co/papers/1909.11942) presents parameter-reduction techniques to enhance BERT by splitting the embedding matrix and using repeating layers. These methods reduce memory usage and training time, enabling better scalability. The model employs a self-supervised loss to improve inter-sentence coherence, achieving state-of-the-art results on GLUE, RACE, and SQuAD benchmarks with fewer parameters than BERT-large.
<hfoptions id="usage">
<hfoption id="Pipeline">
@ -47,13 +32,8 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
import torch
from transformers import pipeline
pipeline = pipeline(
task="fill-mask",
model="albert-base-v2",
dtype=torch.float16,
device=0
)
pipeline("Plants create [MASK] through a process known as photosynthesis.", top_k=5)
pipeline = pipeline(task="fill-mask", model="albert/albert-base-v2", dtype="auto")
pipeline("Plants create [MASK] through a process known as photosynthesis.")
```
</hfoption>
@ -63,76 +43,25 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.", top_
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("albert/albert-base-v2", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AutoModelForMaskedLM.from_pretrained(
"albert/albert-base-v2",
dtype=torch.float16,
attn_implementation="sdpa",
device_map="auto"
)
prompt = "Plants create energy through a process known as [MASK]."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model(**inputs)
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
predictions = outputs.logits[0, mask_token_index]
top_k = torch.topk(predictions, k=5).indices.tolist()
for token_id in top_k[0]:
print(f"Prediction: {tokenizer.decode([token_id])}")
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
outputs = model(**inputs)
mask_token_id = tokenizer.mask_token_id
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
print(f"Predicted word: {predicted_word}")
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model albert-base-v2 --device 0
```
</hfoption>
</hfoptions>
## Notes
## Usage tips
- Inputs should be padded on the right because BERT uses absolute position embeddings.
- The embedding size `E` is different from the hidden size `H` because the embeddings are context independent (one embedding vector represents one token) and the hidden states are context dependent (one hidden state represents a sequence of tokens). The embedding matrix is also larger because `V x E` where `V` is the vocabulary size. As a result, it's more logical if `H >> E`. If `E < H`, the model has less parameters.
- ALBERT uses absolute position embeddings. Pad inputs on the right, not the left.
## Resources
The resources provided in the following sections consist of a list of official Hugging Face and community (indicated by 🌎) resources to help you get started with AlBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
<PipelineTag pipeline="text-classification"/>
- [`AlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification).
- Check the [Text classification task guide](../tasks/sequence_classification) on how to use the model.
<PipelineTag pipeline="token-classification"/>
- [`AlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification).
- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Token classification task guide](../tasks/token_classification) on how to use the model.
<PipelineTag pipeline="fill-mask"/>
- [`AlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
- [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Masked language modeling task guide](../tasks/masked_language_modeling) on how to use the model.
<PipelineTag pipeline="question-answering"/>
- [`AlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
- [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Question answering task guide](../tasks/question_answering) on how to use the model.
**Multiple choice**
- [`AlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb).
- Check the [Multiple choice task guide](../tasks/multiple_choice) on how to use the model.
- The embedding size E differs from hidden size H for good reason. Embeddings represent individual tokens (context-independent). Hidden states represent token sequences (context-dependent). This makes H >> E logical. The embedding matrix spans V × E dimensions, where V is vocabulary size. Keeping E < H reduces parameter count.
## AlbertConfig
@ -140,7 +69,11 @@ The resources provided in the following sections consist of a list of official H
## AlbertTokenizer
[[autodoc]] AlbertTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary
[[autodoc]] AlbertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## AlbertTokenizerFast
@ -152,19 +85,23 @@ The resources provided in the following sections consist of a list of official H
## AlbertModel
[[autodoc]] AlbertModel - forward
[[autodoc]] AlbertModel
- forward
## AlbertForPreTraining
[[autodoc]] AlbertForPreTraining - forward
[[autodoc]] AlbertForPreTraining
- forward
## AlbertForMaskedLM
[[autodoc]] AlbertForMaskedLM - forward
[[autodoc]] AlbertForMaskedLM
- forward
## AlbertForSequenceClassification
[[autodoc]] AlbertForSequenceClassification - forward
[[autodoc]] AlbertForSequenceClassification
- forward
## AlbertForMultipleChoice
@ -172,8 +109,10 @@ The resources provided in the following sections consist of a list of official H
## AlbertForTokenClassification
[[autodoc]] AlbertForTokenClassification - forward
[[autodoc]] AlbertForTokenClassification
- forward
## AlbertForQuestionAnswering
[[autodoc]] AlbertForQuestionAnswering - forward
[[autodoc]] AlbertForQuestionAnswering
- forward

View File

@ -13,46 +13,21 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-02-11 and added to Hugging Face Transformers on 2023-03-01.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="Transformers" src="https://img.shields.io/badge/Transformers-6B5B95?style=flat&logo=transformers&logoColor=white">
</div>
</div>
*This model was released on 2021-02-11 and added to Hugging Face Transformers on 2023-03-01 and contributed by [adirik](https://huggingface.co/adirik).*
# ALIGN
[ALIGN](https://huggingface.co/papers/2102.05918) is pretrained on a noisy 1.8 billion alt-text and image pair dataset to show that scale can make up for the noise. It uses a dual-encoder architecture, [EfficientNet](./efficientnet) for images and [BERT](./bert) for text, and a contrastive loss to align similar image-text embeddings together while pushing different embeddings apart. Once trained, ALIGN can encode any image and candidate captions into a shared vector space for zero-shot retrieval or classification without requiring extra labels. This scale-first approach reduces dataset curation costs and powers state-of-the-art image-text retrieval and zero-shot ImageNet classification.
You can find all the original ALIGN checkpoints under the [Kakao Brain](https://huggingface.co/kakaobrain?search_models=align) organization.
> [!TIP]
> Click on the ALIGN models in the right sidebar for more examples of how to apply ALIGN to different vision and text related tasks.
The example below demonstrates zero-shot image classification with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">
[ALIGN](https://huggingface.co/papers/2102.05918) is a multi-modal vision and language model utilizing a dual-encoder architecture with EfficientNet for vision and BERT for text. It employs contrastive learning to align visual and text representations using a noisy dataset of over one billion image-alt text pairs. Despite the noise, the scale of the dataset enables state-of-the-art performance in image classification and image-text retrieval tasks, surpassing more complex models.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline
pipeline = pipeline(
task="zero-shot-image-classification",
model="kakaobrain/align-base",
device=0,
dtype=torch.bfloat16
)
candidate_labels = [
"a photo of a dog",
"a photo of a cat",
"a photo of a person"
]
pipeline = pipeline(task="zero-shot-image-classification", model="kakaobrain/align-base", dtype="auto")
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
```
@ -66,7 +41,7 @@ from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base", device_map="auto")
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base", dtype="auto")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = requests.get(url, stream=True)
@ -92,65 +67,8 @@ for label, score in zip(candidate_labels, probs):
```
</hfoption>
</hfoptions>
## Notes
- ALIGN projects the text and visual features into latent space and the dot product between the projected image and text features is used as the similarity score. The example below demonstrates how to calculate the image-text similarity score with [`AlignProcessor`] and [`AlignModel`].
```py
# Example of using ALIGN for image-text similarity
from transformers import AlignProcessor, AlignModel
import torch
from PIL import Image
import requests
from io import BytesIO
# Load processor and model
processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")
# Download image from URL
url = "https://huggingface.co/roschmid/dog-races/resolve/main/images/Golden_Retriever.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)) # Convert the downloaded bytes to a PIL Image
texts = ["a photo of a cat", "a photo of a dog"]
# Process image and text inputs
inputs = processor(images=image, text=texts, return_tensors="pt")
# Get the embeddings
with torch.no_grad():
outputs = model(**inputs)
image_embeds = outputs.image_embeds
text_embeds = outputs.text_embeds
# Normalize embeddings for cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=1, keepdim=True)
# Calculate similarity scores
similarity_scores = torch.matmul(text_embeds, image_embeds.T)
# Print raw scores
print("Similarity scores:", similarity_scores)
# Convert to probabilities
probs = torch.nn.functional.softmax(similarity_scores, dim=0)
print("Probabilities:", probs)
# Get the most similar text
most_similar_idx = similarity_scores.argmax().item()
print(f"Most similar text: '{texts[most_similar_idx]}'")
```
## Resources
- Refer to the [Kakao Brains Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.
## AlignConfig
[[autodoc]] AlignConfig
@ -183,3 +101,4 @@ for label, score in zip(candidate_labels, probs):
[[autodoc]] AlignVisionModel
- forward

View File

@ -13,35 +13,37 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-01-04.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-01-04 and contributed by [jongjyh](https://huggingface.co/jongjyh).*
# AltCLIP
[AltCLIP](https://huggingface.co/papers/2211.06679) replaces the [CLIP](./clip) text encoder with a multilingual XLM-R encoder and aligns image and text representations with teacher learning and contrastive learning.
[AltCLIP](https://huggingface.co/papers/2211.06679v2) alters the text encoder in CLIP by replacing it with a pretrained multilingual text encoder XLM-R. This modification enables the model to achieve state-of-the-art performance on tasks such as ImageNet-CN, Flicker30k-CN, and COCO-CN, while maintaining performance close to CLIP on other tasks. The approach involves a two-stage training schema with teacher learning and contrastive learning to align language and image representations, extending CLIP's capabilities to multilingual understanding.
You can find all the original AltCLIP checkpoints under the [AltClip](https://huggingface.co/collections/BAAI/alt-clip-diffusion-66987a97de8525205f1221bf) collection.
> [!TIP]
> Click on the AltCLIP models in the right sidebar for more examples of how to apply AltCLIP to different tasks.
The examples below demonstrates how to calculate similarity scores between an image and one or more captions with the [`AutoModel`] class.
This model was contributed by [jongjyh](https://huggingface.co/jongjyh).
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline
pipeline = pipeline(task="zero-shot-image-classification", model="kakaobrain/align-base", dtype="auto")
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
```
</hfoption>
<hfoption id="AutoModel">
```python
```py
import torch
import requests
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor
from transformers import AltCLIPModel, AutoProcessor
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype=torch.bfloat16)
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype="auto")
processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
@ -49,8 +51,8 @@ image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
labels = ["a photo of a cat", "a photo of a dog"]
for label, prob in zip(labels, probs[0]):
@ -60,48 +62,10 @@ for label, prob in zip(labels, probs[0]):
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
```python
# !pip install torchao
import torch
import requests
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor, TorchAoConfig
model = AltCLIPModel.from_pretrained(
"BAAI/AltCLIP",
quantization_config=TorchAoConfig("int4_weight_only", group_size=128),
dtype=torch.bfloat16,
)
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
labels = ["a photo of a cat", "a photo of a dog"]
for label, prob in zip(labels, probs[0]):
print(f"{label}: {prob.item():.4f}")
```
## Notes
- AltCLIP uses bidirectional attention instead of causal attention and it uses the `[CLS]` token in XLM-R to represent a text embedding.
- Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model.
- [`AltCLIPProcessor`] combines [`CLIPImageProcessor`] and [`XLMRobertaTokenizer`] into a single instance to encode text and prepare images.
## AltCLIPConfig
[[autodoc]] AltCLIPConfig
- from_text_vision_configs
## AltCLIPTextConfig
@ -111,18 +75,24 @@ for label, prob in zip(labels, probs[0]):
[[autodoc]] AltCLIPVisionConfig
## AltCLIPProcessor
[[autodoc]] AltCLIPProcessor
## AltCLIPModel
[[autodoc]] AltCLIPModel
- forward
- get_text_features
- get_image_features
## AltCLIPTextModel
[[autodoc]] AltCLIPTextModel
- forward
## AltCLIPVisionModel
[[autodoc]] AltCLIPVisionModel
- forward
## AltCLIPProcessor
[[autodoc]] AltCLIPProcessor

View File

@ -13,28 +13,20 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-09-02 and added to Hugging Face Transformers on 2025-08-28.*
# Apertus
*This model was released on 2025-09-02 and added to Hugging Face Transformers on 2025-10-07.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
</div>
</div>
## Overview
# Apertus
[Apertus](https://www.swiss-ai.org) is a family of large language models from the Swiss AI Initiative.
> [!TIP]
> Coming soon
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
<hfoptions id="usage">
<hfoption id="Pipeline">
@ -42,13 +34,8 @@ The example below demonstrates how to generate text with [`Pipeline`] or the [`A
import torch
from transformers import pipeline
pipeline = pipeline(
task="text-generation",
model="swiss-ai/Apertus-8B",
dtype=torch.bfloat16,
device=0
)
pipeline("Plants create energy through a process known as")
pipeline = pipeline(task="text-generation", model="swiss-ai/Apertus-8B", dtype="auto")
pipeline("Plants generate energy through a process known as ")
```
</hfoption>
@ -56,28 +43,15 @@ pipeline("Plants create energy through a process known as")
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained(
"swiss-ai/Apertus-8B",
)
model = AutoModelForCausalLM.from_pretrained(
"swiss-ai/Apertus-8B",
dtype=torch.bfloat16,
device_map="auto",
attn_implementation="sdpa"
)
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("swiss-ai/Apertus-8B")
model = AutoModelForCausalLM.from_pretrained("swiss-ai/Apertus-8B", dtype="auto")
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model swiss-ai/Apertus-8B --device 0
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
</hfoption>

View File

@ -17,7 +17,6 @@ rendered properly in your Markdown viewer.
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
@ -29,11 +28,6 @@ rendered properly in your Markdown viewer.
The Arcee model is architecturally similar to Llama but uses `x * relu(x)` in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
> [!TIP]
> The Arcee model supports extended context with RoPE scaling and all standard transformers features including Flash Attention 2, SDPA, gradient checkpointing, and quantization support.
The example below demonstrates how to generate text with Arcee using [`Pipeline`] or the [`AutoModel`].
<hfoptions id="usage">
<hfoption id="Pipeline">
@ -41,15 +35,8 @@ The example below demonstrates how to generate text with Arcee using [`Pipeline`
import torch
from transformers import pipeline
pipeline = pipeline(
task="text-generation",
model="arcee-ai/AFM-4.5B",
dtype=torch.float16,
device=0
)
output = pipeline("The key innovation in Arcee is")
print(output[0]["generated_text"])
pipeline = pipeline(task="text-generation", model="arcee-ai/AFM-4.5B", dtype="auto")
pipeline("Plants generate energy through a process known as ")
```
</hfoption>
@ -57,16 +44,12 @@ print(output[0]["generated_text"])
```py
import torch
from transformers import AutoTokenizer, ArceeForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/AFM-4.5B")
model = ArceeForCausalLM.from_pretrained(
"arcee-ai/AFM-4.5B",
dtype=torch.float16,
device_map="auto"
)
model = ArceeForCausalLM.from_pretrained("arcee-ai/AFM-4.5B", dtype="auto")
inputs = tokenizer("The key innovation in Arcee is", return_tensors="pt")
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
@ -102,4 +85,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
## ArceeForTokenClassification
[[autodoc]] ArceeForTokenClassification
- forward
- forward

View File

@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06.*
*This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06 and contributed by [m-ric](https://huggingface.co/m-ric).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
@ -25,48 +24,27 @@ rendered properly in your Markdown viewer.
# Aria
[Aria](https://huggingface.co/papers/2410.05993) is a multimodal mixture-of-experts (MoE) model. The goal of this model is to open-source a training recipe for creating a multimodal native model from scratch. Aria has 3.9B and 3.5B activated parameters per visual and text token respectively. Text is handled by a MoE decoder and visual inputs are handled by a lightweight visual encoder. It is trained in 4 stages, language pretraining, multimodal pretraining, multimodal long-context pretraining, and multimodal post-training.
You can find all the original Aria checkpoints under the [Aria](https://huggingface.co/rhymes-ai?search_models=aria) organization.
> [!TIP]
> Click on the Aria models in the right sidebar for more examples of how to apply Aria to different multimodal tasks.
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
[Aria](https://huggingface.co/papers/2410.05993) is an open multimodal-native model designed to integrate diverse information sources and deliver comprehensive understanding. It employs a Mixture-of-Experts architecture with 3.9B and 3.5B activated parameters per visual and text token, respectively. Aria outperforms models like Pixtral-12B and Llama3.2-11B across various multimodal, language, and coding tasks. The model is pre-trained through a 4-stage pipeline that enhances language understanding, multimodal capabilities, long context handling, and instruction following. Aria's weights and codebase are open-sourced to facilitate adoption and adaptation in real-world applications.
<hfoptions id="usage">
<hfoption id="Pipeline">
```python
```py
import torch
from transformers import pipeline
pipeline = pipeline(
"image-to-text",
model="rhymes-ai/Aria",
device=0,
dtype=torch.bfloat16
)
pipeline(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
text="What is shown in this image?"
)
pipeline = pipeline(task="image-to-text", model="rhymes-ai/Aria", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", text="What is shown in this image?")
```
</hfoption>
<hfoption id="AutoModel">
```python
```py
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
model = AutoModelForCausalLM.from_pretrained(
"rhymes-ai/Aria",
device_map="auto",
dtype=torch.bfloat16,
attn_implementation="sdpa"
)
model = AutoModelForCausalLM.from_pretrained("rhymes-ai/Aria", dtype="auto")
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")
messages = [
@ -81,8 +59,7 @@ messages = [
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
inputs = inputs.to(model.device, torch.bfloat16)
output = model.generate(
**inputs,
output = model.generate(**inputs,
max_new_tokens=15,
stop_strings=["<|im_end|>"],
tokenizer=processor.tokenizer,
@ -97,51 +74,6 @@ print(response)
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4 and the [rhymes-ai/Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) checkpoint. This checkpoint replaces grouped GEMM with `torch.nn.Linear` layers for easier quantization.
```py
# pip install torchao
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoProcessor
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
"rhymes-ai/Aria-sequential_mlp",
dtype=torch.bfloat16,
device_map="auto",
quantization_config=quantization_config
)
processor = AutoProcessor.from_pretrained(
"rhymes-ai/Aria-sequential_mlp",
)
messages = [
{
"role": "user", "content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
{"type": "text", "text": "What is shown in this image?"},
]
},
]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
inputs = inputs.to(model.device, torch.bfloat16)
output = model.generate(
**inputs,
max_new_tokens=15,
stop_strings=["<|im_end|>"],
tokenizer=processor.tokenizer,
do_sample=True,
temperature=0.9,
)
output_ids = output[0][inputs["input_ids"].shape[1]:]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
```
## AriaImageProcessor
[[autodoc]] AriaImageProcessor
@ -162,15 +94,17 @@ print(response)
[[autodoc]] AriaTextModel
## AriaModel
[[autodoc]] AriaModel
## AriaTextForCausalLM
[[autodoc]] AriaTextForCausalLM
## AriaModel
[[autodoc]] AriaModel
- forward
## AriaForConditionalGeneration
[[autodoc]] AriaForConditionalGeneration
- forward

View File

@ -13,82 +13,55 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-04-05 and added to Hugging Face Transformers on 2022-11-21.*
*This model was released on 2021-04-05 and added to Hugging Face Transformers on 2022-11-21 and contributed by [nielsr](https://huggingface.co/nielsr).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# Audio Spectrogram Transformer
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
[Audio Spectrogram Transformer](https://huggingface.co/papers/2104.01778) applies a Vision Transformer to audio by converting audio into spectrograms, achieving state-of-the-art results in audio classification without using convolutional layers. It outperforms existing models on benchmarks like AudioSet, ESC-50, and Speech Commands V2, demonstrating the effectiveness of purely attention-based models in this domain.
## Overview
The Audio Spectrogram Transformer model was proposed in [AST: Audio Spectrogram Transformer](https://huggingface.co/papers/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
The Audio Spectrogram Transformer applies a [Vision Transformer](vit) to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results
for audio classification.
The abstract from the paper is the following:
*In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/audio_spectogram_transformer_architecture.png"
alt="drawing" width="600"/>
<small> Audio Spectrogram Transformer architecture. Taken from the <a href="https://huggingface.co/papers/2104.01778">original paper</a>.</small>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/YuanGongND/ast).
## Usage tips
- When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, normalize the input so it has a mean of 0 and a std of 0.5. [`ASTFeatureExtractor`] handles this and uses the AudioSet mean and std by default. Check [`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py) to see how the authors compute the stats for a downstream dataset; the sketch after these tips shows how to override the defaults.
- Note that the AST needs a low learning rate (the authors use a 10 times smaller learning rate compared to their CNN model proposed in the
[PSLA paper](https://huggingface.co/papers/2102.01243)) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task.
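The snippet below is a minimal sketch of overriding the default AudioSet statistics with values computed for your own dataset; the mean and std numbers are illustrative placeholders, not recommended values.
```py
import numpy as np
from transformers import ASTFeatureExtractor

# Override the default AudioSet mean/std with dataset-specific statistics
# (placeholder values below; compute real ones as in get_norm_stats.py).
feature_extractor = ASTFeatureExtractor.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    mean=-6.85,
    std=5.42,
)

waveform = np.zeros(16000, dtype=np.float32)  # stand-in for 1 second of audio at 16 kHz
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)  # (1, max_length, num_mel_bins)
```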
### Using Scaled Dot Product Attention (SDPA)
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
from transformers import ASTForAudioClassification
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", dtype=torch.float16)
...
import torch
from transformers import pipeline
pipeline = pipeline(task="audio-classification",model="MIT/ast-finetuned-audioset-10-10-0.4593", dtype="auto")
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
</hfoption>
<hfoption id="AutoModel"
On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and `MIT/ast-finetuned-audioset-10-10-0.4593` model, we saw the following speedups during inference.
```py
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa mode | Speed up, Sdpa / Eager (x) |
|--------------|-------------------------------------------|-------------------------------------------|------------------------------|
| 1 | 27 | 6 | 4.5 |
| 2 | 12 | 6 | 2 |
| 4 | 21 | 8 | 2.62 |
| 8 | 40 | 14 | 2.86 |
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation").sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
## Resources
feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with the Audio Spectrogram Transformer.
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
<PipelineTag pipeline="audio-classification"/>
with torch.no_grad():
logits = model(**inputs).logits
- A notebook illustrating inference with AST for audio classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/AST).
- [`ASTForAudioClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb).
- See also: [Audio classification](../tasks/audio_classification).
predicted_class_ids = torch.argmax(logits, dim=-1).item()
print(f"Predicted label: {model.config.id2label[predicted_class_ids]}")
```
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
</hfoption>
</hfoptions>
## ASTConfig
@ -108,3 +81,4 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] ASTForAudioClassification
- forward
View File
@ -1,402 +0,0 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-07-10 and added to Hugging Face Transformers on 2025-11-11.*
# Audio Flamingo 3
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
## Overview
Audio Flamingo 3 (AF3) is a fully open large audio-language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio-text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.
The model checkpoint is available at: [nvidia/audio-flamingo-3-hf](https://huggingface.co/nvidia/audio-flamingo-3-hf)
Highlights:
- Unified audio encoder across speech, sound, and music.
- **Long-audio support via windowing and post-pool alignment (up to 10 minutes maximum).** The model processes audio in 30-second windows with a hard limit of 20 windows (10 minutes total). Audio longer than 10 minutes will be truncated.
- Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.
This model was contributed by [Lasha Koroshinadze](https://huggingface.co/lashahub) and [Eric Bezzam](https://huggingface.co/bezzam).
### Paper
[Audio Flamingo 3](https://huggingface.co/papers/2507.08128): Advancing Audio Intelligence with Fully Open Large Audio Language Models
A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, B. Catanzaro
NVIDIA and University of Maryland
Project: https://research.nvidia.com/labs/adlr/AF3/
## Usage
### Audio Instruct Mode
The model supports audio-text instructions, including multi-turn interactions, all processed in batches.
➡️ audio + text instruction
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the input speech."},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
],
}
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ multi-turn:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Instruction: How does the tone of female speech change throughout the audio? Choose the correct option among the options below: (A) Sad to happy (B) Happy to sad (C) Neutral to happy (D) Happy to neutral.",
},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/000000786159.31.wav"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "(A) Sad to happy"}],
},
{
"role": "user",
"content": [
{"type": "text", "text": "Why do you think so?"},
],
},
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ text only:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "What is the capital of France?"},
],
}
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ audio only:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversation = [
{
"role": "user",
"content": [
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
],
}
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ batched inference!
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
conversations = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the input speech."},
{
"type": "audio",
"path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "text",
"text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
],
}
],
]
inputs = processor.apply_chat_template(
conversations,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
➡️ Training:
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()
conversation = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the input speech."},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "The transcription of the audio is 'summer follows spring the days grow longer and the nights are warm'."}],
}
],
[
{
"role": "user",
"content": [
{
"type": "text",
"text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
},
{"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": "The transcription of the audio is 'some transcription of the audio'."}],
}
]
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
output_labels=True,
).to(model.device)
loss = model(**inputs).loss
loss.backward()
```
➡️ transcription shortcut
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor
model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
inputs = processor.apply_transcription_request(audio="https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, strip_prefix=True)
print(decoded_outputs)
```
The model is trained to emit transcriptions prefixed with assistant framing such as `The spoken content of the audio is "<text>".`. Use `strip_prefix=True` (as shown above) to remove the fixed assistant sentence and surrounding quotes so that only the transcription remains.
## How the model works
### Architecture
* **AudioFlamingo3Encoder**
Whisper-style feature extractor + encoder → average-pool over time (stride 2) → LayerNorm.
Produces per-frame hidden states at the post-pool rate.
* **AudioFlamingo3MultiModalProjector**
A small MLP that maps encoder features to the language model's hidden size.
* **AudioFlamingo3ForConditionalGeneration**
A causal language model that accepts text embeddings where each audio placeholder token slot is replaced, in place, by an audio frame embedding. No sequence-length change is introduced by fusion.
### Processor-level alignment
1. Each raw waveform is split into fixed-length windows based on the feature extractor's `chunk_length` (seconds) and `sampling_rate` (Hz).
2. For each window, the processor computes the number of post-pool frames `post_pool_len` that the encoder will output (matching the conv/pool schedule).
3. The processor expands the audio placeholder token by the total number of post-pool frames across all windows.
4. The model later replaces those token positions with the corresponding projected audio embeddings.
## Usage patterns
### Transcription shortcut
For automatic speech recognition you can skip writing the default instruction each time and call
[`~transformers.AudioFlamingo3Processor.apply_transcription_request`]:
```python
inputs = processor.apply_transcription_request(audio=audio_array)
```
Pass `prompt="Transcribe the input speech."` (or a list of prompts for batch audio) to customize the instruction while
keeping the audio placeholder handling.
`audio` accepts in-memory arrays, local file paths, or URLs. Any processor kwargs (`text_kwargs`, `audio_kwargs`, etc.)
are forwarded, so you can tweak padding or tensor formats just like when calling `processor(...)`.
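A short sketch combining these options; the prompt string matches the default instruction and the audio URL is reused from the examples above.
```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Custom instruction plus a URL input; other processor kwargs (text_kwargs, audio_kwargs)
# are forwarded the same way as when calling processor(...) directly.
inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3",
    prompt="Transcribe the input speech.",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, strip_prefix=True)
print(decoded_outputs)
```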
## Long audio and windowing
**Important: Maximum audio length is 10 minutes.** Audio longer than this will be truncated.
* The default setup processes 30-second windows at 16 kHz mono.
* **The processor enforces a hard limit of 20 windows per sample, resulting in a maximum of 10 minutes of audio (20 windows × 30 seconds).**
* For each window:
* `mel_len` is the padded mel length.
* A conv stack reduces time as `conv_output_len = (mel_len - 1) // 2 + 1`.
* Post-pool frames per window: `post_pool_len = (conv_output_len - 2) // 2 + 1`.
* An audio placeholder token is expanded to the sum of `post_pool_len` across all windows; the sketch below walks through this arithmetic.
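A minimal sketch of the per-window math, assuming the Whisper-style default of 100 mel frames per second so a full 30-second window has `mel_len = 3000`; the frames-per-second assumption is illustrative rather than read from the processor config.
```python
# Per-window frame math, assuming 100 mel frames per second (Whisper-style
# features) so a full 30-second window has mel_len = 3000.
def post_pool_frames(mel_len: int) -> int:
    conv_output_len = (mel_len - 1) // 2 + 1  # conv stack halves the time axis
    return (conv_output_len - 2) // 2 + 1     # average pooling with stride 2

frames_per_window = post_pool_frames(3000)
print(frames_per_window)       # 750 post-pool frames per full window
print(20 * frames_per_window)  # 15000 placeholder tokens at the 10-minute cap
```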
## Padding, attention, and caching
* **Left padding vs right padding**
For generation with mixed prompt lengths in a batch, left padding is usually preferable.
For training, right padding is common; AF3's fusion mechanism itself is padding-agnostic because it replaces in place.
* **Attention masks**
The processor returns `attention_mask` (text) and `input_features_mask` (audio). The model builds an internal 4-D mask on the encoder's pre-pool axis with negative infinity at pad positions.
* **Caching**
During generation, `input_features` and `input_features_mask` are only passed on the first step. Subsequent steps use cached keys/values from the language model.
## Troubleshooting
* Empty or truncated outputs when batching
Use left padding for batched generation and decode only the new tokens after the prompt length, as shown in the quickstart.
## AudioFlamingo3Config
[[autodoc]] AudioFlamingo3Config
## AudioFlamingo3EncoderConfig
[[autodoc]] AudioFlamingo3EncoderConfig
## AudioFlamingo3Processor
[[autodoc]] AudioFlamingo3Processor
## AudioFlamingo3Encoder
[[autodoc]] AudioFlamingo3Encoder
- forward
## AudioFlamingo3ForConditionalGeneration
[[autodoc]] AudioFlamingo3ForConditionalGeneration
- forward
View File
@ -29,7 +29,7 @@ model = AutoModel.from_pretrained("google-bert/bert-base-cased")
will create a model that is an instance of [`BertModel`].
There is one class of `AutoModel` for each task.
There is one class of `AutoModel` for each task, and for each backend (PyTorch, TensorFlow, or Flax).
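A minimal sketch of a task-specific variant; the checkpoint below is illustrative, not part of this page.
```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Task-specific auto classes resolve the right architecture and head from the checkpoint config.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("The battery lasts two full days on a single charge.", return_tensors="pt")
prediction = model(**inputs).logits.argmax(-1).item()
print(model.config.id2label[prediction])  # POSITIVE or NEGATIVE for this sentiment checkpoint
```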
## Extending the Auto Classes
@ -48,7 +48,7 @@ You will then be able to use the auto classes like you would usually do!
<Tip warning={true}>
If your `NewModelConfig` is a subclass of [`~transformers.PreTrainedConfig`], make sure its
If your `NewModelConfig` is a subclass of [`~transformers.PretrainedConfig`], make sure its
`model_type` attribute is set to the same key you use when registering the config (here `"new-model"`).
Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
@ -73,14 +73,14 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
[[autodoc]] AutoImageProcessor
## AutoVideoProcessor
[[autodoc]] AutoVideoProcessor
## AutoProcessor
[[autodoc]] AutoProcessor
## AutoVideoProcessor
[[autodoc]] AutoVideoProcessor
## Generic model classes
The following auto classes are available for instantiating a base model class without a specific head.
@ -161,10 +161,6 @@ The following auto classes are available for the following computer vision tasks
[[autodoc]] AutoModelForKeypointDetection
### AutoModelForKeypointMatching
[[autodoc]] AutoModelForKeypointMatching
### AutoModelForMaskedImageModeling
[[autodoc]] AutoModelForMaskedImageModeling
@ -201,6 +197,10 @@ The following auto classes are available for the following computer vision tasks
[[autodoc]] AutoModelForZeroShotObjectDetection
### AutoModelForKeypointMatching
[[autodoc]] AutoModelForKeypointMatching
## Audio
The following auto classes are available for the following audio tasks.
@ -261,8 +261,6 @@ The following auto classes are available for the following multimodal tasks.
[[autodoc]] AutoModelForImageTextToText
## Time Series
### AutoModelForTimeSeriesPrediction
[[autodoc]] AutoModelForTimeSeriesPrediction
View File
@ -13,32 +13,39 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30.*
*This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30 and contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif).*
# Autoformer
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
[Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://huggingface.co/papers/2106.13008) addresses the challenge of long-term time series forecasting by introducing a novel decomposition architecture. Autoformer integrates an Auto-Correlation mechanism that progressively decomposes trend and seasonal components, enhancing the model's ability to capture intricate temporal patterns. This approach surpasses traditional self-attention methods in both efficiency and accuracy, achieving state-of-the-art results with a 38% relative improvement across six benchmarks in diverse applications including energy, traffic, economics, weather, and disease forecasting.
## Overview
<hfoptions id="usage">
<hfoption id="AutoformerForPrediction">
The Autoformer model was proposed in [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://huggingface.co/papers/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
```py
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoformerForPrediction
This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.
file = hf_hub_download(
repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
)
batch = torch.load(file)
The abstract from the paper is the following:
model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly", dtype="auto")
outputs = model.generate(
past_values=batch["past_values"],
past_time_features=batch["past_time_features"],
past_observed_mask=batch["past_observed_mask"],
static_categorical_features=batch["static_categorical_features"],
future_time_features=batch["future_time_features"],
)
*Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.*
mean_prediction = outputs.sequences.mean(dim=1)
```
This model was contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif).
The original code can be found [here](https://github.com/thuml/Autoformer).
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- Check out the Autoformer blog-post in HuggingFace blog: [Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)](https://huggingface.co/blog/autoformer)
</hfoption>
</hfoptions>
## AutoformerConfig
@ -53,3 +60,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] AutoformerForPrediction
- forward
View File
@ -13,250 +13,64 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-05-13 and added to Hugging Face Transformers on 2025-03-04.*
*This model was released on 2025-05-13 and added to Hugging Face Transformers on 2025-03-04 and contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# AyaVision
# Aya Vision
[Aya Vision](https://huggingface.co/papers/2505.08751) is a family of open-weight multimodal vision-language models from Cohere Labs. It is trained with a synthetic annotation framework that generates high-quality multilingual image captions, improving Aya Vision's generated responses. In addition, a cross-modal model merging technique is used to prevent the model from losing its text capabilities after adding vision capabilities. The model combines a CommandR-7B language model with a SigLIP vision encoder.
You can find all the original Aya Vision checkpoints under the [Aya Vision](https://huggingface.co/collections/CohereLabs/cohere-labs-aya-vision-67c4ccd395ca064308ee1484) collection.
> [!TIP]
> This model was contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan).
>
> Click on the Aya Vision models in the right sidebar for more examples of how to apply Aya Vision to different image-to-text tasks.
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
[Aya Vision](https://huggingface.co/papers/2505.08751) introduces two key innovations for multilingual multimodal learning: a synthetic annotation framework that generates high-quality, diverse instruction data across languages, and a cross-modal model merging technique that prevents catastrophic forgetting while preserving strong text-only performance. These methods enable effective alignment between vision and language without degrading existing capabilities. Aya-Vision-8B surpasses comparable models like Qwen-2.5-VL-7B, Pixtral-12B, and even larger models such as Llama-3.2-90B-Vision, while the larger Aya-Vision-32B outperforms models more than twice its size, including Molmo-72B. Overall, the approach demonstrates efficient scaling and state-of-the-art multilingual multimodal performance with reduced computational demands.
<hfoptions id="usage">
<hfoption id="Pipeline">
```python
```py
import torch
from transformers import pipeline
pipe = pipeline(model="CohereLabs/aya-vision-8b", task="image-text-to-text", device_map="auto")
# Format message with the aya-vision chat template
pipeline = pipeline(task="image-text-to-text", model="CohereLabs/aya-vision-8b", dtype="auto")
messages = [
{"role": "user",
"content": [
{"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
{"type": "text", "text": "Bu resimde hangi anıt gösterilmektedir?"},
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
{"type": "text", "text": "Que montre cette image?"},
]},
]
outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
print(outputs)
]
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
```
</hfoption>
<hfoption id="AutoModel">
```python
# pip install 'git+https://github.com/huggingface/transformers.git@v4.49.0-Aya Vision'
```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
model_id = "CohereLabs/aya-vision-8b"
processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-8b)
model = AutoModelForImageTextToText.from_pretrained("CohereLabs/aya-vision-8b", dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, device_map="auto", dtype=torch.float16
)
# Format message with the aya-vision chat template
messages = [
{"role": "user",
"content": [
{"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
{"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
{"type": "text", "text": "Que montre cette image?"},
]},
]
]
inputs = processor.apply_chat_template(
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
)
gen_tokens = model.generate(
outputs = model.generate(
**inputs,
max_new_tokens=300,
do_sample=True,
temperature=0.3,
)
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
print(processor.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
</hfoption>
</hfoptions>
Quantization reduces the memory footprint of large models by representing weights at lower precision. Refer to the [Quantization](../quantization/overview) overview for supported backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.
```python
import torch
from transformers import (
AutoProcessor,
AutoModelForImageTextToText,
BitsAndBytesConfig
)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-32b", use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(
"CohereLabs/aya-vision-32b",
quantization_config=bnb_config,
device_map="auto"
)
inputs = processor.apply_chat_template(
[
{"role": "user", "content": [
{"type": "image", "url": "https://huggingface.co/roschmid/dog-races/resolve/main/images/Border_Collie.jpg"},
{"type": "text", "text":"Describe what you see."}
]}
],
padding=True,
add_generation_prompt=True,
tokenize=True,
return_tensors="pt"
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=50)
print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
```
## Notes
- Images are represented with the `<image>` tag in the chat template.
- Use the [`~ProcessorMixin.apply_chat_template`] method to correctly format inputs.
- The example below demonstrates inference with multiple images.
```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
model = AutoModelForImageTextToText.from_pretrained(
"CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
},
{
"type": "image",
"url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
},
{
"type": "text",
"text": "These images depict two different landmarks. Can you identify them?",
},
],
},
]
inputs = processor.apply_chat_template(
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
gen_tokens = model.generate(
**inputs,
max_new_tokens=300,
do_sample=True,
temperature=0.3,
)
gen_text = processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(gen_text)
```
- The example below demonstrates inference with batched inputs.
```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
model = AutoModelForImageTextToText.from_pretrained(
"CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
)
batch_messages = [
[
{
"role": "user",
"content": [
{"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
{"type": "text", "text": "Write a haiku for this image"},
],
},
],
[
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
},
{
"type": "image",
"url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
},
{
"type": "text",
"text": "These images depict two different landmarks. Can you identify them?",
},
],
},
],
]
batch_inputs = processor.apply_chat_template(
batch_messages,
padding=True,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
batch_outputs = model.generate(
**batch_inputs,
max_new_tokens=300,
do_sample=True,
temperature=0.3,
)
for i, output in enumerate(batch_outputs):
response = processor.tokenizer.decode(
output[batch_inputs.input_ids.shape[1]:],
skip_special_tokens=True
)
print(f"Response {i+1}:\n{response}\n")
```
## AyaVisionProcessor
[[autodoc]] AyaVisionProcessor
@ -268,6 +82,7 @@ print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
## AyaVisionModel
[[autodoc]] AyaVisionModel
- forward
## AyaVisionForConditionalGeneration
View File
@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2024-12-18 and added to Hugging Face Transformers on 2024-12-19.*
*This model was released on 2024-12-18 and added to Hugging Face Transformers on 2024-12-19 and contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
@ -25,106 +24,52 @@ rendered properly in your Markdown viewer.
# Bamba
[Bamba](https://huggingface.co/blog/bamba) is a 9B parameter decoder-only language model built on the [Mamba-2](./mamba2) architecture. It is pretrained in two stages - it starts by training on 2T tokens from the [Dolma v1.7](https://huggingface.co/datasets/allenai/dolma) dataset and then trained on an additional 200B tokens from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia).
You can find all the original Bamba checkpoints under the [Bamba](https://huggingface.co/collections/ibm-ai-platform/bamba-674f1388b9bbc98b413c7bab) collection.
> [!TIP]
> This model was contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).
>
> Click on the Bamba models in the right sidebar for more examples of how to apply Bamba to different text generation tasks.
The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
[Bamba-9B](https://github.com/state-spaces/mamba) is a new hybrid language model that combines Mamba2 and Transformer layers to improve inference efficiency. By interleaving Mamba2 layers, it avoids the memory bottleneck of the Transformer's growing KV-cache, achieving up to 2.5× higher throughput and 2× lower latency in vLLM. The model has 9 billion parameters and was trained on 2.2 trillion tokens of open data, with full training recipes and checkpoints released for reproducibility. It integrates seamlessly with Hugging Face tools like Transformers, TRL, vLLM, and llama.cpp, and comes with additional resources such as a stateless shuffle dataloader and quantization support. Developed in collaboration with IBM, Princeton, CMU, and UIUC, Bamba is intended as an open, efficient foundation for experimenting with hybrid architectures.
<hfoptions id="usage">
<hfoption id="Pipeline">
```python
```py
import torch
from transformers import pipeline
pipeline = pipeline(
task="text-generation",
model="ibm-ai-platform/Bamba-9B-v2",
dtype=torch.bfloat16,
device=0
)
pipeline("Plants create energy through a process known as")
pipeline = pipeline(task="text-generation", model="ibm-fms/Bamba-9B", dtype="auto")
pipeline("Plants generate energy through a process known as ")
```
</hfoption>
<hfoption id="AutoModel">
```python
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
model = AutoModelForCausalLM.from_pretrained("ibm-ai-platform/Bamba-9B-v2", dtype=torch.bfloat16, device_map="auto", attn_implementation="sdpa")
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors='pt', return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo "Plants create energy through a process known as" | transformers run --task text-generation --model ibm-ai-platform/Bamba-9B-v2 --device 0
```
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
## Usage tips
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
- Bamba supports padding-free training. This concatenates distinct training examples while processing inputs as separate batches. Expect ~2x inference acceleration (varies by model and data distribution). Memory usage drops when examples have varying lengths since you avoid padding token overhead.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
- Padding-free training requires the flash-attn, mamba-ssm, and causal-conv1d packages. Pass these arguments alongside `input_ids` and `labels`:
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
model = AutoModelForCausalLM.from_pretrained(
"ibm-ai-platform/Bamba-9B-v2",
quantization_config=quantization_config,
device_map="auto",
attn_implementation="sdpa"
)
- `position_ids`: `torch.LongTensor` - position index of each token in each sequence
- `seq_idx`: `torch.LongTensor` - index of each sequence in the batch
- `FlashAttentionKwargs`:
- `cu_seq_lens_q`: `torch.LongTensor` - cumulative sequence lengths of all queries
- `cu_seq_lens_k`: `torch.LongTensor` - cumulative sequence lengths of all keys
- `max_length_q`: `int` - longest query length in the batch
- `max_length_k`: `int` - longest key length in the batch
inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Notes
- Bamba supports padding-free training which concatenates distinct training examples while still processing inputs as separate batches. It can significantly accelerate inference by [~2x](https://github.com/huggingface/transformers/pull/35861#issue-2807873129) (depending on model and data distribution) and reduce memory-usage if there are examples of varying lengths by avoiding unnecessary compute and memory overhead from padding tokens.
Padding-free training requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d` packages and the following arguments must be passed to the model in addition to `input_ids` and `labels`.
- `position_ids: torch.LongTensor`: the position index of each token in each sequence.
- `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
- Each of the [`FlashAttentionKwargs`]
- `cu_seq_lens_q: torch.LongTensor`: the cumulative sequence lengths of all queries.
- `cu_seq_lens_k: torch.LongTensor`: the cumulative sequence lengths of all keys.
- `max_length_q: int`: the longest query length in the batch.
- `max_length_k: int`: the longest key length in the batch.
The `attention_mask` inputs should not be provided. The [`DataCollatorWithFlattening`] programmatically generates the set of additional arguments above using `return_seq_idx=True` and `return_flash_attn_kwargs=True`. See the [Improving Hugging Face Training Efficiency Through Packing with Flash Attention](https://huggingface.co/blog/packing-with-FA2) blog post for additional information.
```python
from transformers import DataCollatorWithFlattening
# Example of using padding-free training
data_collator = DataCollatorWithFlattening(
    return_seq_idx=True,
    return_flash_attn_kwargs=True
)
```
- Don't provide `attention_mask` inputs. The [`DataCollatorWithFlattening`] generates these arguments automatically when you set `return_seq_idx=True` and `return_flash_attn_kwargs=True`. See the [Improving Hugging Face Training Efficiency Through Packing with Flash Attention](https://huggingface.co/blog/packing-with-FA2) blog post for details.
## BambaConfig
View File
@ -9,165 +9,50 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
*This model was released on 2023-04-09 and added to Hugging Face Transformers on 2023-07-17.*
*This model was released on 2023-04-09 and added to Hugging Face Transformers on 2023-07-17 and contributed by [ylacombe](https://huggingface.co/ylacombe) and [sanchit-gandhi](https://github.com/sanchit-gandhi).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
</div>
</div>
# Bark
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
</div>
[Bark](https://github.com/suno-ai/bark) is a text-to-audio generative model capable of producing realistic speech, music, and sound effects directly from text prompts. It's built using a transformer-based architecture that models audio tokens rather than phonemes, enabling it to capture tone, emotion, and multilingual speech without explicit linguistic preprocessing. Bark uses semantic and coarse acoustic tokens, trained on diverse multilingual datasets, to generate natural prosody and expressive delivery. Its outputs are decoded from discrete audio representations, similar in spirit to models like EnCodec or VALL-E, allowing highly expressive and context-aware audio synthesis.
## Overview
<hfoptions id="usage">
<hfoption id="Pipeline">
[Bark](https://huggingface.co/suno/bark) is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).
Bark is made of 4 main models:
- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.
- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer, that takes as input the results of the [`BarkSemanticModel`] model. It aims at predicting the first two audio codebooks necessary for EnCodec.
- [`BarkFineModel`] (the 'fine acoustics' model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebooks embeddings.
- Having predicted all the codebook channels, Bark uses the [`EncodecModel`] to decode the output audio array.
Note that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.
This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
The original code can be found [here](https://github.com/suno-ai/bark).
### Optimizing Bark
Bark can be optimized with just a few extra lines of code, which **significantly reduces its memory footprint** and **accelerates inference**.
#### Using half-precision
You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision.
```python
from transformers import BarkModel
from accelerate import Accelerator
```py
import torch
from transformers import pipeline
device = Accelerator().device
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16).to(device)
pipeline = pipeline(task="text-to-audio", model="suno/bark-small", dtype="auto")
output = pipeline("Plants create energy through a process known as photosynthesis.")
audio = output["audio"]
```
#### Using CPU offload
</hfoption>
<hfoption id="BarkModel">
As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.
If you're using a CUDA GPU or Intel XPU, a simple solution to benefit from an 80% reduction in memory footprint is to offload the submodels from device to CPU when they're idle. This operation is called *CPU offloading*. You can use it with one line of code as follows:
```python
model.enable_cpu_offload()
```
Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)
#### Using Flash Attention 2
Flash Attention 2 is an even faster, optimized version of the previous optimization.
##### Installation
First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features).
Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
##### Usage
To load a model using Flash Attention 2, we can pass the `attn_implementation="flash_attention_2"` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
```python
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
```
##### Performance comparison
The following diagram shows the latency for the native attention implementation (no optimisation) against Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1:
<div style="text-align: center">
<img src="https://huggingface.co/datasets/ylacombe/benchmark-comparison/resolve/main/Bark%20Optimization%20Benchmark.png">
</div>
To put this into perspective, on an NVIDIA A100 and when generating 400 semantic tokens with a batch size of 16, you can get 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still be 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples will be generated 17 times faster.
#### Combining optimization techniques
You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 all at once.
```python
from transformers import BarkModel
from accelerate import Accelerator
```py
import torch
from scipy.io.wavfile import write as write_wav
from transformers import AutoProcessor, BarkModel
device = Accelerator().device
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark", dtype="auto")
# load in fp16 and use Flash Attention 2
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
# enable CPU offload
model.enable_cpu_offload()
inputs = processor("Plants create energy through a process known as photosynthesis.", voice_preset="v2/en_speaker_6")
audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()
sample_rate = model.generation_config.sample_rate
write_wav("bark_generation.wav", sample_rate, audio_array)
```
Find out more on inference optimization techniques [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one).
### Usage tips
Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c).
These presets are also uploaded in the hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).
```python
>>> from transformers import AutoProcessor, BarkModel
>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")
>>> voice_preset = "v2/en_speaker_6"
>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects.
```python
>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的!我会说中文")
>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")
>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
The model can also produce **nonverbal communications** like laughing, sighing and crying.
```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
To save the audio, simply take the sample rate from the model config and some scipy utility:
```python
>>> from scipy.io.wavfile import write as write_wav
>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
```
</hfoption>
</hfoptions>
## BarkConfig
@ -220,3 +105,4 @@ To save the audio, simply take the sample rate from the model config and some sc
[[autodoc]] BarkSemanticConfig
- all
View File
@ -13,22 +13,18 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16.*
*This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16 and contributed by [sshleifer](https://huggingface.co/sshleifer).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# BART
[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It's pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.
You can find all the original BART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=bart) organization.
The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
[BART](https://huggingface.co/papers/1910.13461) is a Transformer-based sequence-to-sequence model trained as a denoising autoencoder: text is corrupted with noise and the model learns to reconstruct the original. Its architecture combines a bidirectional encoder like BERT with a left-to-right decoder like GPT, making it a general framework for many pretraining approaches. Using techniques like sentence shuffling and span in-filling, BART achieves strong results on both generation and comprehension tasks, matching RoBERTa on GLUE and SQuAD while setting new state-of-the-art results in summarization, dialogue, and question answering. It also boosts machine translation performance and allows ablation experiments that replicate and compare other pretraining schemes.
<hfoptions id="usage">
<hfoption id="Pipeline">
@ -37,14 +33,8 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
import torch
from transformers import pipeline
pipeline = pipeline(
task="fill-mask",
model="facebook/bart-large",
dtype=torch.float16,
device=0
)
pipeline("Plants create <mask> through a process known as photosynthesis.")
pipeline = pipeline(task="summarization", model="facebook/bart-large-cnn", dtype="auto")
pipeline("The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.")
```
</hfoption>
@ -52,48 +42,30 @@ pipeline("Plants create <mask> through a process known as photosynthesis.")
```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"facebook/bart-large",
)
model = AutoModelForMaskedLM.from_pretrained(
"facebook/bart-large",
dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
with torch.no_grad():
outputs = model(**inputs)
predictions = outputs.logits
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"The predicted token is: {predicted_token}")
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo -e "Plants create <mask> through a process known as photosynthesis." | transformers run --task fill-mask --model facebook/bart-large --device 0
text="""
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.
"""
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
</hfoption>
</hfoptions>
## Notes
## Usage tips
- Inputs should be padded on the right because BART uses absolute position embeddings.
- The [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) checkpoint doesn't include `mask_token_id` which means it can't perform mask-filling tasks.
- BART doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
- The forward pass of [`BartModel`] creates the `decoder_input_ids` if they're not passed. This can be different from other model APIs, but it is a useful feature for mask-filling tasks.
- Model predictions are intended to be identical to the original implementation when `forced_bos_token_id=0`. This only works if the text passed to `fairseq.encode` begins with a space.
- [`~GenerationMixin.generate`] should be used for conditional generation tasks like summarization.
- Pad inputs on the right. BART uses absolute position embeddings.
- The facebook/bart-large-cnn checkpoint lacks `mask_token_id`, so it can't fill masks. Use a checkpoint like facebook/bart-large instead (see the sketch after these notes).
- BART ignores `token_type_ids` for sequence classification. Use [`BartTokenizer`] or `encode()` for proper splitting.
- [`BartModel`] creates `decoder_input_ids` automatically if you don't pass them. This differs from other model APIs but helps with mask-filling tasks.
- Model predictions match the original implementation when `forced_bos_token_id=0`. This works only if the text passed to `fairseq.encode` starts with a space.
- Use [`generate`] for conditional generation tasks like summarization.
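A minimal mask-filling sketch to go with the notes above. It assumes the facebook/bart-large checkpoint, which keeps its mask token:
```py
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large", dtype="auto")

inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt")

with torch.no_grad():
    # BartModel builds decoder_input_ids from input_ids when you don't pass them
    logits = model(**inputs).logits

mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_position].argmax(dim=-1)))
```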
## BartConfig
@ -134,3 +106,4 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran
[[autodoc]] BartForCausalLM
- forward

View File

@ -13,25 +13,11 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2020-10-23 and added to Hugging Face Transformers on 2020-11-27.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
*This model was released on 2020-10-23 and added to Hugging Face Transformers on 2020-11-27 and contributed by [moussakam](https://huggingface.co/moussakam).*
# BARThez
[BARThez](https://huggingface.co/papers/2010.12321) is a [BART](./bart) model designed for French language tasks. Unlike existing French BERT models, BARThez includes a pretrained encoder-decoder, allowing it to generate text as well. This model is also available as a multilingual variant, mBARThez, by continuing pretraining multilingual BART on a French corpus.
You can find all of the original BARThez checkpoints under the [BARThez](https://huggingface.co/collections/dascim/barthez-670920b569a07aa53e3b6887) collection.
> [!TIP]
> This model was contributed by [moussakam](https://huggingface.co/moussakam).
> Refer to the [BART](./bart) docs for more usage examples.
The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
[BARThez](https://huggingface.co/papers/2010.12321) is the first BART model for the French language, pretrained on a large monolingual French corpus. Unlike BERT-based models like CamemBERT and FlauBERT, BARThez includes both an encoder and a decoder pretrained, making it well-suited for generative tasks. Evaluated on the FLUE benchmark and a new summarization dataset, OrangeSum, BARThez demonstrates strong performance. Additionally, continuing the pretraining of multilingual BART on BARThez's corpus results in mBARTHez, which outperforms or matches CamemBERT and FlauBERT.
<hfoptions id="usage">
<hfoption id="Pipeline">
@ -40,13 +26,8 @@ The example below demonstrates how to predict the `<mask>` token with [`Pipeline
import torch
from transformers import pipeline
pipeline = pipeline(
task="fill-mask",
model="moussaKam/barthez",
dtype=torch.float16,
device=0
)
pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.")
pipeline = pipeline("fill-mask", model="moussaKam/barthez", dtype="auto")
pipeline("Les plantes créent <mask> grâce à un processus appelé photosynthèse.")
```
</hfoption>
@ -56,32 +37,15 @@ pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynt
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"moussaKam/barthez",
)
model = AutoModelForMaskedLM.from_pretrained(
"moussaKam/barthez",
dtype=torch.float16,
device_map="auto",
)
inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to(model.device)
model = AutoModelForMaskedLM.from_pretrained("moussaKam/barthez", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")
with torch.no_grad():
outputs = model(**inputs)
predictions = outputs.logits
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"The predicted token is: {predicted_token}")
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo -e "Les plantes produisent <mask> grâce à un processus appelé photosynthèse." | transformers run --task fill-mask --model moussaKam/barthez --device 0
inputs = tokenizer("Les plantes créent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt")
outputs = model(**inputs)
mask_token_id = tokenizer.mask_token_id
mask_position = (inputs.input_ids == mask_token_id).nonzero(as_tuple=True)[1]
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
print(f"Predicted word: {predicted_word}")
```
</hfoption>

View File

@ -13,92 +13,47 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-09-20 and added to Hugging Face Transformers on 2021-10-18.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
*This model was released on 2021-09-20 and added to Hugging Face Transformers on 2021-10-18 and contributed by [dqnguyen](https://huggingface.co/dqnguyen).*
# BARTpho
[BARTpho](https://huggingface.co/papers/2109.09701) is a large-scale Vietnamese sequence-to-sequence model. It offers a word-based and syllable-based version. This model is built on the [BART](./bart) large architecture with its denoising pretraining.
[BARTpho](https://huggingface.co/papers/2109.09701) introduces two versions, BARTpho_word and BARTpho_syllable, as the first large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. Leveraging the "large" architecture and pre-training scheme of BART, BARTpho excels in generative NLP tasks. Evaluations on Vietnamese text summarization demonstrate that BARTpho surpasses mBART, setting a new state-of-the-art. The model is released to support future research and applications in generative Vietnamese NLP.
You can find all the original checkpoints under the [VinAI](https://huggingface.co/vinai/models?search=bartpho) organization.
> [!TIP]
> This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen).
> Check out the right sidebar for examples of how to apply BARTpho to different language tasks.
The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
<hfoptions id="usage">
<hfoption id="Pipeline">
```python
```py
import torch
from transformers import pipeline
pipeline = pipeline(
task="summarization",
model="vinai/bartpho-word",
dtype=torch.float16,
device=0
)
text = """
Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ
"""
pipeline(text)
pipeline = pipeline("text2text-generation", model="vinai/bartpho-syllable", dtype="auto")
pipeline("Thực vật tạo ra năng lượng thông qua một quá trình được gọi là")
```
</hfoption>
<hfoption id="AutoModel">
```python
```py
import torch
from transformers import BartForConditionalGeneration, AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"vinai/bartpho-word",
)
model = BartForConditionalGeneration.from_pretrained(
"vinai/bartpho-word",
dtype=torch.float16,
device_map="auto",
)
model = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
text = """
Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ
"""
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo -e "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ" | \
transformers run --task summarization --model vinai/bartpho-word --device 0
inputs = tokenizer("Thực vật tạo ra năng lượng thông qua một quá trình được gọi là", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
</hfoption>
</hfoptions>
## Notes
## Usage tips
- BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes.
- This implementation only handles tokenization through the `monolingual_vocab_file` file. This is a Vietnamese-specific subset of token types taken from that multilingual vocabulary. If you want to use this tokenizer for another language, replace the `monolingual_vocab_file` with one specialized for your target language.
- BARTpho uses BART's large architecture plus an extra layer-normalization layer on the encoder and decoder. Replace BART-specific classes with mBART-specific classes (see the sketch after these tips).
- This implementation handles tokenization through the `monolingual_vocab_file`. This contains Vietnamese-specific token types from the multilingual vocabulary. For other languages, replace `monolingual_vocab_file` with one specialized for your target language.
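A minimal sketch of the mBART-class tip above, assuming the vinai/bartpho-syllable checkpoint:
```py
import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

# BARTpho checkpoints follow the mBART architecture, so load them with the mBART classes
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable", dtype="auto")

inputs = tokenizer("Chúng tôi là những nghiên cứu viên.", return_tensors="pt")  # "We are researchers."
with torch.no_grad():
    outputs = model(**inputs)  # decoder inputs are derived from input_ids when not provided
print(outputs.logits.shape)
```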
## BartphoTokenizer

View File

@ -13,120 +13,55 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2021-06-15 and added to Hugging Face Transformers on 2021-08-04.*
*This model was released on 2021-06-15 and added to Hugging Face Transformers on 2021-08-04 and contributed by [nielsr](https://huggingface.co/nielsr).*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# BEiT
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
[BEiT: BERT Pre-Training of Image Transformers](https://huggingface.co/papers/2106.08254) introduces a self-supervised vision representation model inspired by BERT. BEiT pre-trains Vision Transformers by predicting visual tokens from masked image patches. This approach outperforms supervised pre-training methods. Experiments show that BEiT achieves competitive results on image classification and semantic segmentation, with a base-size model reaching 83.2% top-1 accuracy on ImageNet-1K, surpassing DeiT trained from scratch. A large-size BEiT model achieves 86.3% on ImageNet-1K, even outperforming a ViT-L model pre-trained on ImageNet-22K.
## Overview
The BEiT model was proposed in [BEiT: BERT Pre-Training of Image Transformers](https://huggingface.co/papers/2106.08254) by
Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of
Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class
of an image (as done in the [original ViT paper](https://huggingface.co/papers/2010.11929)), BEiT models are pre-trained to
predict visual tokens from the codebook of OpenAI's [DALL-E model](https://huggingface.co/papers/2102.12092) given masked
patches.
The abstract from the paper is the following:
*We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation
from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image
modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image
patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into
visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training
objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we
directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
Experimental results on image classification and semantic segmentation show that our model achieves competitive results
with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K,
significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).
## Usage tips
- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
outperform both the [original model (ViT)](vit) and [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
[`ViTImageProcessor`] by [`BeitImageProcessor`] and
[`ViTForImageClassification`] by [`BeitForImageClassification`]).
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
- As the BEiT models expect each image to be of the same size (resolution), one can use
[`BeitImageProcessor`] to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of
14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
images and 1,000 classes).
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
relative position bias among the self-attention layers. During fine-tuning, each layer's relative position
bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
pre-train a model from scratch, one needs to set either the `use_relative_position_bias` or the
`use_shared_relative_position_bias` attribute of [`BeitConfig`] to `True` in order to add
position embeddings (a minimal sketch follows this list).
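A short sketch of that pre-training tip; it assumes `use_relative_position_bias` defaults to `False` in [`BeitConfig`]:
```py
from transformers import BeitConfig, BeitForMaskedImageModeling

# Enable per-layer relative position bias when pre-training from scratch
config = BeitConfig(use_relative_position_bias=True)
model = BeitForMaskedImageModeling(config)
print(model.config.use_relative_position_bias)
```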
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/beit_architecture.jpg"
alt="drawing" width="600"/>
<small> BEiT pre-training. Taken from the <a href="https://huggingface.co/papers/2106.08254">original paper.</a> </small>
### Using Scaled Dot Product Attention (SDPA)
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import BeitForImageClassification

model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
...
import torch
from transformers import pipeline
pipeline = pipeline(task="image-classification", model="microsoft/beit-base-patch16-224-pt22k", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
</hfoption>
<hfoption id="AutoModel">
On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.5.1, OS Ubuntu 20.04) with `float16` and
`microsoft/beit-base-patch16-224` model, we saw the following improvements during training and inference:
```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification
#### Training
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
| num_training_steps | batch_size | image_size | is_cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|--------------------|------------|--------------|---------|----------------------------|---------------------------|-------------|----------------------|--------------------|----------------|
| 50 | 2 | (1048, 640) | True | 0.984 | 0.746 | 31.975 | 6738.915 | 4319.886 | 55.998 |
image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
model = AutoModelForImageClassification.from_pretrained("microsoft/beit-base-patch16-224-pt22k", dtype="auto")
#### Inference
inputs = image_processor(image, return_tensors="pt")
| Image batch size | Eager (s/iter) | Eager CI, % | Eager memory (MB) | SDPA (s/iter) | SDPA CI, % | SDPA memory (MB) | SDPA speedup | SDPA memory saved (%) |
|-------------------:|-----------------:|:--------------|--------------------:|----------------:|:-------------|-------------------:|---------------:|----------------------:|
| 1 | 0.012 | ±0.3% | 3.76657e+08 | 0.011 | ±0.5% | 3.75739e+08 | 1.05 | 0.244 |
| 4 | 0.013 | ±0.1% | 4.03147e+08 | 0.011 | ±0.2% | 3.90554e+08 | 1.178 | 3.225 |
| 16 | 0.045 | ±0.1% | 4.96697e+08 | 0.035 | ±0.1% | 4.51232e+08 | 1.304 | 10.076 |
| 32 | 0.088 | ±0.1% | 6.24417e+08 | 0.066 | ±0.1% | 5.33488e+08 | 1.325 | 17.044 |
with torch.no_grad():
logits = model(**inputs).logits
## Resources
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BEiT.
<PipelineTag pipeline="image-classification"/>
- [`BeitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)
**Semantic segmentation**
- [Semantic segmentation task guide](../tasks/semantic_segmentation)
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
</hfoption>
</hfoptions>
## BEiT specific outputs
@ -167,3 +102,4 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] BeitForSemanticSegmentation
- forward

View File

@ -13,131 +13,46 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.
-->
*This model was released on 2019-07-29 and added to Hugging Face Transformers on 2020-11-16.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
*This model was released on 2019-07-29 and added to Hugging Face Transformers on 2020-11-16 and contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).*
# BertGeneration
[BertGeneration](https://huggingface.co/papers/1907.12461) leverages pretrained BERT checkpoints for sequence-to-sequence tasks with the [`EncoderDecoderModel`] architecture. BertGeneration adapts [`BERT`] for generative tasks.
You can find all the original BERT checkpoints under the [BERT](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc) collection.
> [!TIP]
> This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
>
> Click on the BertGeneration models in the right sidebar for more examples of how to apply BertGeneration to different sequence generation tasks.
The example below demonstrates how to use BertGeneration with [`EncoderDecoderModel`] for sequence-to-sequence tasks.
[BertGeneration](https://huggingface.co/papers/1907.12461) leverages pre-trained BERT checkpoints for sequence-to-sequence tasks using an EncoderDecoderModel framework. This approach achieves state-of-the-art results in Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion, demonstrating the utility of initializing both encoder and decoder with pre-trained models.
<hfoptions id="usage">
<hfoption id="Pipeline">
```python
```py
import torch
from transformers import pipeline
pipeline = pipeline(
task="text2text-generation",
model="google/roberta2roberta_L-24_discofuse",
dtype=torch.float16,
device=0
)
pipeline("Plants create energy through ")
pipeline = pipeline(task="text2text-generation", model="google/bert_for_seq_generation_L-24_bbc_encoder", dtype="auto")
pipeline("Plants generate energy through a process known as ")
```
</hfoption>
<hfoption id="AutoModel">
```python
```py
import torch
from transformers import EncoderDecoderModel, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
model = AutoModelForCausalLM.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
input_ids = tokenizer(
"Plants create energy through ", add_special_tokens=False, return_tensors="pt"
).input_ids
outputs = model.generate(input_ids)
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
</hfoption>
<hfoption id="transformers CLI">
```bash
echo -e "Plants create energy through " | transformers run --task text2text-generation --model "google/roberta2roberta_L-24_discofuse" --device 0
```
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
## Usage tips
The example below uses [BitsAndBytesConfig](../quantization/bitsandbytes) to quantize the weights to 4-bit.
```python
import torch
from transformers import EncoderDecoderModel, AutoTokenizer, BitsAndBytesConfig
# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = EncoderDecoderModel.from_pretrained(
"google/roberta2roberta_L-24_discofuse",
quantization_config=quantization_config,
dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
input_ids = tokenizer(
"Plants create energy through ", add_special_tokens=False, return_tensors="pt"
).input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
## Notes
- [`BertGenerationEncoder`] and [`BertGenerationDecoder`] should be used in combination with [`EncoderDecoderModel`] for sequence-to-sequence tasks.
```python
from transformers import BertGenerationEncoder, BertGenerationDecoder, BertTokenizer, EncoderDecoderModel
# leverage checkpoints for Bert2Bert model
# use BERT's cls token as BOS token and sep token as EOS token
encoder = BertGenerationEncoder.from_pretrained("google-bert/bert-large-uncased", bos_token_id=101, eos_token_id=102)
# add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
decoder = BertGenerationDecoder.from_pretrained(
"google-bert/bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
)
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
# create tokenizer
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased")
input_ids = tokenizer(
"This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
).input_ids
labels = tokenizer("This is a short summary", return_tensors="pt").input_ids
# train
loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
loss.backward()
```
- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
- No EOS token should be added to the end of the input for most generation tasks.
- Use [`BertGenerationEncoder`] and [`BertGenerationDecoder`] with [`EncoderDecoderModel`] for sequence-to-sequence tasks (a minimal sketch follows these notes).
- Summarization, sentence splitting, sentence fusion, and translation don't require special tokens in the input.
- Don't add `EOS` tokens to the end of inputs for most generation tasks.
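A minimal inference sketch for the first note, mirroring the Bert2Bert setup above; ids 101 and 102 are BERT's `[CLS]` and `[SEP]` tokens:
```py
import torch
from transformers import BertGenerationDecoder, BertGenerationEncoder, BertTokenizer, EncoderDecoderModel

# Tie a pretrained BERT encoder and decoder together for sequence-to-sequence use
encoder = BertGenerationEncoder.from_pretrained("google-bert/bert-large-uncased", bos_token_id=101, eos_token_id=102)
decoder = BertGenerationDecoder.from_pretrained(
    "google-bert/bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
)
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased")

# No EOS token added to the input; the cross-attention layers are untrained, so fine-tune before expecting useful output
input_ids = tokenizer("This is a long article to summarize", add_special_tokens=False, return_tensors="pt").input_ids
output_ids = bert2bert.generate(input_ids, max_length=20, decoder_start_token_id=101, pad_token_id=0)
print(tokenizer.decode(output_ids[0]))
```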
## BertGenerationConfig

Some files were not shown because too many files have changed in this diff.