mirror of https://github.com/huggingface/transformers.git
synced 2025-10-22 10:19:00 +08:00

Compare commits: multi_jobs ... model-docs (4 commits)

- 0ecb993601
- d1d5d4d758
- dc570c7505
- daf6069c48
.cursor/commands/style-guide.md (new file, 53 lines)
@@ -0,0 +1,53 @@
## Sentence structure

- Write short, declarative sentences most of the time.
- Vary sentence length to avoid sounding robotic. Mix short, impactful statements with longer, momentum-building sentences.
- Every time you use a comma, ask whether you can use a period instead.
- Avoid repeating the same words in a paragraph. Use synonyms or rephrase.

## Voice and tone

- Write like humans speak. Avoid corporate jargon and marketing fluff.
- Be confident and direct. Avoid softening phrases like "I think", "maybe", or "could".
- Use active voice instead of passive voice.
- Use positive phrasing: say what something *is* rather than what it *isn't*.
- Say "you" more than "we" when addressing external audiences.
- Use contractions like "I'll", "won't", and "can't" for a warmer tone.

## Specificity and evidence

- Be specific with facts and data instead of vague superlatives.
- Back up claims with concrete examples or metrics.
- Highlight customers and community members over company achievements.
- Use realistic, product-based examples instead of `foo/bar/baz` in code.
- Make content concrete, visual, and falsifiable.

## Title creation

- Make a promise in the title so readers know exactly what they'll get if they click.
- Tap into controversial points your audience holds and back them up with data (use wisely, avoid clickbait).
- Share something uniquely helpful that makes readers better at meaningful aspects of their lives.
- Avoid vague titles like "My Thoughts on XYZ". Titles should be opinions or shareable facts.
- Write placeholder titles first, complete the content, then spend time iterating on titles at the end.

## Banned phrases

- Avoid using "You can".

## Avoid LLM patterns

- Replace em dashes (—) with semicolons, commas, or sentence breaks.
- Avoid starting responses with "Great question!", "You're right!", or "Let me help you."
- Don't use phrases like "Let's dive into..."
- Skip cliché intros like "In today's fast-paced digital world" or "In the ever-evolving landscape of".
- Avoid phrases like "it's not just [x], it's [y]".
- Don't use high-school essay closers: "In conclusion,", "Overall,", or "To summarize".
- Avoid numbered lists in cases where bullets work better.
- Replace "In conclusion" with direct statements.
- Avoid hedge words: "might", "perhaps", "potentially" unless uncertainty is real.
- Don't stack hedging phrases: "may potentially", "it's important to note that".
- Don't create perfectly symmetrical paragraphs or lists that start with "Firstly... Secondly..."
- Avoid title-case headings: prefer sentence casing.
- Remove Unicode artifacts when copy-pasting: smart quotes (“”), em-dashes, non-breaking spaces.
- Use straight quotes (') instead of curly quotes (’).
- Delete empty citation placeholders like "[1]" with no actual source.

## Punctuation and formatting

- Use Oxford commas consistently.
- Use exclamation points sparingly.
- Sentences can start with "But" and "And", but don't overuse them.
- Use periods instead of commas when possible for clarity.
.github/workflows/check_failed_tests.yml (vendored, 70 lines changed)
@@ -41,14 +41,9 @@ env:
jobs:
  check_new_failures:
    name: "Find commits for new failing tests"
    strategy:
      matrix:
        run_idx: [1]
    name: " "
    runs-on:
      group: aws-g5-4xlarge-cache
    outputs:
      process: ${{ steps.check_file.outputs.process }}
    container:
      image: ${{ inputs.docker }}
      options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/

@@ -59,17 +54,14 @@ jobs:
          path: /transformers/ci_results_${{ inputs.job }}

      - name: Check file
        id: check_file
        working-directory: /transformers
        run: |
          if [ -f ci_results_${{ inputs.job }}/new_failures.json ]; then
            echo "`ci_results_${{ inputs.job }}/new_failures.json` exists, continue ..."
            echo "process=true" >> $GITHUB_ENV
            echo "process=true" >> $GITHUB_OUTPUT
          else
            echo "`ci_results_${{ inputs.job }}/new_failures.json` doesn't exist, abort."
            echo "process=false" >> $GITHUB_ENV
            echo "process=false" >> $GITHUB_OUTPUT
          fi

      - uses: actions/download-artifact@v4

@@ -126,10 +118,6 @@ jobs:
        run: |
          python3 utils/print_env.py

      - name: Install pytest-flakefinder
        if: ${{ env.process == 'true' }}
        run: python3 -m pip install pytest-flakefinder

      - name: Show installed libraries and their versions
        working-directory: /transformers
        if: ${{ env.process == 'true' }}

@@ -138,63 +126,25 @@ jobs:
      - name: Check failed tests
        working-directory: /transformers
        if: ${{ env.process == 'true' }}
        run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
        run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit.json

      - name: Show results
        working-directory: /transformers
        if: ${{ env.process == 'true' }}
        run: |
          ls -l new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
          cat new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
          ls -l new_failures_with_bad_commit.json
          cat new_failures_with_bad_commit.json

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}
          path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json

  process_new_failures_with_commit_info:
    name: "process bad commit reports"
    needs: check_new_failures
    if: needs.check_new_failures.outputs.process == 'true'
    runs-on:
      group: aws-g5-4xlarge-cache
    container:
      image: ${{ inputs.docker }}
      options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: ci_results_${{ inputs.job }}
          path: /transformers/ci_results_${{ inputs.job }}

      - uses: actions/download-artifact@v4
        with:
          pattern: new_failures_with_bad_commit_${{ inputs.job }}*
          path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}
          merge-multiple: true

      - name: Check files
      - name: Checkout back
        working-directory: /transformers
        if: ${{ env.process == 'true' }}
        run: |
          ls -la /transformers
          ls -la /transformers/new_failures_with_bad_commit_${{ inputs.job }}

      # Currently, we only run with a single runner by using `run_idx: [1]`. We might try to run with multiple runners
      # to further reduce the false positive caused by flaky tests, which requires further processing to merge reports.
      - name: Merge files
        shell: bash
        working-directory: /transformers
        run: |
          cp /transformers/new_failures_with_bad_commit_${{ inputs.job }}/new_failures_with_bad_commit_${{ inputs.job }}_1.json new_failures_with_bad_commit.json

      - name: Update clone
        working-directory: /transformers
        run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
        git checkout ${{ inputs.start_sha }}

      - name: Process report
        shell: bash
        working-directory: /transformers
        if: ${{ env.process == 'true' }}
        env:
          ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
          TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}

@@ -206,6 +156,7 @@ jobs:
      - name: Process report
        shell: bash
        working-directory: /transformers
        if: ${{ env.process == 'true' }}
        env:
          ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
          TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}

@@ -220,12 +171,13 @@ jobs:

      - name: Prepare Slack report title
        working-directory: /transformers
        if: ${{ env.process == 'true' }}
        run: |
          pip install slack_sdk
          echo "title=$(python3 -c 'import sys; sys.path.append("utils"); from utils.notification_service import job_to_test_map; ci_event = "${{ inputs.ci_event }}"; job = "${{ inputs.job }}"; test_name = job_to_test_map[job]; title = f"New failed tests of {ci_event}" + ":" + f" {test_name}"; print(title)')" >> $GITHUB_ENV

      - name: Send processed report
        if: ${{ !endsWith(env.REPORT_TEXT, '{}') }}
        if: ${{ env.process == 'true' && !endsWith(env.REPORT_TEXT, '{}') }}
        uses: slackapi/slack-github-action@6c661ce58804a1a20f6dc5fbee7f0381b469e001
        with:
          # Slack channel id, channel name, or user id to post message.
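The `Merge files` step above simply copies the single report produced by `run_idx: [1]`. If the job were ever run with several `run_idx` values, as the comment suggests, the per-runner reports would need to be merged. The sketch below shows one way that merge could look; it is illustrative only, and both the flat test-name-to-info layout of the JSON reports and the keep-only-common-failures rule are assumptions rather than anything defined in the workflow.

```python
import json
from pathlib import Path


def merge_runner_reports(report_dir: str, output_file: str) -> None:
    """Merge per-runner new_failures_with_bad_commit_*.json files into one report.

    Assumes each report is a flat mapping from test name to failure info, and keeps
    only tests that failed on every runner, so one-off flaky failures drop out.
    """
    reports = [json.loads(path.read_text()) for path in sorted(Path(report_dir).glob("*.json"))]
    merged = reports[0]
    for report in reports[1:]:
        merged = {test: info for test, info in merged.items() if test in report}
    Path(output_file).write_text(json.dumps(merged, indent=2))


# Hypothetical call matching the artifact layout used above:
# merge_runner_reports("new_failures_with_bad_commit_run_models_gpu", "new_failures_with_bad_commit.json")
```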
@@ -98,7 +98,7 @@ jobs:
      commit_sha: ${{ needs.get-pr-info.outputs.PR_HEAD_SHA }}
      pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
      package: transformers
      languages: ar de en es fr hi it ja ko pt zh
      languages: ar de en es fr hi it ko pt tr zh ja te

  update_run_status:
    name: Update Check Run Status
.github/workflows/self-scheduled-caller.yml (vendored, 68 lines changed)
@@ -6,7 +6,7 @@ on:
      - cron: "17 2 * * *"
  push:
    branches:
      - multi_jobs_to_check_bad_commit
      - run_nvidia_ci*
  workflow_dispatch:
    inputs:
      prev_workflow_run_id:

@@ -23,7 +23,7 @@ on:

# Used for `push` to easily modify the target workflow runs to compare against
env:
  prev_workflow_run_id: "18548615847"
  prev_workflow_run_id: ""
  other_workflow_run_id: ""

@@ -49,10 +49,72 @@ jobs:
    uses: ./.github/workflows/self-scheduled.yml
    with:
      job: run_models_gpu
      slack_report_channel: "#transformers-ci-dummy"
      slack_report_channel: "#transformers-ci-daily-models"
      docker: huggingface/transformers-all-latest-gpu
      ci_event: Daily CI
      runner_type: "a10"
      report_repo_id: hf-internal-testing/transformers_daily_ci
      commit_sha: ${{ github.sha }}
    secrets: inherit

  torch-pipeline:
    name: Torch pipeline CI
    uses: ./.github/workflows/self-scheduled.yml
    with:
      job: run_pipelines_torch_gpu
      slack_report_channel: "#transformers-ci-daily-pipeline-torch"
      docker: huggingface/transformers-pytorch-gpu
      ci_event: Daily CI
      report_repo_id: hf-internal-testing/transformers_daily_ci
      commit_sha: ${{ github.sha }}
    secrets: inherit

  example-ci:
    name: Example CI
    uses: ./.github/workflows/self-scheduled.yml
    with:
      job: run_examples_gpu
      slack_report_channel: "#transformers-ci-daily-examples"
      docker: huggingface/transformers-all-latest-gpu
      ci_event: Daily CI
      report_repo_id: hf-internal-testing/transformers_daily_ci
      commit_sha: ${{ github.sha }}
    secrets: inherit

  trainer-fsdp-ci:
    name: Trainer/FSDP CI
    uses: ./.github/workflows/self-scheduled.yml
    with:
      job: run_trainer_and_fsdp_gpu
      slack_report_channel: "#transformers-ci-daily-training"
      docker: huggingface/transformers-all-latest-gpu
      runner_type: "a10"
      ci_event: Daily CI
      report_repo_id: hf-internal-testing/transformers_daily_ci
      commit_sha: ${{ github.sha }}
    secrets: inherit

  deepspeed-ci:
    name: DeepSpeed CI
    uses: ./.github/workflows/self-scheduled.yml
    with:
      job: run_torch_cuda_extensions_gpu
      slack_report_channel: "#transformers-ci-daily-training"
      docker: huggingface/transformers-pytorch-deepspeed-latest-gpu
      ci_event: Daily CI
      working-directory-prefix: /workspace
      report_repo_id: hf-internal-testing/transformers_daily_ci
      commit_sha: ${{ github.sha }}
    secrets: inherit

  quantization-ci:
    name: Quantization CI
    uses: ./.github/workflows/self-scheduled.yml
    with:
      job: run_quantization_torch_gpu
      slack_report_channel: "#transformers-ci-daily-quantization"
      docker: huggingface/transformers-quantization-latest-gpu
      ci_event: Daily CI
      report_repo_id: hf-internal-testing/transformers_daily_ci
      commit_sha: ${{ github.sha }}
    secrets: inherit
.gitignore (vendored, 4 lines changed)
@@ -98,7 +98,6 @@ celerybeat-schedule
# Environments
.env
.venv
.venv*
env/
venv/
ENV/

@@ -172,6 +171,3 @@ tags

# modular conversion
*.modular_backup

# Cursor IDE files
.cursor/
@@ -16,6 +16,7 @@ import sys
from logging import Logger
from threading import Event, Thread
from time import perf_counter, sleep
from typing import Optional

# Add the parent directory to Python path to import benchmarks_entrypoint

@@ -41,7 +42,7 @@ except ImportError:
    GenerationConfig = None
    StaticCache = None

os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "1"

# Only set torch precision if torch is available

@@ -144,7 +145,7 @@ def run_benchmark(
        q = torch.empty_like(probs_sort).exponential_(1)
        return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)

    def logits_to_probs(logits, temperature: float = 1.0, top_k: int | None = None):
    def logits_to_probs(logits, temperature: float = 1.0, top_k: Optional[int] = None):
        logits = logits / max(temperature, 1e-5)

        if top_k is not None:

@@ -154,7 +155,7 @@ def run_benchmark(
        probs = torch.nn.functional.softmax(logits, dim=-1)
        return probs

    def sample(logits, temperature: float = 1.0, top_k: int | None = None):
    def sample(logits, temperature: float = 1.0, top_k: Optional[int] = None):
        probs = logits_to_probs(logits[0, -1], temperature, top_k)
        idx_next = multinomial_sample_one_no_sync(probs)
        return idx_next, probs
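The `exponential_(1)` trick used in the sampling helper above replaces `torch.multinomial` to avoid a host-device sync: taking `argmax(probs / q)` with `q ~ Exp(1)` draws index `i` with probability `probs[i]`. A quick Monte Carlo check of that claim (a standalone sketch, not part of the benchmark script):

```python
import torch

probs = torch.tensor([0.1, 0.2, 0.3, 0.4])
draws = 200_000

# q ~ Exp(1), one row per draw; argmax(probs / q) is an "exponential race" between categories.
q = torch.empty(draws, probs.numel()).exponential_(1)
samples = torch.argmax(probs / q, dim=-1)

print(torch.bincount(samples, minlength=probs.numel()) / draws)
# ≈ tensor([0.1000, 0.2000, 0.3000, 0.4000])
```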
@@ -2,5 +2,5 @@ gpustat==1.1.1
psutil==6.0.0
psycopg2==2.9.9
torch>=2.4.0
hf_xet
hf_transfer
pandas>=1.5.0
@@ -1,7 +1,7 @@
import hashlib
import json
import logging
from typing import Any
from typing import Any, Optional


KERNELIZATION_AVAILABLE = False

@@ -27,11 +27,11 @@ class BenchmarkConfig:
        sequence_length: int = 128,
        num_tokens_to_generate: int = 128,
        attn_implementation: str = "eager",
        sdpa_backend: str | None = None,
        compile_mode: str | None = None,
        compile_options: dict[str, Any] | None = None,
        sdpa_backend: Optional[str] = None,
        compile_mode: Optional[str] = None,
        compile_options: Optional[dict[str, Any]] = None,
        kernelize: bool = False,
        name: str | None = None,
        name: Optional[str] = None,
        skip_validity_check: bool = False,
    ) -> None:
        # Benchmark parameters

@@ -104,7 +104,7 @@ class BenchmarkConfig:
            "attn_implementation": self.attn_implementation,
            "sdpa_backend": self.sdpa_backend,
            "compile_mode": self.compile_mode,
            "compile_options": self.compile_options | {},  # to avoid inplace modification of the original dict
            "compile_options": self.compile_options,
            "kernelize": self.kernelize,
        }

@@ -128,8 +128,8 @@ class BenchmarkConfig:


def cross_generate_configs(
    attn_impl_and_sdpa_backend: list[tuple[str, str | None]],
    compiled_mode: list[str | None],
    attn_impl_and_sdpa_backend: list[tuple[str, Optional[str]]],
    compiled_mode: list[Optional[str]],
    kernelized: list[bool],
    warmup_iterations: int = 5,
    measurement_iterations: int = 20,

@@ -191,7 +191,7 @@ def generate_all_configs(
    )


def generate_main_configs(
def generate_default_configs(
    warmup_iterations: int = 5,
    measurement_iterations: int = 20,
    batch_size: int = 1,

@@ -199,17 +199,20 @@ def generate_main_configs(
    num_tokens_to_generate: int = 128,
    gpu_monitoring: bool = False,
) -> list[BenchmarkConfig]:
    # Create kwargs common to all configs
    kwargs = {
        "warmup_iterations": warmup_iterations,
        "measurement_iterations": measurement_iterations,
        "batch_size": batch_size,
        "sequence_length": sequence_length,
        "num_tokens_to_generate": num_tokens_to_generate,
        "gpu_monitoring": gpu_monitoring,
    }
    return [  # TODO: test max-autotune instead of default
        BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", **kwargs),
        BenchmarkConfig(attn_implementation="eager", compile_mode="default", **kwargs),
        BenchmarkConfig(attn_implementation="flash_attention_2", **kwargs),
    all_attn_implementations = [
        ("flash_attention_2", None),
        ("eager", None),
        ("sdpa", "math"),
        ("sdpa", "flash_attention"),  # note: this one can fail with compile because of attn mask
    ]
    return cross_generate_configs(
        attn_impl_and_sdpa_backend=all_attn_implementations,
        compiled_mode=[None, "max-autotune"],
        kernelized=[False, KERNELIZATION_AVAILABLE],
        warmup_iterations=warmup_iterations,
        measurement_iterations=measurement_iterations,
        batch_size=batch_size,
        sequence_length=sequence_length,
        num_tokens_to_generate=num_tokens_to_generate,
        gpu_monitoring=gpu_monitoring,
    )
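Based only on the signatures visible in this hunk, the config helpers would be combined roughly as in the sketch below. The import path follows the `framework.benchmark_config` module name used elsewhere in this changeset; treat the call as an illustration rather than a guaranteed API.

```python
from framework.benchmark_config import generate_default_configs

# Sweep the attention implementation / SDPA backend / compile mode / kernelize axes
# for a single (batch_size, sequence_length, num_tokens_to_generate) point.
configs = generate_default_configs(
    warmup_iterations=3,
    measurement_iterations=10,
    batch_size=1,
    sequence_length=128,
    num_tokens_to_generate=64,
)
for cfg in configs:
    print(cfg.infer_name(compact=False))  # infer_name() is used the same way in the runner code below
```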
@@ -8,7 +8,7 @@ import time
from contextlib import nullcontext
from datetime import datetime
from queue import Queue
from typing import Any
from typing import Any, Optional

import torch
from tqdm import trange

@@ -74,7 +74,7 @@ def get_git_revision() -> str:
    return git_hash.readline().strip()


def get_sdpa_backend(backend_name: str | None) -> torch.nn.attention.SDPBackend | None:
def get_sdpa_backend(backend_name: Optional[str]) -> Optional[torch.nn.attention.SDPBackend]:
    """Get the SDPA backend enum from string name."""
    if backend_name is None:
        return None

@@ -144,11 +144,11 @@ class BenchmarkStreamer(BaseStreamer):
class BenchmarkRunner:
    """Main benchmark runner that coordinates benchmark execution."""

    def __init__(self, logger: logging.Logger, output_dir: str | None = None, commit_id: str | None = None) -> None:
    def __init__(
        self, logger: logging.Logger, output_dir: str = "benchmark_results", commit_id: Optional[str] = None
    ) -> None:
        # Those stay constant for the whole run
        self.logger = logger
        if output_dir is None:
            output_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), "benchmark_results")
        self.output_dir = output_dir
        self.commit_id = get_git_revision() if commit_id is None else commit_id
        os.makedirs(self.output_dir, exist_ok=True)

@@ -156,7 +156,7 @@ class BenchmarkRunner:
        # Attributes that are reset for each model
        self._setup_for = ""
        # Attributes that are reset for each run
        self.model: GenerationMixin | None = None
        self.model: Optional[GenerationMixin] = None

    def cleanup(self) -> None:
        del self.model

@@ -214,7 +214,7 @@ class BenchmarkRunner:

        # Quick validation: try one measurement first to see if this scenario works
        flush_memory()
        e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
        e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
            max_new_tokens=1, gpu_monitor=None
        )
        if e2e_latency < 0:

@@ -231,11 +231,11 @@ class BenchmarkRunner:
        result = BenchmarkResult()
        self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
        for _ in trange(config.measurement_iterations):
            e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
            e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
                max_new_tokens=config.num_tokens_to_generate,
                gpu_monitor=(GPUMonitor(logger=self.logger) if config.gpu_monitoring else None),
            )
            result.accumulate(e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics)
            result.accumulate(e2e_latency, token_generation_times, decoded_output, gpu_metrics)
        self.logger.info("Benchmarking done. Cleaning up.")

        # Profile if needed

@@ -251,8 +251,8 @@ class BenchmarkRunner:
    def time_generate(
        self,
        max_new_tokens: int,
        gpu_monitor: GPUMonitor | None = None,
    ) -> tuple[float, list[float], str, GPURawMetrics | None]:
        gpu_monitor: Optional[GPUMonitor] = None,
    ) -> tuple[float, list[float], str, Optional[GPURawMetrics]]:
        """Time the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
        # Prepare gpu monitoring if needed
        if gpu_monitor is not None:

@@ -277,11 +277,10 @@ class BenchmarkRunner:
            raise RuntimeError(f"Generated {new_tokens} tokens, expected {max_new_tokens}")
        # Decode outputs
        decoded_output = self.tokenizer.decode(outputs[0, input_tokens:], skip_special_tokens=True)
        shape_and_decoded_output = f"{tuple(outputs.shape)} | {decoded_output}"
        # Compute intermediate quantities
        e2e_latency = wall_time_1 - wall_time_0
        token_generation_times = [t - wall_time_0 for t in streamer.timestamps[1:]]
        return e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics
        return e2e_latency, token_generation_times, decoded_output, gpu_metrics

    def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
        """Profile the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""

@@ -352,10 +351,10 @@ class BenchmarkRunner:
        first_metadata = all_results[first_key]["metadata"].to_dict()
        hardware_info = first_metadata.pop("hardware_info")
        pretty_print_dict(first_metadata | hardware_info, tabs=1)
        for result in all_results.values():
        for value in all_results.values():
            print("=" * 100)
            print(f"Config: {result['config'].infer_name(compact=False)}\n")
            result["measurements"].pprint(batch_size=result["config"].batch_size, tabs=1)
            print(f"Config: {value['config'].infer_name(compact=False)}\n")
            value["measurements"].pprint(tabs=1)
            print("=" * 100)

        return all_results
@@ -1,6 +1,6 @@
from dataclasses import dataclass
from datetime import datetime
from typing import Any
from typing import Any, Optional, Union

import numpy as np

@@ -82,22 +82,22 @@ class BenchmarkResult:
    def __init__(self) -> None:
        self.e2e_latency = []
        self.token_generation_times = []  # time at which each token was generated (relative to start of the generation)
        self.shape_and_decoded_outputs = []
        self.decoded_outputs = []
        self.gpu_metrics = []

    def accumulate(
        self,
        e2e_latency: float,
        token_generation_times: list[float],
        shape_and_decoded_output: str,
        gpu_metrics: GPURawMetrics | None,
        decoded_output: str,
        gpu_metrics: Optional[GPURawMetrics],
    ) -> None:
        self.e2e_latency.append(e2e_latency)
        self.token_generation_times.append(token_generation_times)
        self.shape_and_decoded_outputs.append(shape_and_decoded_output)
        self.decoded_outputs.append(decoded_output)
        self.gpu_metrics.append(gpu_metrics)

    def to_dict(self) -> dict[str, None | int | float]:
    def to_dict(self) -> dict[str, Union[None, int, float]]:
        # Save GPU metrics as None if it contains only None values
        if all(gm is None for gm in self.gpu_metrics):
            gpu_metrics = None

@@ -106,12 +106,12 @@ class BenchmarkResult:
        return {
            "e2e_latency": self.e2e_latency,
            "token_generation_times": self.token_generation_times,
            "shape_and_decoded_outputs": self.shape_and_decoded_outputs,
            "decoded_outputs": self.decoded_outputs,
            "gpu_metrics": gpu_metrics,
        }

    @classmethod
    def from_dict(cls, data: dict[str, None | int | float]) -> "BenchmarkResult":
    def from_dict(cls, data: dict[str, Union[None, int, float]]) -> "BenchmarkResult":
        # Handle GPU metrics, which is saved as None if it contains only None values
        if data["gpu_metrics"] is None:
            gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]

@@ -123,7 +123,7 @@ class BenchmarkResult:
            new_instance.accumulate(
                e2e_latency=data["e2e_latency"][i],
                token_generation_times=data["token_generation_times"][i],
                shape_and_decoded_output=data["shape_and_decoded_outputs"][i],
                decoded_output=data["decoded_output"][i],
                gpu_metrics=gpu_metrics[i],
            )
        return new_instance

@@ -134,27 +134,19 @@ class BenchmarkResult:
    def get_measured_itl(self) -> list[float]:
        return [(dt[-1] - dt[0]) / (len(dt) - 1) for dt in self.token_generation_times if len(dt) > 1]

    def get_throughput(self, batch_size: int) -> float:
        return [
            batch_size * len(dt) / e2e_latency
            for e2e_latency, dt in zip(self.e2e_latency, self.token_generation_times)
        ]

    def pprint(self, batch_size: int = 0, tabs: int = 0) -> None:
        stats_to_collate = [
            add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
            add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
            add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
        ]
        if batch_size > 0:
            throughput_stats = compute_basic_statistics(self.get_throughput(batch_size))
            stats_to_collate.append({key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()})
        collated_stats = equalize_lengths_and_collate(stats_to_collate)
        dict_to_pprint = {
            "E2E Latency": collated_stats[0],
            "Time to First Token": collated_stats[1],
            "Inter-Token Latency": collated_stats[2],
        }
        if batch_size > 0:
            dict_to_pprint["Throughput"] = collated_stats[3]
        pretty_print_dict(dict_to_pprint, tabs=tabs)
    def pprint(self, tabs: int = 0) -> None:
        collated_stats = equalize_lengths_and_collate(
            [
                add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
                add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
                add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
            ]
        )
        pretty_print_dict(
            {
                "E2E Latency": collated_stats[0],
                "Time to First Token": collated_stats[1],
                "Inter-Token Latency": collated_stats[2],
            },
            tabs=tabs,
        )
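To make those metrics concrete, here is a small worked example. It assumes, as the method names above suggest, that `get_measured_ttft` returns the first entry of each per-iteration timestamp list; the numbers are made up.

```python
# One measurement iteration: timestamps (seconds, relative to the start of generate())
# at which each new token arrived, as accumulated in token_generation_times.
timestamps = [0.21, 0.25, 0.29, 0.34, 0.38]
e2e_latency = 0.40  # wall_time_1 - wall_time_0 for the whole generate() call

ttft = timestamps[0]                                            # time to first token: 0.21 s
itl = (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)  # inter-token latency: 0.0425 s
throughput = 1 * len(timestamps) / e2e_latency                  # batch_size * tokens / e2e: 12.5 tok/s
print(ttft, itl, throughput)
```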
@@ -7,6 +7,7 @@ import time
from dataclasses import dataclass
from enum import Enum
from logging import Logger
from typing import Optional, Union

import gpustat
import psutil

@@ -41,7 +42,7 @@ class HardwareInfo:
        self.cpu_count = psutil.cpu_count()
        self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))

    def to_dict(self) -> dict[str, None | int | float | str]:
    def to_dict(self) -> dict[str, Union[None, int, float, str]]:
        return {
            "gpu_name": self.gpu_name,
            "gpu_memory_total_gb": self.gpu_memory_total_gb,

@@ -108,7 +109,7 @@ class GPURawMetrics:
    timestamp_0: float  # in seconds
    monitoring_status: GPUMonitoringStatus

    def to_dict(self) -> dict[str, None | int | float | str]:
    def to_dict(self) -> dict[str, Union[None, int, float, str]]:
        return {
            "utilization": self.utilization,
            "memory_used": self.memory_used,

@@ -122,7 +123,7 @@ class GPURawMetrics:
class GPUMonitor:
    """Monitor GPU utilization during benchmark execution."""

    def __init__(self, sample_interval_sec: float = 0.1, logger: Logger | None = None):
    def __init__(self, sample_interval_sec: float = 0.1, logger: Optional[Logger] = None):
        self.sample_interval_sec = sample_interval_sec
        self.logger = logger if logger is not None else logging.getLogger(__name__)
@@ -20,28 +20,28 @@ in the ./benches directory, organizing outputs into model-specific subfolders.

import argparse
import logging
import random
import sys
import uuid

from framework.benchmark_config import BenchmarkConfig, generate_all_configs, generate_main_configs
from framework.benchmark_config import BenchmarkConfig, generate_all_configs
from framework.benchmark_runner import BenchmarkRunner


if __name__ == "__main__":
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", type=str, default=None, help="Output dir for benchmark results")
    parser.add_argument("--output-dir", type=str, default="benchmark_results", help="Output dir for benchmark results")
    parser.add_argument("--log-level", type=str, choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO")
    parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")

    parser.add_argument("--warmup", type=int, default=3, help="Number of warmup iterations")
    parser.add_argument("--iterations", type=int, default=10, help="Number of measurement iterations")
    parser.add_argument("--warmup", type=int, default=5, help="Number of warmup iterations")
    parser.add_argument("--iterations", type=int, default=20, help="Number of measurement iterations")

    parser.add_argument("--batch-size", "-b", type=int, nargs="+", help="Batch size")
    parser.add_argument("--sequence-length", "-s", type=int, nargs="+", help="Sequence length")
    parser.add_argument("--num-tokens-to-generate", "-n", type=int, nargs="+", help="Number of tokens to generate")

    parser.add_argument("--cross-generate", action="store_true", help="Cross-generate all combinations of configs")
    parser.add_argument("--num-tokens-to-profile", "-p", type=int, default=0, help="Number of tokens to profile")

    parser.add_argument("--commit-id", type=str, help="Git commit ID (if not provided, will auto-detect from git)")

@@ -69,47 +69,42 @@ if __name__ == "__main__":

    # If there is only one (batch_size, sequence_length, num_tokens_to_generate), we benchmark across configs
    elif len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 1:
        if args.cross_generate:
            benchmark_configs = generate_all_configs(
                warmup_iterations=args.warmup,
                measurement_iterations=args.iterations,
                batch_size=args.batch_size[0],
                sequence_length=args.sequence_length[0],
                num_tokens_to_generate=args.num_tokens_to_generate[0],
            )
        else:
            benchmark_configs = generate_main_configs(
                warmup_iterations=args.warmup,
                measurement_iterations=args.iterations,
                batch_size=args.batch_size[0],
                sequence_length=args.sequence_length[0],
                num_tokens_to_generate=args.num_tokens_to_generate[0],
            )

    # Otherwise, we benchmark across all combinations of dimensions
    else:
        main_config = generate_main_configs(
        benchmark_configs = generate_all_configs(
            warmup_iterations=args.warmup,
            measurement_iterations=args.iterations,
            batch_size=args.batch_size[0],
            sequence_length=args.sequence_length[0],
            num_tokens_to_generate=args.num_tokens_to_generate[0],
        )[0]
        )
        random.shuffle(benchmark_configs)

    # Otherwise, we benchmark across all combinations of dimensions
    else:
        kwargs = {
            "warmup_iterations": args.warmup,
            "measurement_iterations": args.iterations,
            "gpu_monitoring": False,
            "batch_size": args.batch_size[0],
            "sequence_length": args.sequence_length[0],
            "num_tokens_to_generate": args.num_tokens_to_generate[0],
            "attn_implementation": "flex_attention",
            "sdpa_backend": None,
            "compile_mode": "default",
            "kernelize": False,
        }
        benchmark_configs = []
        for num_tokens_to_generate in args.num_tokens_to_generate:
            for sequence_length in args.sequence_length:
                for batch_size in args.batch_size:
                    cfg_dict = main_config.to_dict()
                    cfg_dict["batch_size"] = batch_size
                    cfg_dict["sequence_length"] = sequence_length
                    cfg_dict["num_tokens_to_generate"] = num_tokens_to_generate
                    cfg_dict.pop("name")
                    benchmark_configs.append(BenchmarkConfig.from_dict(cfg_dict))
                    kwargs["batch_size"] = batch_size
                    kwargs["sequence_length"] = sequence_length
                    kwargs["num_tokens_to_generate"] = num_tokens_to_generate
                    benchmark_configs.append(BenchmarkConfig(**kwargs))

    runner = BenchmarkRunner(logger, args.output_dir, args.commit_id)
    results = runner.run_benchmarks(
        args.model_id,
        benchmark_configs,
        benchmark_configs[:3],
        args.num_tokens_to_profile,
        pretty_print_summary=True,
    )
@@ -55,7 +55,6 @@ deepspeed --num_gpus 2 trainer-program.py ...
</hfoptions>

## Order of accelerators

To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.

For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
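On NVIDIA GPUs, for instance, this is controlled with `CUDA_VISIBLE_DEVICES`; the snippet below is an illustrative sketch (other vendors use their own equivalent, such as `HIP_VISIBLE_DEVICES` on AMD ROCm), and the variable must be set before CUDA is initialized.

```python
import os

# Expose only accelerators 0 and 2, in that order, to this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"

import torch

print(torch.cuda.device_count())  # 2 — physical devices 0 and 2 now appear as cuda:0 and cuda:1
```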
@@ -6,13 +6,13 @@ rendered properly in your Markdown viewer.

This page regroups resources around 🤗 Transformers developed by the community.

## Community resources
## Community resources:

| Resource | Description | Author |
|:----------|:-------------|------:|
| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learned/revised using [Anki](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |

## Community notebooks
## Community notebooks:

| Notebook | Description | Author | |
|:----------|:-------------|:-------------|------:|
@@ -208,7 +208,7 @@ Some models have a unique way of storing past kv pairs or states that is not com

Mamba models, such as [Mamba](./model_doc/mamba), require a specific cache because the model doesn't have an attention mechanism or kv states. Thus, they are not compatible with the above [`Cache`] classes.

## Iterative generation
# Iterative generation

A cache can also work in iterative generation settings where there is back-and-forth interaction with a model (chatbots). Like regular generation, iterative generation with a cache allows a model to efficiently handle ongoing conversations without recomputing the entire context at each step.
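A minimal sketch of that pattern (the checkpoint and prompts are placeholders, and it assumes a recent Transformers version where `generate` accepts a `DynamicCache` through `past_key_values`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")

cache = DynamicCache()
messages = []
for user_turn in ["Hi, who are you?", "Summarize that in five words."]:
    messages.append({"role": "user", "content": user_turn})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    )
    input_length = inputs["input_ids"].shape[1]
    # Reusing the same cache object means earlier turns are not re-processed.
    outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
    reply = tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})
    print(reply)
```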
@@ -13,66 +13,51 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2024-11-21 and added to Hugging Face Transformers on 2025-07-08.*

*This model was released on 2024-11-21 and added to Hugging Face Transformers on 2025-07-08 and contributed by [yaswanthgali](https://huggingface.co/yaswanthgali).*

# AIMv2

## Overview
[AIMv2](https://huggingface.co/papers/2411.14402) presents a novel method for pre-training large-scale vision encoders in a multimodal setting, combining images and text. The model, characterized by a straightforward pre-training process and scalability, pairs a vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. AIMV2 excels in both multimodal evaluations and vision benchmarks such as localization, grounding, and classification. Notably, the AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk and outperforms state-of-the-art contrastive models like CLIP and SigLIP in multimodal image understanding across various settings.

The AIMv2 model was proposed in [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://huggingface.co/papers/2411.14402) by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
<hfoptions id="usage">
<hfoption id="Pipeline">

The abstract from the paper is the following:
```py
import torch
from transformers import pipeline

*We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.*

This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
The original code can be found [here](https://github.com/apple/ml-aim).

## Usage Example

Here is an example of Image Feature Extraction using specific checkpoints on resized images and native resolution images:

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-native")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-native")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
pipeline = pipeline(task="zero-shot-classification", model="apple/aimv2-large-patch14-native", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

Here is an example of a checkpoint performing zero-shot classification:
</hfoption>
<hfoption id="AutoModel">

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["Picture of a dog.", "Picture of a cat.", "Picture of a horse."]

processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", dtype="auto")

inputs = processor(
    images=image,
    text=text,
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
inputs = processor(images=image, text=text, add_special_tokens=True, truncation=True, padding=True, return_tensors="pt",)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
pred_idx = torch.argmax(probs, dim=-1).item()
predicted_label = text[pred_idx]
print(f"Predicted label: {predicted_label}")
```

</hfoption>
</hfoptions>

## Aimv2Config

[[autodoc]] Aimv2Config

@@ -99,3 +84,4 @@ probs = outputs.logits_per_image.softmax(dim=-1)

[[autodoc]] Aimv2TextModel
    - forward
@@ -13,32 +13,17 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2019-09-26 and added to Hugging Face Transformers on 2020-11-16.*
*This model was released on 2019-09-26 and added to Hugging Face Transformers on 2020-11-16 and contributed by [lysandre](https://huggingface.co/lysandre).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white" >
        <img alt="SDPA" src= "https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white" >
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# ALBERT

[ALBERT](https://huggingface.co/papers/1909.11942) is designed to address memory limitations of scaling and training of [BERT](./bert). It adds two parameter reduction techniques. The first, factorized embedding parametrization, splits the larger vocabulary embedding matrix into two smaller matrices so you can grow the hidden size without adding a lot more parameters. The second, cross-layer parameter sharing, allows layers to share parameters, which keeps the number of learnable parameters lower.

ALBERT was created to address problems like GPU/TPU memory limitations, longer training times, and unexpected model degradation in BERT. ALBERT uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

- **Factorized embedding parameterization:** The large vocabulary embedding matrix is decomposed into two smaller matrices, reducing memory consumption.
- **Cross-layer parameter sharing:** Instead of learning separate parameters for each transformer layer, ALBERT shares parameters across layers, further reducing the number of learnable weights.

ALBERT uses absolute position embeddings (like BERT), so padding is applied on the right. The embedding size is 128, while BERT uses 768. ALBERT can process a maximum of 512 tokens at a time.

You can find all the original ALBERT checkpoints under the [ALBERT community](https://huggingface.co/albert) organization.

> [!TIP]
> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different language tasks.

The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
[ALBERT](https://huggingface.co/papers/1909.11942) presents parameter-reduction techniques to enhance BERT by splitting the embedding matrix and using repeating layers. These methods reduce memory usage and training time, enabling better scalability. The model employs a self-supervised loss to improve inter-sentence coherence, achieving state-of-the-art results on GLUE, RACE, and SQuAD benchmarks with fewer parameters than BERT-large.

<hfoptions id="usage">
<hfoption id="Pipeline">

@@ -47,13 +32,8 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
import torch
from transformers import pipeline

pipeline = pipeline(
    task="fill-mask",
    model="albert-base-v2",
    dtype=torch.float16,
    device=0
)
pipeline("Plants create [MASK] through a process known as photosynthesis.", top_k=5)
pipeline = pipeline(task="fill-mask", model="albert/albert-base-v2", dtype="auto")
pipeline("Plants create [MASK] through a process known as photosynthesis.")
```

</hfoption>

@@ -63,76 +43,25 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.", top_
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("albert/albert-base-v2", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AutoModelForMaskedLM.from_pretrained(
    "albert/albert-base-v2",
    dtype=torch.float16,
    attn_implementation="sdpa",
    device_map="auto"
)

prompt = "Plants create energy through a process known as [MASK]."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    predictions = outputs.logits[0, mask_token_index]

top_k = torch.topk(predictions, k=5).indices.tolist()
for token_id in top_k[0]:
    print(f"Prediction: {tokenizer.decode([token_id])}")
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
outputs = model(**inputs)
mask_token_id = tokenizer.mask_token_id
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
print(f"Predicted word: {predicted_word}")
```

</hfoption>
<hfoption id="transformers CLI">

```bash
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model albert-base-v2 --device 0
```

</hfoption>

</hfoptions>

## Notes
## Usage tips

- Inputs should be padded on the right because BERT uses absolute position embeddings.
- The embedding size `E` is different from the hidden size `H` because the embeddings are context independent (one embedding vector represents one token) and the hidden states are context dependent (one hidden state represents a sequence of tokens). The embedding matrix is also large because it is `V x E`, where `V` is the vocabulary size. As a result, it's more logical if `H >> E`. If `E < H`, the model has fewer parameters.
- ALBERT uses absolute position embeddings. Pad inputs on the right, not the left.

## Resources

The resources provided in the following sections consist of a list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

<PipelineTag pipeline="text-classification"/>

- [`AlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification).
- Check the [Text classification task guide](../tasks/sequence_classification) on how to use the model.

<PipelineTag pipeline="token-classification"/>

- [`AlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification).
- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Token classification task guide](../tasks/token_classification) on how to use the model.

<PipelineTag pipeline="fill-mask"/>

- [`AlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
- [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Masked language modeling task guide](../tasks/masked_language_modeling) on how to use the model.

<PipelineTag pipeline="question-answering"/>

- [`AlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
- [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course.
- Check the [Question answering task guide](../tasks/question_answering) on how to use the model.

**Multiple choice**

- [`AlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb).
- Check the [Multiple choice task guide](../tasks/multiple_choice) on how to use the model.
- The embedding size E differs from hidden size H for good reason. Embeddings represent individual tokens (context-independent). Hidden states represent token sequences (context-dependent). This makes H >> E logical. The embedding matrix spans V × E dimensions, where V is vocabulary size. Keeping E < H reduces parameter count. A back-of-the-envelope check is sketched below.
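A quick check of that note, using illustrative sizes (V = 30,000, E = 128, H = 768; not the exact figures of any particular checkpoint):

```python
# Untied embedding table (BERT-style): V x H parameters.
V, E, H = 30_000, 128, 768
bert_style = V * H                 # 23,040,000
# Factorized (ALBERT-style): a V x E table plus an E x H projection up to the hidden size.
albert_style = V * E + E * H       # 3,938,304
print(f"{bert_style:,} vs {albert_style:,} ({bert_style / albert_style:.1f}x smaller)")
```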
## AlbertConfig

@@ -140,7 +69,11 @@ The resources provided in the following sections consist of a list of official H

## AlbertTokenizer

[[autodoc]] AlbertTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary
[[autodoc]] AlbertTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## AlbertTokenizerFast

@@ -152,19 +85,23 @@ The resources provided in the following sections consist of a list of official H

## AlbertModel

[[autodoc]] AlbertModel - forward
[[autodoc]] AlbertModel
    - forward

## AlbertForPreTraining

[[autodoc]] AlbertForPreTraining - forward
[[autodoc]] AlbertForPreTraining
    - forward

## AlbertForMaskedLM

[[autodoc]] AlbertForMaskedLM - forward
[[autodoc]] AlbertForMaskedLM
    - forward

## AlbertForSequenceClassification

[[autodoc]] AlbertForSequenceClassification - forward
[[autodoc]] AlbertForSequenceClassification
    - forward

## AlbertForMultipleChoice

@@ -172,8 +109,10 @@ The resources provided in the following sections consist of a list of official H

## AlbertForTokenClassification

[[autodoc]] AlbertForTokenClassification - forward
[[autodoc]] AlbertForTokenClassification
    - forward

## AlbertForQuestionAnswering

[[autodoc]] AlbertForQuestionAnswering - forward
[[autodoc]] AlbertForQuestionAnswering
    - forward
@@ -13,46 +13,21 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2021-02-11 and added to Hugging Face Transformers on 2023-03-01.*
<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="Transformers" src="https://img.shields.io/badge/Transformers-6B5B95?style=flat&logo=transformers&logoColor=white">
    </div>
</div>
*This model was released on 2021-02-11 and added to Hugging Face Transformers on 2023-03-01 and contributed by [adirik](https://huggingface.co/adirik).*

# ALIGN

[ALIGN](https://huggingface.co/papers/2102.05918) is pretrained on a noisy 1.8 billion alt-text and image pair dataset to show that scale can make up for the noise. It uses a dual-encoder architecture, [EfficientNet](./efficientnet) for images and [BERT](./bert) for text, and a contrastive loss to align similar image-text embeddings together while pushing different embeddings apart. Once trained, ALIGN can encode any image and candidate captions into a shared vector space for zero-shot retrieval or classification without requiring extra labels. This scale-first approach reduces dataset curation costs and powers state-of-the-art image-text retrieval and zero-shot ImageNet classification.

You can find all the original ALIGN checkpoints under the [Kakao Brain](https://huggingface.co/kakaobrain?search_models=align) organization.

> [!TIP]
> Click on the ALIGN models in the right sidebar for more examples of how to apply ALIGN to different vision and text related tasks.

The example below demonstrates zero-shot image classification with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
[ALIGN](https://huggingface.co/papers/2102.05918) is a multi-modal vision and language model utilizing a dual-encoder architecture with EfficientNet for vision and BERT for text. It employs contrastive learning to align visual and text representations using a noisy dataset of over one billion image-alt text pairs. Despite the noise, the scale of the dataset enables state-of-the-art performance in image classification and image-text retrieval tasks, surpassing more complex models.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="zero-shot-image-classification",
    model="kakaobrain/align-base",
    device=0,
    dtype=torch.bfloat16
)

candidate_labels = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a person"
]

pipeline = pipeline(task="zero-shot-image-classification", model="kakaobrain/align-base", dtype="auto")
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
```

@@ -66,7 +41,7 @@ from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base", device_map="auto")
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base", dtype="auto")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = requests.get(url, stream=True)

@@ -92,65 +67,8 @@ for label, score in zip(candidate_labels, probs):
```

</hfoption>

</hfoptions>

## Notes

- ALIGN projects the text and visual features into latent space and the dot product between the projected image and text features is used as the similarity score. The example below demonstrates how to calculate the image-text similarity score with [`AlignProcessor`] and [`AlignModel`].

```py
# Example of using ALIGN for image-text similarity
from transformers import AlignProcessor, AlignModel
import torch
from PIL import Image
import requests
from io import BytesIO

# Load processor and model
processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")

# Download image from URL
url = "https://huggingface.co/roschmid/dog-races/resolve/main/images/Golden_Retriever.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))  # Convert the downloaded bytes to a PIL Image

texts = ["a photo of a cat", "a photo of a dog"]

# Process image and text inputs
inputs = processor(images=image, text=texts, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    outputs = model(**inputs)

image_embeds = outputs.image_embeds
text_embeds = outputs.text_embeds

# Normalize embeddings for cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=1, keepdim=True)

# Calculate similarity scores
similarity_scores = torch.matmul(text_embeds, image_embeds.T)

# Print raw scores
print("Similarity scores:", similarity_scores)

# Convert to probabilities
probs = torch.nn.functional.softmax(similarity_scores, dim=0)
print("Probabilities:", probs)

# Get the most similar text
|
||||
most_similar_idx = similarity_scores.argmax().item()
|
||||
print(f"Most similar text: '{texts[most_similar_idx]}'")
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
- Refer to the [Kakao Brain’s Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.
|
||||
|
||||
## AlignConfig
|
||||
|
||||
[[autodoc]] AlignConfig
|
||||
@ -183,3 +101,4 @@ for label, score in zip(candidate_labels, probs):
|
||||
|
||||
[[autodoc]] AlignVisionModel
|
||||
- forward
|
||||
|
||||
|
@ -13,35 +13,37 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-01-04.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-01-04 and contributed by [jongjyh](https://huggingface.co/jongjyh).*
|
||||
|
||||
# AltCLIP
|
||||
|
||||
[AltCLIP](https://huggingface.co/papers/2211.06679) replaces the [CLIP](./clip) text encoder with a multilingual XLM-R encoder and aligns image and text representations with teacher learning and contrastive learning.
|
||||
[AltCLIP](https://huggingface.co/papers/2211.06679v2) alters the text encoder in CLIP by replacing it with a pretrained multilingual text encoder XLM-R. This modification enables the model to achieve state-of-the-art performance on tasks such as ImageNet-CN, Flickr30k-CN, and COCO-CN, while maintaining performance close to CLIP on other tasks. The approach involves a two-stage training schema with teacher learning and contrastive learning to align language and image representations, extending CLIP's capabilities to multilingual understanding.
|
||||
|
||||
You can find all the original AltCLIP checkpoints under the [AltClip](https://huggingface.co/collections/BAAI/alt-clip-diffusion-66987a97de8525205f1221bf) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the AltCLIP models in the right sidebar for more examples of how to apply AltCLIP to different tasks.
|
||||
|
||||
The examples below demonstrate how to calculate similarity scores between an image and one or more captions with the [`Pipeline`] or [`AutoModel`] classes.
|
||||
This model was contributed by [jongjyh](https://huggingface.co/jongjyh).
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="zero-shot-image-classification", model="kakaobrain/align-base", dtype="auto")
|
||||
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AltCLIPModel, AltCLIPProcessor
|
||||
from transformers import AltCLIPModel, AutoProcessor
|
||||
|
||||
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype=torch.bfloat16)
|
||||
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
|
||||
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
@ -49,8 +51,8 @@ image = Image.open(requests.get(url, stream=True).raw)
|
||||
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
||||
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
||||
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
||||
logits_per_image = outputs.logits_per_image
|
||||
probs = logits_per_image.softmax(dim=1)
|
||||
|
||||
labels = ["a photo of a cat", "a photo of a dog"]
|
||||
for label, prob in zip(labels, probs[0]):
|
||||
@ -60,48 +62,10 @@ for label, prob in zip(labels, probs[0]):
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
|
||||
```python
|
||||
# !pip install torchao
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AltCLIPModel, AltCLIPProcessor, TorchAoConfig
|
||||
|
||||
model = AltCLIPModel.from_pretrained(
|
||||
"BAAI/AltCLIP",
|
||||
quantization_config=TorchAoConfig("int4_weight_only", group_size=128),
|
||||
dtype=torch.bfloat16,
|
||||
)
|
||||
|
||||
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
||||
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
||||
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
||||
|
||||
labels = ["a photo of a cat", "a photo of a dog"]
|
||||
for label, prob in zip(labels, probs[0]):
|
||||
print(f"{label}: {prob.item():.4f}")
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- AltCLIP uses bidirectional attention instead of causal attention and it uses the `[CLS]` token in XLM-R to represent a text embedding.
|
||||
- Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model.
|
||||
- [`AltCLIPProcessor`] combines [`CLIPImageProcessor`] and [`XLMRobertaTokenizer`] into a single instance to encode text and prepare images, as the short sketch below shows.
|
||||
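A quick way to see this composition is to inspect the processor's components. This is a minimal sketch; the exact tokenizer class loaded may be the fast variant.

```py
from transformers import AltCLIPProcessor

processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

# AltCLIPProcessor wraps a CLIP image processor and an XLM-R tokenizer
print(type(processor.image_processor).__name__)  # CLIPImageProcessor
print(type(processor.tokenizer).__name__)        # XLMRobertaTokenizer or XLMRobertaTokenizerFast
```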
|
||||
## AltCLIPConfig
|
||||
|
||||
[[autodoc]] AltCLIPConfig
|
||||
- from_text_vision_configs
|
||||
|
||||
## AltCLIPTextConfig
|
||||
|
||||
@ -111,18 +75,24 @@ for label, prob in zip(labels, probs[0]):
|
||||
|
||||
[[autodoc]] AltCLIPVisionConfig
|
||||
|
||||
## AltCLIPProcessor
|
||||
|
||||
[[autodoc]] AltCLIPProcessor
|
||||
|
||||
## AltCLIPModel
|
||||
|
||||
[[autodoc]] AltCLIPModel
|
||||
- forward
|
||||
- get_text_features
|
||||
- get_image_features
|
||||
|
||||
## AltCLIPTextModel
|
||||
|
||||
[[autodoc]] AltCLIPTextModel
|
||||
- forward
|
||||
|
||||
## AltCLIPVisionModel
|
||||
|
||||
[[autodoc]] AltCLIPVisionModel
|
||||
- forward
|
||||
|
||||
## AltCLIPProcessor
|
||||
|
||||
[[autodoc]] AltCLIPProcessor
|
||||
|
@ -13,28 +13,20 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2025-09-02 and added to Hugging Face Transformers on 2025-08-28.*
|
||||
|
||||
# Apertus
|
||||
*This model was released on 2025-09-02 and added to Hugging Face Transformers on 2025-10-07.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Overview
|
||||
# Apertus
|
||||
|
||||
[Apertus](https://www.swiss-ai.org) is a family of large language models from the Swiss AI Initiative.
|
||||
|
||||
> [!TIP]
|
||||
> Coming soon
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class, and from the command line.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
@ -42,13 +34,8 @@ The example below demonstrates how to generate text with [`Pipeline`] or the [`A
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text-generation",
|
||||
model="swiss-ai/Apertus-8B",
|
||||
dtype=torch.bfloat16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Plants create energy through a process known as")
|
||||
pipeline = pipeline(task="text-generation", model="swiss-ai/Apertus-8B", dtype="auto")
|
||||
pipeline("Plants generate energy through a process known as ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -56,28 +43,15 @@ pipeline("Plants create energy through a process known as")
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"swiss-ai/Apertus-8B",
|
||||
)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"swiss-ai/Apertus-8B",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
|
||||
tokenizer = AutoTokenizer.from_pretrained("swiss-ai/Apertus-8B")
|
||||
model = ArceeForCausalLM.from_pretrained("swiss-ai/Apertus-8B", dtype="auto")
|
||||
|
||||
output = model.generate(**input_ids)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model swiss-ai/Apertus-8B --device 0
|
||||
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
|
||||
with torch.no_grad():
|
||||
outputs = model.generate(**inputs, max_new_tokens=50)
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
@ -17,7 +17,6 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -29,11 +28,6 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
The Arcee model is architecturally similar to Llama but uses `x * relu(x)` in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
|
||||
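The snippet below is a minimal sketch of that activation pattern, not the exact Arcee implementation; the layer names and sizes are illustrative.

```py
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Illustrative MLP block using the x * relu(x) ("relu squared") activation."""
    def __init__(self, hidden_size=64, intermediate_size=256):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, hidden_states):
        x = self.up_proj(hidden_states)
        x = x * torch.relu(x)  # equivalent to relu(x) ** 2
        return self.down_proj(x)

mlp = ReluSquaredMLP()
print(mlp(torch.randn(1, 4, 64)).shape)  # torch.Size([1, 4, 64])
```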
|
||||
> [!TIP]
|
||||
> The Arcee model supports extended context with RoPE scaling and all standard transformers features including Flash Attention 2, SDPA, gradient checkpointing, and quantization support.
|
||||
|
||||
The example below demonstrates how to generate text with Arcee using [`Pipeline`] or the [`AutoModel`].
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
@ -41,15 +35,8 @@ The example below demonstrates how to generate text with Arcee using [`Pipeline`
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text-generation",
|
||||
model="arcee-ai/AFM-4.5B",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
|
||||
output = pipeline("The key innovation in Arcee is")
|
||||
print(output[0]["generated_text"])
|
||||
pipeline = pipeline(task="text-generation", model="arcee-ai/AFM-4.5B", dtype="auto")
|
||||
pipeline("Plants generate energy through a process known as ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -57,16 +44,12 @@ print(output[0]["generated_text"])
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, ArceeForCausalLM
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/AFM-4.5B")
|
||||
model = ArceeForCausalLM.from_pretrained(
|
||||
"arcee-ai/AFM-4.5B",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
model = ArceeForCausalLM.from_pretrained("arcee-ai/AFM-4.5B", dtype="auto")
|
||||
|
||||
inputs = tokenizer("The key innovation in Arcee is", return_tensors="pt")
|
||||
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
|
||||
with torch.no_grad():
|
||||
outputs = model.generate(**inputs, max_new_tokens=50)
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
@ -102,4 +85,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
## ArceeForTokenClassification
|
||||
|
||||
[[autodoc]] ArceeForTokenClassification
|
||||
- forward
|
||||
- forward
|
@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06.*
|
||||
*This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06 and contributed by [m-ric](https://huggingface.co/m-ric).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -25,48 +24,27 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Aria
|
||||
|
||||
[Aria](https://huggingface.co/papers/2410.05993) is a multimodal mixture-of-experts (MoE) model. The goal of this model is to open-source a training recipe for creating a multimodal native model from scratch. Aria has 3.9B and 3.5B activated parameters per visual and text token respectively. Text is handled by a MoE decoder and visual inputs are handled by a lightweight visual encoder. It is trained in 4 stages, language pretraining, multimodal pretraining, multimodal long-context pretraining, and multimodal post-training.
|
||||
|
||||
You can find all the original Aria checkpoints under the [Aria](https://huggingface.co/rhymes-ai?search_models=aria) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Aria models in the right sidebar for more examples of how to apply Aria to different multimodal tasks.
|
||||
|
||||
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Aria](https://huggingface.co/papers/2410.05993) is an open multimodal-native model designed to integrate diverse information sources and deliver comprehensive understanding. It employs a Mixture-of-Experts architecture with 3.9B and 3.5B activated parameters per visual and text token, respectively. Aria outperforms models like Pixtral-12B and Llama3.2-11B across various multimodal, language, and coding tasks. The model is pre-trained through a 4-stage pipeline that enhances language understanding, multimodal capabilities, long context handling, and instruction following. Aria's weights and codebase are open-sourced to facilitate adoption and adaptation in real-world applications.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
"image-to-text",
|
||||
model="rhymes-ai/Aria",
|
||||
device=0,
|
||||
dtype=torch.bfloat16
|
||||
)
|
||||
pipeline(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
|
||||
text="What is shown in this image?"
|
||||
)
|
||||
pipeline = pipeline(task="image-to-text", model="rhymes-ai/Aria", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", text="What is shown in this image?")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoProcessor
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"rhymes-ai/Aria",
|
||||
device_map="auto",
|
||||
dtype=torch.bfloat16,
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("rhymes-ai/Aria", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")
|
||||
|
||||
messages = [
|
||||
@ -81,8 +59,7 @@ messages = [
|
||||
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
|
||||
inputs = inputs.to(model.device, torch.bfloat16)
|
||||
|
||||
output = model.generate(
|
||||
**inputs,
|
||||
output = model.generate(**inputs,
|
||||
max_new_tokens=15,
|
||||
stop_strings=["<|im_end|>"],
|
||||
tokenizer=processor.tokenizer,
|
||||
@ -97,51 +74,6 @@ print(response)
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4 and the [rhymes-ai/Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) checkpoint. This checkpoint replaces grouped GEMM with `torch.nn.Linear` layers for easier quantization.
|
||||
|
||||
```py
|
||||
# pip install torchao
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoProcessor
|
||||
|
||||
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"rhymes-ai/Aria-sequential_mlp",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
processor = AutoProcessor.from_pretrained(
|
||||
"rhymes-ai/Aria-sequential_mlp",
|
||||
)
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user", "content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]
|
||||
},
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
|
||||
inputs = inputs.to(model.device, torch.bfloat16)
|
||||
|
||||
output = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=15,
|
||||
stop_strings=["<|im_end|>"],
|
||||
tokenizer=processor.tokenizer,
|
||||
do_sample=True,
|
||||
temperature=0.9,
|
||||
)
|
||||
output_ids = output[0][inputs["input_ids"].shape[1]:]
|
||||
response = processor.decode(output_ids, skip_special_tokens=True)
|
||||
print(response)
|
||||
```
|
||||
|
||||
## AriaImageProcessor
|
||||
|
||||
[[autodoc]] AriaImageProcessor
|
||||
@ -162,15 +94,17 @@ print(response)
|
||||
|
||||
[[autodoc]] AriaTextModel
|
||||
|
||||
## AriaModel
|
||||
|
||||
[[autodoc]] AriaModel
|
||||
|
||||
## AriaTextForCausalLM
|
||||
|
||||
[[autodoc]] AriaTextForCausalLM
|
||||
|
||||
## AriaModel
|
||||
|
||||
[[autodoc]] AriaModel
|
||||
- forward
|
||||
|
||||
## AriaForConditionalGeneration
|
||||
|
||||
[[autodoc]] AriaForConditionalGeneration
|
||||
- forward
|
||||
|
||||
|
@ -13,82 +13,55 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-04-05 and added to Hugging Face Transformers on 2022-11-21.*
|
||||
*This model was released on 2021-04-05 and added to Hugging Face Transformers on 2022-11-21 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Audio Spectrogram Transformer
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Audio Spectrogram Transformer](https://huggingface.co/papers/2104.01778) applies a Vision Transformer to audio by converting audio into spectrograms, achieving state-of-the-art results in audio classification without using convolutional layers. It outperforms existing models on benchmarks like AudioSet, ESC-50, and Speech Commands V2, demonstrating the effectiveness of purely attention-based models in this domain.
|
||||
|
||||
## Overview
|
||||
|
||||
The Audio Spectrogram Transformer model was proposed in [AST: Audio Spectrogram Transformer](https://huggingface.co/papers/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
The Audio Spectrogram Transformer applies a [Vision Transformer](vit) to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results
|
||||
for audio classification.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/audio_spectogram_transformer_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Audio Spectrogram Transformer architecture. Taken from the <a href="https://huggingface.co/papers/2104.01778">original paper</a>.</small>
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/YuanGongND/ast).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, it's recommended to take care of the input normalization (to make
|
||||
sure the input has mean of 0 and std of 0.5). [`ASTFeatureExtractor`] takes care of this. Note that it uses the AudioSet
|
||||
mean and std by default. You can check [`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py) to see how
|
||||
the authors compute the stats for a downstream dataset. A short normalization example follows this list.
|
||||
- Note that the AST needs a low learning rate (the authors use a 10 times smaller learning rate compared to their CNN model proposed in the
|
||||
[PSLA paper](https://huggingface.co/papers/2102.01243)) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task.
|
||||
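The sketch below shows where those normalization stats plug in. The `mean` and `std` values here are placeholders; replace them with the statistics computed on your own dataset.

```py
import numpy as np
from transformers import ASTFeatureExtractor

# The defaults are the AudioSet mean/std; override them with your dataset's stats
feature_extractor = ASTFeatureExtractor.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    mean=-4.27,  # placeholder: your dataset mean
    std=4.57,    # placeholder: your dataset std
)

waveform = np.zeros(16000, dtype=np.float32)  # 1 second of dummy audio at 16kHz
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)
```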
|
||||
### Using Scaled Dot Product Attention (SDPA)
|
||||
|
||||
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
|
||||
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
|
||||
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
|
||||
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
|
||||
page for more information.
|
||||
|
||||
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
|
||||
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
from transformers import ASTForAudioClassification
|
||||
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", dtype=torch.float16)
|
||||
...
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="audio-classification",model="MIT/ast-finetuned-audioset-10-10-0.4593", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
|
||||
```
|
||||
|
||||
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel"
|
||||
|
||||
On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and `MIT/ast-finetuned-audioset-10-10-0.4593` model, we saw the following speedups during inference.
|
||||
```py
|
||||
import torch
|
||||
from datasets import load_dataset
|
||||
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
|
||||
|
||||
| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
|
||||
|--------------|-------------------------------------------|-------------------------------------------|------------------------------|
|
||||
| 1 | 27 | 6 | 4.5 |
|
||||
| 2 | 12 | 6 | 2 |
|
||||
| 4 | 21 | 8 | 2.62 |
|
||||
| 8 | 40 | 14 | 2.86 |
|
||||
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation").sort("id")
|
||||
sampling_rate = dataset.features["audio"].sampling_rate
|
||||
|
||||
## Resources
|
||||
feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
|
||||
model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with the Audio Spectrogram Transformer.
|
||||
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
|
||||
|
||||
<PipelineTag pipeline="audio-classification"/>
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
- A notebook illustrating inference with AST for audio classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/AST).
|
||||
- [`ASTForAudioClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb).
|
||||
- See also: [Audio classification](../tasks/audio_classification).
|
||||
predicted_class_ids = torch.argmax(logits, dim=-1).item()
|
||||
print(f"Predicted label: {model.config.id2label[predicted_class_ids]}")
|
||||
```
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ASTConfig
|
||||
|
||||
@ -108,3 +81,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] ASTForAudioClassification
|
||||
- forward
|
||||
|
||||
|
@ -29,7 +29,7 @@ model = AutoModel.from_pretrained("google-bert/bert-base-cased")
|
||||
|
||||
will create a model that is an instance of [`BertModel`].
|
||||
|
||||
There is one class of `AutoModel` for each task.
|
||||
There is one class of `AutoModel` for each task, and for each backend (PyTorch, TensorFlow, or Flax).
|
||||
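For example, loading the same checkpoint through a task-specific auto class returns the model with the matching head. This is a minimal sketch; the classification head is newly initialized and still needs fine-tuning.

```py
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased")
print(type(model).__name__)  # BertForSequenceClassification
```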
|
||||
## Extending the Auto Classes
|
||||
|
||||
@ -48,7 +48,7 @@ You will then be able to use the auto classes like you would usually do!
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
If your `NewModelConfig` is a subclass of [`~transformers.PreTrainedConfig`], make sure its
|
||||
If your `NewModelConfig` is a subclass of [`~transformers.PretrainedConfig`], make sure its
|
||||
`model_type` attribute is set to the same key you use when registering the config (here `"new-model"`).
|
||||
|
||||
Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
|
||||
@ -73,14 +73,14 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
|
||||
|
||||
[[autodoc]] AutoImageProcessor
|
||||
|
||||
## AutoVideoProcessor
|
||||
|
||||
[[autodoc]] AutoVideoProcessor
|
||||
|
||||
## AutoProcessor
|
||||
|
||||
[[autodoc]] AutoProcessor
|
||||
|
||||
## AutoVideoProcessor
|
||||
|
||||
[[autodoc]] AutoVideoProcessor
|
||||
|
||||
## Generic model classes
|
||||
|
||||
The following auto classes are available for instantiating a base model class without a specific head.
|
||||
@ -161,10 +161,6 @@ The following auto classes are available for the following computer vision tasks
|
||||
|
||||
[[autodoc]] AutoModelForKeypointDetection
|
||||
|
||||
### AutoModelForKeypointMatching
|
||||
|
||||
[[autodoc]] AutoModelForKeypointMatching
|
||||
|
||||
### AutoModelForMaskedImageModeling
|
||||
|
||||
[[autodoc]] AutoModelForMaskedImageModeling
|
||||
@ -201,6 +197,10 @@ The following auto classes are available for the following computer vision tasks
|
||||
|
||||
[[autodoc]] AutoModelForZeroShotObjectDetection
|
||||
|
||||
### AutoModelForKeypointMatching
|
||||
|
||||
[[autodoc]] AutoModelForKeypointMatching
|
||||
|
||||
## Audio
|
||||
|
||||
The following auto classes are available for the following audio tasks.
|
||||
@ -261,8 +261,6 @@ The following auto classes are available for the following multimodal tasks.
|
||||
|
||||
[[autodoc]] AutoModelForImageTextToText
|
||||
|
||||
## Time Series
|
||||
|
||||
### AutoModelForTimeSeriesPrediction
|
||||
|
||||
[[autodoc]] AutoModelForTimeSeriesPrediction
|
||||
|
@ -13,32 +13,39 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30.*
|
||||
*This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30 and contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif).*
|
||||
|
||||
# Autoformer
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://huggingface.co/papers/2106.13008) addresses the challenge of long-term time series forecasting by introducing a novel decomposition architecture. Autoformer integrates an Auto-Correlation mechanism that progressively decomposes trend and seasonal components, enhancing the model's ability to capture intricate temporal patterns. This approach surpasses traditional self-attention methods in both efficiency and accuracy, achieving state-of-the-art results with a 38% relative improvement across six benchmarks in diverse applications including energy, traffic, economics, weather, and disease forecasting.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="AutoformerForPrediction">
|
||||
|
||||
The Autoformer model was proposed in [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://huggingface.co/papers/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
```py
|
||||
import torch
|
||||
from huggingface_hub import hf_hub_download
|
||||
from transformers import AutoformerForPrediction
|
||||
|
||||
This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.
|
||||
file = hf_hub_download(
|
||||
repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
|
||||
)
|
||||
batch = torch.load(file)
|
||||
|
||||
The abstract from the paper is the following:
|
||||
model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly", dtype="auto")
|
||||
outputs = model.generate(
|
||||
past_values=batch["past_values"],
|
||||
past_time_features=batch["past_time_features"],
|
||||
past_observed_mask=batch["past_observed_mask"],
|
||||
static_categorical_features=batch["static_categorical_features"],
|
||||
future_time_features=batch["future_time_features"],
|
||||
)
|
||||
|
||||
*Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.*
|
||||
mean_prediction = outputs.sequences.mean(dim=1)
|
||||
```
|
||||
|
||||
This model was contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif).
|
||||
The original code can be found [here](https://github.com/thuml/Autoformer).
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
- Check out the Autoformer blog-post in HuggingFace blog: [Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)](https://huggingface.co/blog/autoformer)
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## AutoformerConfig
|
||||
|
||||
@ -53,3 +60,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] AutoformerForPrediction
|
||||
- forward
|
||||
|
||||
|
@ -13,250 +13,64 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2025-05-13 and added to Hugging Face Transformers on 2025-03-04.*
|
||||
*This model was released on 2025-05-13 and added to Hugging Face Transformers on 2025-03-04 and contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
# AyaVision
|
||||
|
||||
# Aya Vision
|
||||
|
||||
[Aya Vision](https://huggingface.co/papers/2505.08751) is a family of open-weight multimodal vision-language models from Cohere Labs. It is trained with a synthetic annotation framework that generates high-quality multilingual image captions, improving Aya Vision's generated responses. In addition, a cross-modal model merging technique is used to prevent the model from losing its text capabilities after adding vision capabilities. The model combines a CommandR-7B language model with a SigLIP vision encoder.
|
||||
|
||||
You can find all the original Aya Vision checkpoints under the [Aya Vision](https://huggingface.co/collections/CohereLabs/cohere-labs-aya-vision-67c4ccd395ca064308ee1484) collection.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan).
|
||||
>
|
||||
> Click on the Aya Vision models in the right sidebar for more examples of how to apply Aya Vision to different image-to-text tasks.
|
||||
|
||||
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Aya Vision](https://huggingface.co/papers/2505.08751) introduces two key innovations for multilingual multimodal learning: a synthetic annotation framework that generates high-quality, diverse instruction data across languages, and a cross-modal model merging technique that prevents catastrophic forgetting while preserving strong text-only performance. These methods enable effective alignment between vision and language without degrading existing capabilities. Aya-Vision-8B surpasses comparable models like Qwen-2.5-VL-7B, Pixtral-12B, and even larger models such as Llama-3.2-90B-Vision, while the larger Aya-Vision-32B outperforms models more than twice its size, including Molmo-72B. The approach demonstrates efficient scaling and state-of-the-art multilingual multimodal performance with reduced computational demands.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(model="CohereLabs/aya-vision-8b", task="image-text-to-text", device_map="auto")
|
||||
|
||||
# Format message with the aya-vision chat template
|
||||
pipeline = pipeline(task="image-text-to-text", model="CohereLabs/aya-vision-8b", dtype="auto")
|
||||
messages = [
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
|
||||
{"type": "text", "text": "Bu resimde hangi anıt gösterilmektedir?"},
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "Que montre cette image?"},
|
||||
]},
|
||||
]
|
||||
outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
|
||||
print(outputs)
|
||||
]
|
||||
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
# pip install 'git+https://github.com/huggingface/transformers.git@v4.49.0-Aya Vision'
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
model_id = "CohereLabs/aya-vision-8b"
|
||||
processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-8b)
|
||||
model = AutoModelForImageTextToText.from_pretrained("CohereLabs/aya-vision-8b", dtype="auto")
|
||||
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
model_id, device_map="auto", dtype=torch.float16
|
||||
)
|
||||
|
||||
# Format message with the aya-vision chat template
|
||||
messages = [
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
|
||||
{"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "Que montre cette image?"},
|
||||
]},
|
||||
]
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
|
||||
).to(model.device)
|
||||
)
|
||||
|
||||
gen_tokens = model.generate(
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=300,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
|
||||
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
|
||||
print(processor.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory footprint of large models by representing weights at lower precision. Refer to the [Quantization](../quantization/overview) overview for supported backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import (
|
||||
AutoProcessor,
|
||||
AutoModelForImageTextToText,
|
||||
BitsAndBytesConfig
|
||||
)
|
||||
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.bfloat16,
|
||||
bnb_4bit_use_double_quant=True
|
||||
)
|
||||
|
||||
processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-32b", use_fast=True)
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
"CohereLabs/aya-vision-32b",
|
||||
quantization_config=bnb_config,
|
||||
device_map="auto"
|
||||
)
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
[
|
||||
{"role": "user", "content": [
|
||||
{"type": "image", "url": "https://huggingface.co/roschmid/dog-races/resolve/main/images/Border_Collie.jpg"},
|
||||
{"type": "text", "text":"Describe what you see."}
|
||||
]}
|
||||
],
|
||||
padding=True,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_tensors="pt"
|
||||
).to(model.device)
|
||||
|
||||
generated = model.generate(**inputs, max_new_tokens=50)
|
||||
print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Images are represented with the `<image>` tag in the chat template.
|
||||
|
||||
- Use the [`~ProcessorMixin.apply_chat_template`] method to correctly format inputs.
|
||||
|
||||
- The example below demonstrates inference with multiple images.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
"CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
|
||||
)
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
|
||||
},
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": "These images depict two different landmarks. Can you identify them?",
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
|
||||
).to(model.device)
|
||||
|
||||
gen_tokens = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=300,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
|
||||
gen_text = processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
|
||||
print(gen_text)
|
||||
```
|
||||
|
||||
- The example below demonstrates inference with batched inputs.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
"CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
|
||||
)
|
||||
|
||||
batch_messages = [
|
||||
[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
|
||||
{"type": "text", "text": "Write a haiku for this image"},
|
||||
],
|
||||
},
|
||||
],
|
||||
[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
|
||||
},
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": "These images depict two different landmarks. Can you identify them?",
|
||||
},
|
||||
],
|
||||
},
|
||||
],
|
||||
]
|
||||
|
||||
batch_inputs = processor.apply_chat_template(
|
||||
batch_messages,
|
||||
padding=True,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
).to(model.device)
|
||||
|
||||
batch_outputs = model.generate(
|
||||
**batch_inputs,
|
||||
max_new_tokens=300,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
|
||||
for i, output in enumerate(batch_outputs):
|
||||
response = processor.tokenizer.decode(
|
||||
output[batch_inputs.input_ids.shape[1]:],
|
||||
skip_special_tokens=True
|
||||
)
|
||||
print(f"Response {i+1}:\n{response}\n")
|
||||
```
|
||||
|
||||
## AyaVisionProcessor
|
||||
|
||||
[[autodoc]] AyaVisionProcessor
|
||||
@ -268,6 +82,7 @@ print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
|
||||
## AyaVisionModel
|
||||
|
||||
[[autodoc]] AyaVisionModel
|
||||
- forward
|
||||
|
||||
## AyaVisionForConditionalGeneration
|
||||
|
||||
|
@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-12-18 and added to Hugging Face Transformers on 2024-12-19.*
|
||||
*This model was released on 2024-12-18 and added to Hugging Face Transformers on 2024-12-19 and contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -25,106 +24,52 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Bamba
|
||||
|
||||
[Bamba](https://huggingface.co/blog/bamba) is a 9B parameter decoder-only language model built on the [Mamba-2](./mamba2) architecture. It is pretrained in two stages: 2T tokens from the [Dolma v1.7](https://huggingface.co/datasets/allenai/dolma) dataset, followed by an additional 200B tokens from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia).

By interleaving Mamba-2 layers with Transformer layers, Bamba avoids the memory bottleneck of a growing KV cache and reaches up to 2.5x higher throughput and 2x lower latency in vLLM. The full training recipes and checkpoints are released for reproducibility, along with extras such as a stateless shuffle dataloader and quantization support. Bamba integrates with Transformers, TRL, vLLM, and llama.cpp, and was developed in collaboration with IBM, Princeton, CMU, and UIUC as an open, efficient foundation for experimenting with hybrid architectures.

You can find all the original Bamba checkpoints under the [Bamba](https://huggingface.co/collections/ibm-ai-platform/bamba-674f1388b9bbc98b413c7bab) collection.

> [!TIP]
> This model was contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).
>
> Click on the Bamba models in the right sidebar for more examples of how to apply Bamba to different text generation tasks.

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="text-generation",
    model="ibm-ai-platform/Bamba-9B-v2",
    dtype=torch.bfloat16,
    device=0
)
pipeline("Plants create energy through a process known as")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-ai-platform/Bamba-9B-v2",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)

inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

</hfoption>
<hfoption id="transformers CLI">

```bash
echo "Plants create energy through a process known as" | transformers run --task text-generation --model ibm-ai-platform/Bamba-9B-v2 --device 0
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-ai-platform/Bamba-9B-v2",
    quantization_config=quantization_config,
    device_map="auto",
    attn_implementation="sdpa"
)

inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
output = model.generate(**inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Notes

- Bamba supports padding-free training. This concatenates distinct training examples while still processing inputs as separate batches. It can accelerate inference by [~2x](https://github.com/huggingface/transformers/pull/35861#issue-2807873129) (depending on the model and data distribution) and reduces memory usage when examples have varying lengths, since it avoids the compute and memory overhead of padding tokens.

  Padding-free training requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d` packages, and the following arguments must be passed to the model in addition to `input_ids` and `labels`.

  - `position_ids: torch.LongTensor`: the position index of each token in each sequence.
  - `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
  - Each of the [`FlashAttentionKwargs`]:
    - `cu_seq_lens_q: torch.LongTensor`: the cumulative sequence lengths of all queries.
    - `cu_seq_lens_k: torch.LongTensor`: the cumulative sequence lengths of all keys.
    - `max_length_q: int`: the longest query length in the batch.
    - `max_length_k: int`: the longest key length in the batch.

  Don't provide `attention_mask` inputs. The [`DataCollatorWithFlattening`] generates the arguments above automatically when you set `return_seq_idx=True` and `return_flash_attn_kwargs=True`. See the [Improving Hugging Face Training Efficiency Through Packing with Flash Attention](https://huggingface.co/blog/packing-with-FA2) blog post for details, and the training sketch after this list for one way to wire the collator in.

  ```python
  from transformers import DataCollatorWithFlattening

  # Generates position_ids, seq_idx, and the FlashAttention kwargs for padding-free training
  data_collator = DataCollatorWithFlattening(
      return_seq_idx=True,
      return_flash_attn_kwargs=True
  )
  ```
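
Below is a minimal training sketch showing where the collator plugs in. It isn't from the original docs: the tiny in-memory dataset and the training arguments are placeholders, and it assumes `flash-attn` is installed and a GPU is available.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
model = AutoModelForCausalLM.from_pretrained(
    "ibm-ai-platform/Bamba-9B-v2",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # required so the FlashAttention kwargs are consumed
)

# placeholder dataset: any map-style dataset of dicts with input_ids and labels works
texts = [
    "Plants create energy through a process known as photosynthesis.",
    "Bamba interleaves Mamba-2 and attention layers for faster inference.",
]
train_dataset = [{"input_ids": ids, "labels": ids} for ids in (tokenizer(t)["input_ids"] for t in texts)]

# no attention_mask: the collator emits position_ids, seq_idx, and the FlashAttention kwargs instead
data_collator = DataCollatorWithFlattening(return_seq_idx=True, return_flash_attn_kwargs=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bamba-padding-free", per_device_train_batch_size=2, bf16=True),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```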

## BambaConfig

*This model was released on 2023-04-09 and added to Hugging Face Transformers on 2023-07-17.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
    </div>
</div>

# Bark

[Bark](https://github.com/suno-ai/bark) is a text-to-audio generative model capable of producing realistic speech, music, and sound effects directly from text prompts. It's built on a transformer-based architecture that models audio tokens rather than phonemes, letting it capture tone, emotion, and multilingual speech without explicit linguistic preprocessing. Bark uses semantic and coarse acoustic tokens, trained on diverse multilingual datasets, to generate natural prosody and expressive delivery. Its outputs are decoded from discrete audio representations, similar in spirit to models like EnCodec or VALL-E, allowing highly expressive and context-aware audio synthesis.

## Overview

[Bark](https://huggingface.co/suno/bark) is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).

Bark is made of 4 main models:

- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of [`BarkSemanticModel`] as input and predicts the first two audio codebooks required by EnCodec.
- [`BarkFineModel`] (the 'fine acoustics' model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.
- After all the codebook channels are predicted, Bark uses [`EncodecModel`] to decode them into the output audio array.

Each of the first three modules can take conditional speaker embeddings to condition the output sound on a specific predefined voice.

This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
The original code can be found [here](https://github.com/suno-ai/bark).

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-to-audio", model="suno/bark-small", dtype="auto")
output = pipeline("Plants create energy through a process known as photosynthesis.")
audio = output["audio"]
```

</hfoption>
<hfoption id="BarkModel">

```py
import torch
from scipy.io.wavfile import write as write_wav
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark", dtype="auto")

inputs = processor("Plants create energy through a process known as photosynthesis.", voice_preset="v2/en_speaker_6")
audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()
sample_rate = model.generation_config.sample_rate
write_wav("bark_generation.wav", sample_rate, audio_array)
```

</hfoption>
</hfoptions>

### Optimizing Bark

Bark can be optimized with just a few extra lines of code, which **significantly reduces its memory footprint** and **accelerates inference**.

#### Using half-precision

Speed up inference and reduce the memory footprint by 50% by loading the model in half-precision.

```python
import torch
from transformers import BarkModel
from accelerate import Accelerator

device = Accelerator().device
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16).to(device)
```

#### Using CPU offload

As mentioned above, Bark is made up of 4 sub-models, which are called sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.

If you're using a CUDA GPU or Intel XPU, offload the idle sub-models to the CPU to benefit from an 80% reduction in memory footprint. This operation, called *CPU offloading*, takes one line of code:

```python
model.enable_cpu_offload()
```

Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)

#### Using Flash Attention 2

Flash Attention 2 is an even faster, optimized version of the previous optimization.

##### Installation

First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features).
Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

##### Usage

To load a model using Flash Attention 2, pass the `attn_implementation="flash_attention_2"` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). Load the model in half-precision (e.g. `torch.float16`) as well, since it causes almost no degradation in audio quality but significantly lowers memory usage and speeds up inference:

```python
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
```

##### Performance comparison

The following diagram shows the latency for the native attention implementation (no optimisation) against Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/ylacombe/benchmark-comparison/resolve/main/Bark%20Optimization%20Benchmark.png">
</div>

To put this into perspective, on an NVIDIA A100 and when generating 400 semantic tokens with a batch size of 16, you can get 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still be 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples will be generated 17 times faster.

#### Combining optimization techniques

You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 all at once.

```python
import torch
from transformers import BarkModel
from accelerate import Accelerator

device = Accelerator().device

# load in fp16 and use Flash Attention 2
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2").to(device)

# enable CPU offload
model.enable_cpu_offload()
```

Find out more on inference optimization techniques [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one).

### Usage tips

Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c).
These presets are also uploaded in the hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).

```python
>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")

>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```

Bark can generate highly realistic, **multilingual** speech as well as other audio, including music, background noise, and simple sound effects.

```python
>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的!我会说中文")

>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")

>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```

The model can also produce **nonverbal communications** like laughing, sighing and crying.

```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```

To save the audio, take the sample rate from the model config and use a scipy utility:

```python
>>> from scipy.io.wavfile import write as write_wav

>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
```

## BarkConfig

[[autodoc]] BarkSemanticConfig
    - all

*This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16 and contributed by [sshleifer](https://huggingface.co/sshleifer).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BART

[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It's pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language. Pretrained with techniques like sentence shuffling and span in-filling, it matches RoBERTa on GLUE and SQuAD and sets state-of-the-art results in summarization, dialogue, and question answering.

You can find all the original BART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=bart) organization.

The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="fill-mask",
    model="facebook/bart-large",
    dtype=torch.float16,
    device=0
)
pipeline("Plants create <mask> through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/bart-large",
)
model = AutoModelForMaskedLM.from_pretrained(
    "facebook/bart-large",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"The predicted token is: {predicted_token}")
```

</hfoption>
<hfoption id="transformers CLI">

```bash
echo -e "Plants create <mask> through a process known as photosynthesis." | transformers run --task fill-mask --model facebook/bart-large --device 0
```

</hfoption>
</hfoptions>

## Notes

- Inputs should be padded on the right because BART uses absolute position embeddings.
- The [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) checkpoint doesn't include `mask_token_id`, which means it can't perform mask-filling tasks.
- BART doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
- The forward pass of [`BartModel`] creates the `decoder_input_ids` if they're not passed. This can be different from other model APIs, but it is a useful feature for mask-filling tasks.
- Model predictions are intended to be identical to the original implementation when `forced_bos_token_id=0`. This only works if the text passed to `fairseq.encode` begins with a space. See the sketch after this list for how to pass the token id to [`~GenerationMixin.generate`].
- [`~GenerationMixin.generate`] should be used for conditional generation tasks like summarization. For example:

  ```py
  from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

  model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn", dtype="auto")
  tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

  text = """
  The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.
  """
  inputs = tokenizer(text, return_tensors="pt")
  outputs = model.generate(**inputs, max_length=50)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
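
As a quick sketch of the `forced_bos_token_id` note above (not from the original docs), pass the token id directly to [`~GenerationMixin.generate`] and keep a leading space on the input text:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# note the leading space, matching how fairseq.encode tokenizes the text
inputs = tokenizer(" UN Chief Says There Is No Military Solution in Syria", return_tensors="pt")
outputs = model.generate(**inputs, forced_bos_token_id=0, max_length=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```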

## BartConfig

[[autodoc]] BartForCausalLM
    - forward

*This model was released on 2020-10-23 and added to Hugging Face Transformers on 2020-11-27.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BARThez

[BARThez](https://huggingface.co/papers/2010.12321) is a [BART](./bart) model designed for French language tasks. Unlike existing French BERT models, BARThez includes a pretrained encoder-decoder, allowing it to generate text as well. Evaluated on the FLUE benchmark and the OrangeSum summarization dataset, BARThez performs strongly on both discriminative and generative tasks. A multilingual variant, mBARThez, obtained by continuing the pretraining of multilingual BART on the BARThez corpus, outperforms or matches CamemBERT and FlauBERT.

You can find all of the original BARThez checkpoints under the [BARThez](https://huggingface.co/collections/dascim/barthez-670920b569a07aa53e3b6887) collection.

> [!TIP]
> This model was contributed by [moussakam](https://huggingface.co/moussakam).
> Refer to the [BART](./bart) docs for more usage examples.

The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="fill-mask",
    model="moussaKam/barthez",
    dtype=torch.float16,
    device=0
)
pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "moussaKam/barthez",
)
model = AutoModelForMaskedLM.from_pretrained(
    "moussaKam/barthez",
    dtype=torch.float16,
    device_map="auto",
)
inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"The predicted token is: {predicted_token}")
```

</hfoption>
<hfoption id="transformers CLI">

```bash
echo -e "Les plantes produisent <mask> grâce à un processus appelé photosynthèse." | transformers run --task fill-mask --model moussaKam/barthez --device 0
```

</hfoption>
</hfoptions>

*This model was released on 2021-09-20 and added to Hugging Face Transformers on 2021-10-18.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BARTpho

[BARTpho](https://huggingface.co/papers/2109.09701) is a large-scale Vietnamese sequence-to-sequence model available in two versions, a word-based BARTpho_word and a syllable-based BARTpho_syllable. It is built on the [BART](./bart) large architecture and its denoising pretraining scheme, and it surpasses mBART on Vietnamese text summarization, setting a new state of the art for generative Vietnamese NLP.

You can find all the original checkpoints under the [VinAI](https://huggingface.co/vinai/models?search=bartpho) organization.

> [!TIP]
> This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
> Check out the right sidebar for examples of how to apply BARTpho to different language tasks.

The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="summarization",
    model="vinai/bartpho-word",
    dtype=torch.float16,
    device=0
)

text = """
Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ
"""
pipeline(text)
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import BartForConditionalGeneration, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "vinai/bartpho-word",
)
model = BartForConditionalGeneration.from_pretrained(
    "vinai/bartpho-word",
    dtype=torch.float16,
    device_map="auto",
)

text = """
Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ
"""
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```

</hfoption>
<hfoption id="transformers CLI">

```bash
echo -e "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ" | \
transformers run --task summarization --model vinai/bartpho-word --device 0
```

</hfoption>
</hfoptions>

## Notes

- BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes, as in the sketch after this list.
- This implementation only handles tokenization through the `monolingual_vocab_file`, a Vietnamese-specific subset of token types taken from the multilingual vocabulary. If you want to use this tokenizer for another language, replace `monolingual_vocab_file` with one specialized for your target language.
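
A minimal sketch of that class swap (not from the original docs; it assumes the VinAI checkpoints load directly into the mBART classes):

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer

# the mBART classes add the extra final layer-normalization layer that BARTpho uses
model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

inputs = tokenizer("Thực vật tạo ra năng lượng thông qua một quá trình được gọi là", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=2, max_length=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```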

## BartphoTokenizer

*This model was released on 2021-06-15 and added to Hugging Face Transformers on 2021-08-04.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BEiT

## Overview

The BEiT model was proposed in [BEiT: BERT Pre-Training of Image Transformers](https://huggingface.co/papers/2106.08254) by Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class of an image (as done in the [original ViT paper](https://huggingface.co/papers/2010.11929)), BEiT models are pre-trained to predict visual tokens from the codebook of OpenAI's [DALL-E model](https://huggingface.co/papers/2102.12092) given masked patches.

The abstract from the paper is the following:

*We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).

## Usage tips

- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTImageProcessor`] by [`BeitImageProcessor`] and [`ViTForImageClassification`] by [`BeitForImageClassification`]).
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
- As the BEiT models expect each image to be of the same size (resolution), one can use [`BeitImageProcessor`] to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to pre-train a model from scratch, one needs to set the `use_relative_position_bias` attribute of [`BeitConfig`] to `True` in order to add position embeddings. A configuration sketch follows this list.
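
The sketch below shows that configuration flag in practice; it isn't from the original docs, and the remaining config values are left at their defaults.

```python
from transformers import BeitConfig, BeitForMaskedImageModeling

# from-scratch pre-training setup with relative position bias enabled
config = BeitConfig(use_relative_position_bias=True)
model = BeitForMaskedImageModeling(config)

print(model.config.use_relative_position_bias)
```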

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/beit_architecture.jpg"
alt="drawing" width="600"/>

<small> BEiT pre-training. Taken from the <a href="https://huggingface.co/papers/2106.08254">original paper.</a> </small>

### Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the [official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information.

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

```py
import torch
from transformers import BeitForImageClassification

model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
...
```

For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).

On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.5.1, OS Ubuntu 20.04) with `float16` and the `microsoft/beit-base-patch16-224` model, we saw the following improvements during training and inference.

#### Training

| num_training_steps | batch_size | image_size | is_cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|--------------------|------------|--------------|---------|----------------------------|---------------------------|-------------|----------------------|--------------------|----------------|
| 50 | 2 | (1048, 640) | True | 0.984 | 0.746 | 31.975 | 6738.915 | 4319.886 | 55.998 |

#### Inference

| Image batch size | Eager (s/iter) | Eager CI, % | Eager memory (MB) | SDPA (s/iter) | SDPA CI, % | SDPA memory (MB) | SDPA speedup | SDPA memory saved (%) |
|-------------------:|-----------------:|:--------------|--------------------:|----------------:|:-------------|-------------------:|---------------:|----------------------:|
| 1 | 0.012 | ±0.3% | 3.76657e+08 | 0.011 | ±0.5% | 3.75739e+08 | 1.05 | 0.244 |
| 4 | 0.013 | ±0.1% | 4.03147e+08 | 0.011 | ±0.2% | 3.90554e+08 | 1.178 | 3.225 |
| 16 | 0.045 | ±0.1% | 4.96697e+08 | 0.035 | ±0.1% | 4.51232e+08 | 1.304 | 10.076 |
| 32 | 0.088 | ±0.1% | 6.24417e+08 | 0.066 | ±0.1% | 5.33488e+08 | 1.325 | 17.044 |

The examples below demonstrate image classification with [`Pipeline`] and the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="image-classification", model="microsoft/beit-base-patch16-224-pt22k", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
model = AutoModelForImageClassification.from_pretrained("microsoft/beit-base-patch16-224-pt22k", dtype="auto")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

</hfoption>
</hfoptions>

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BEiT.

<PipelineTag pipeline="image-classification"/>

- [`BeitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)

**Semantic segmentation**

- [Semantic segmentation task guide](../tasks/semantic_segmentation)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## BEiT specific outputs

[[autodoc]] BeitForSemanticSegmentation
    - forward

*This model was released on 2019-07-29 and added to Hugging Face Transformers on 2020-11-16.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BertGeneration

[BertGeneration](https://huggingface.co/papers/1907.12461) leverages pretrained BERT checkpoints for sequence-to-sequence tasks with the [`EncoderDecoderModel`] architecture. BertGeneration adapts [`BERT`] for generative tasks. Initializing both the encoder and decoder from pretrained checkpoints achieves state-of-the-art results in machine translation, text summarization, sentence splitting, and sentence fusion.

You can find all the original BERT checkpoints under the [BERT](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc) collection.

> [!TIP]
> This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
>
> Click on the BertGeneration models in the right sidebar for more examples of how to apply BertGeneration to different sequence generation tasks.

The example below demonstrates how to use BertGeneration with [`EncoderDecoderModel`] for sequence-to-sequence tasks.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="text2text-generation",
    model="google/roberta2roberta_L-24_discofuse",
    dtype=torch.float16,
    device=0
)
pipeline("Plants create energy through ")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import EncoderDecoderModel, AutoTokenizer

model = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")

input_ids = tokenizer(
    "Plants create energy through ", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```

</hfoption>
<hfoption id="transformers CLI">

```bash
echo -e "Plants create energy through " | transformers run --task text2text-generation --model "google/roberta2roberta_L-24_discofuse" --device 0
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [BitsAndBytesConfig](../quantization/bitsandbytes) to quantize the weights to 4-bit.

```python
import torch
from transformers import EncoderDecoderModel, AutoTokenizer, BitsAndBytesConfig

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = EncoderDecoderModel.from_pretrained(
    "google/roberta2roberta_L-24_discofuse",
    quantization_config=quantization_config,
    dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")

input_ids = tokenizer(
    "Plants create energy through ", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```

## Notes

- [`BertGenerationEncoder`] and [`BertGenerationDecoder`] should be used in combination with [`EncoderDecoderModel`] for sequence-to-sequence tasks, as shown below and in the generation sketch that follows this list.

  ```python
  from transformers import BertGenerationEncoder, BertGenerationDecoder, BertTokenizer, EncoderDecoderModel

  # leverage checkpoints for Bert2Bert model
  # use BERT's cls token as BOS token and sep token as EOS token
  encoder = BertGenerationEncoder.from_pretrained("google-bert/bert-large-uncased", bos_token_id=101, eos_token_id=102)
  # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
  decoder = BertGenerationDecoder.from_pretrained(
      "google-bert/bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
  )
  bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)

  # create tokenizer
  tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased")

  input_ids = tokenizer(
      "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
  ).input_ids
  labels = tokenizer("This is a short summary", return_tensors="pt").input_ids

  # train
  loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
  loss.backward()
  ```

- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
- No EOS token should be added to the end of the input for most generation tasks.
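
A rough generation sketch building on the Bert2Bert notes above. It isn't from the original docs, and because this Bert2Bert model hasn't been fine-tuned yet, the generated text is arbitrary; the key step is telling [`EncoderDecoderModel`] which token starts decoding and which one pads.

```python
from transformers import BertGenerationDecoder, BertGenerationEncoder, BertTokenizer, EncoderDecoderModel

encoder = BertGenerationEncoder.from_pretrained("google-bert/bert-large-uncased", bos_token_id=101, eos_token_id=102)
decoder = BertGenerationDecoder.from_pretrained(
    "google-bert/bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
)
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased")

# generation needs to know the decoder start and pad tokens
bert2bert.generation_config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.generation_config.pad_token_id = tokenizer.pad_token_id

input_ids = tokenizer(
    "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
).input_ids
outputs = bert2bert.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```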

## BertGenerationConfig

*This model was released on 2019-03-24 and added to Hugging Face Transformers on 2020-11-16.*

<div class="flex flex-wrap space-x-1">
    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

# BertJapanese

[BERTJapanese](https://github.com/cl-tohoku/bert-japanese) is a collection of pretrained BERT models for Japanese, developed at Tohoku University and released on Hugging Face. The models follow the original BERT architecture, with base models (12 layers, 768 hidden units, 12 heads) and large models (24 layers, 1024 hidden units, 16 heads). Training was performed on large-scale Japanese corpora such as Wikipedia and the Japanese portion of Common Crawl, with different tokenization strategies including subword- and character-based. Multiple versions exist (v1, v2, v3), improving coverage and accuracy for Japanese natural language processing tasks.

This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).

## Overview

There are models with two different tokenization methods (a short tokenizer sketch follows this list):

- Tokenize with MeCab and WordPiece. This requires some extra dependencies: [fugashi](https://github.com/polm/fugashi), a wrapper around [MeCab](https://taku910.github.io/mecab/).
- Tokenize into characters.
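
A small sketch of the two tokenizer flavors (not from the original docs; run the `pip install transformers["ja"]` step below first so MeCab is available):

```python
from transformers import BertJapaneseTokenizer

# MeCab word segmentation followed by WordPiece subwords
wordpiece_tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
print(wordpiece_tokenizer.tokenize("吾輩は猫である。"))

# MeCab word segmentation followed by character-level tokens
character_tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
print(character_tokenizer.tokenize("吾輩は猫である。"))
```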
|
||||
|
||||
To use *MecabTokenizer*, you should `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install
|
||||
from source) to install dependencies.
|
||||
|
||||
See [details on cl-tohoku repository](https://github.com/cl-tohoku/bert-japanese).
|
||||
|
||||
Example of using a model with MeCab and WordPiece tokenization:
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
>>> from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
|
||||
|
||||
>>> ## Input Japanese Text
|
||||
>>> line = "吾輩は猫である。"
|
||||
|
||||
>>> inputs = tokenizer(line, return_tensors="pt")
|
||||
|
||||
>>> print(tokenizer.decode(inputs["input_ids"][0]))
|
||||
[CLS] 吾輩 は 猫 で ある 。 [SEP]
|
||||
|
||||
>>> outputs = bertjapanese(**inputs)
|
||||
```bash
|
||||
!pip install transformers["ja"]
|
||||
```
|
||||
|
||||
Example of using a model with Character tokenization:
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
>>> ## Input Japanese Text
|
||||
>>> line = "吾輩は猫である。"
|
||||
|
||||
>>> inputs = tokenizer(line, return_tensors="pt")
|
||||
|
||||
>>> print(tokenizer.decode(inputs["input_ids"][0]))
|
||||
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]
|
||||
|
||||
>>> outputs = bertjapanese(**inputs)
|
||||
pipeline = pipeline(task="fill-mask", model="tohoku-nlp/bert-base-japanese", dtype="auto")
|
||||
pipeline("植物は[MASK]を光合成と呼ばれる過程を通じて作り出します。")
|
||||
```
|
||||
|
||||
This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
<Tip>
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
This implementation is the same as BERT, except for the tokenization method. Refer to the [BERT documentation](bert) for
|
||||
API reference information.
|
||||
model = AutoModelForMaskedLM.from_pretrained("tohoku-nlp/bert-base-japanese", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese")
|
||||
|
||||
</Tip>
|
||||
inputs = tokenizer("植物は[MASK]を光合成と呼ばれる過程を通じて作り出します。", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BertJapaneseTokenizer
|
||||
|
||||
|
@ -13,25 +13,17 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2018-10-11 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2018-10-11 and added to Hugging Face Transformers on 2020-11-16 and contributed by [thomwolf](https://huggingface.co/thomwolf).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# BERT
|
||||
|
||||
[BERT](https://huggingface.co/papers/1810.04805) is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding. BERT is also very versatile because its learned language representations can be adapted for other NLP tasks by fine-tuning an additional layer or head.
|
||||
|
||||
You can find all the original BERT checkpoints under the [BERT](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the BERT models in the right sidebar for more examples of how to apply BERT to different language tasks.
|
||||
|
||||
The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BERT](https://huggingface.co/papers/1810.04805) introduces a bidirectional transformer model for language representation, pre-trained using masked language modeling and next sentence prediction. BERT achieves state-of-the-art results across various NLP tasks by fine-tuning with minimal task-specific modifications, significantly improving benchmarks like GLUE, MultiNLI, and SQuAD.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -40,12 +32,7 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="google-bert/bert-base-uncased",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline = pipeline(task="fill-mask", model="google-bert/bert-base-uncased", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
@ -56,41 +43,23 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google-bert/bert-base-uncased",
|
||||
)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"google-bert/bert-base-uncased",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model google-bert/bert-base-uncased --device 0
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- Inputs should be padded on the right because BERT uses absolute position embeddings.
|
||||
- Pad inputs on the right. BERT uses absolute position embeddings (see the padding sketch below).
|
||||
|
||||
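A quick way to see the right-padding behavior; this is a sketch using the `google-bert/bert-base-uncased` tokenizer from the examples above.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
batch = tokenizer(
    ["Plants create energy.", "Plants create energy through photosynthesis."],
    padding=True,
    return_tensors="pt",
)
# trailing zeros in the attention mask show the padding sits on the right
print(batch["attention_mask"])
```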
## BertConfig
|
||||
|
||||
@ -109,6 +78,12 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
|
||||
|
||||
[[autodoc]] BertTokenizerFast
|
||||
|
||||
## Bert specific outputs
|
||||
|
||||
[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput
|
||||
|
||||
[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput
|
||||
|
||||
## BertModel
|
||||
|
||||
[[autodoc]] BertModel
|
||||
@ -153,7 +128,3 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
|
||||
|
||||
[[autodoc]] BertForQuestionAnswering
|
||||
- forward
|
||||
|
||||
## Bert specific outputs
|
||||
|
||||
[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput
|
||||
|
@ -13,25 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-05-20 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2020-05-20 and added to Hugging Face Transformers on 2020-11-16 and contributed by [dqnguyen](https://huggingface.co/dqnguyen).*
|
||||
|
||||
# BERTweet
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
|
||||
## BERTweet
|
||||
|
||||
[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it's pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
|
||||
|
||||
You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the [BERT](./bert) docs for more examples of how to apply BERTweet to different language tasks.
|
||||
|
||||
The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BERTweet](https://huggingface.co/papers/2005.10200) is a large-scale pre-trained language model for English Tweets, sharing the architecture of BERT-base and trained using the RoBERTa pre-training procedure. It surpasses strong baselines like RoBERTa-base and XLM-R-base, achieving superior results in part-of-speech tagging, named-entity recognition, and text classification tasks.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -40,58 +26,37 @@ The example below demonstrates how to predict the `<mask>` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="vinai/bertweet-base",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Plants create <mask> through a process known as photosynthesis.")
|
||||
pipeline = pipeline(task="text-classification", model="vinai/bertweet-base", dtype="auto")
|
||||
result = pipeline("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:")
|
||||
print(f"Label: {result[0]['label']}, Score: {result[0]['score']}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"vinai/bertweet-base",
|
||||
)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"vinai/bertweet-base",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create <mask> through a process known as photosynthesis." | transformers run --task fill-mask --model vinai/bertweet-base --device 0
|
||||
inputs = tokenizer("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
predicted_class_id = outputs.logits.argmax(dim=-1).item()
|
||||
label = model.config.id2label[predicted_class_id]
|
||||
print(f"Predicted label: {label}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
|
||||
- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.
|
||||
- Use [`AutoTokenizer`] or [`BertweetTokenizer`]. They come preloaded with custom vocabulary for tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Install the [emoji](https://pypi.org/project/emoji/) library too (see the sketch after this list).
|
||||
- Pad inputs on the right (`padding="max_length"`). BERT uses absolute position embeddings.
|
||||
|
||||
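A sketch of the tweet-specific preprocessing described above. It assumes the tokenizer's `normalization=True` option and the emoji package; treat the exact output tokens as illustrative.

```py
# pip install emoji
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
# user mentions and URLs are normalized to @USER and HTTPURL before tokenization
print(tokenizer.tokenize("SC has first two presumptive cases of coronavirus, @DHEC confirms https://postandcourier.com 😢"))
```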
## BertweetTokenizer
|
||||
|
||||
[[autodoc]] BertweetTokenizer
|
||||
|
||||
|
@ -13,24 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-03-30.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white" >
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-03-30 and contributed by [vasudevgupta](https://huggingface.co/vasudevgupta).*
|
||||
|
||||
# BigBird
|
||||
|
||||
[BigBird](https://huggingface.co/papers/2007.14062) is a transformer model built to handle sequence lengths up to 4096 compared to 512 for [BERT](./bert). Traditional transformers struggle with long inputs because attention gets really expensive as the sequence length grows. BigBird fixes this by using a sparse attention mechanism, which means it doesn’t try to look at everything at once. Instead, it mixes in local attention, random attention, and a few global tokens to process the whole input. This combination gives it the best of both worlds. It keeps the computation efficient while still capturing enough of the sequence to understand it well. Because of this, BigBird is great at tasks involving long documents, like question answering, summarization, and genomic applications.
|
||||
|
||||
You can find all the original BigBird checkpoints under the [Google](https://huggingface.co/google?search_models=bigbird) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the BigBird models in the right sidebar for more examples of how to apply BigBird to different language tasks.
|
||||
|
||||
The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BigBird: Transformers for Longer Sequences](https://huggingface.co/papers/2007.14062) introduces a sparse-attention mechanism that reduces the quadratic dependency on sequence length to linear, enabling handling of much longer sequences compared to models like BERT. BigBird combines sparse, global, and random attention to approximate full attention efficiently. This allows it to process sequences up to 8 times longer on similar hardware, improving performance on long document NLP tasks such as question answering and summarization. Additionally, the model supports novel applications in genomics.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -39,12 +26,7 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="google/bigbird-roberta-base",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline = pipeline(task="fill-mask", model="google/bigbird-roberta-base", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
@ -55,47 +37,26 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/bigbird-roberta-base",
|
||||
)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"google/bigbird-roberta-base",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
)
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForMaskedLM.from_pretrained("google/bigbird-roberta-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
!echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model google/bigbird-roberta-base --device 0
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- Inputs should be padded on the right because BigBird uses absolute position embeddings.
|
||||
- BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
|
||||
- The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.
|
||||
- The sequence length must be divisible by the block size.
|
||||
|
||||
## Resources
|
||||
|
||||
- Read the [BigBird](https://huggingface.co/blog/big-bird) blog post for more details about how its attention works.
|
||||
- Pad inputs on the right. BigBird uses absolute position embeddings.
|
||||
- BigBird supports `original_full` and `block_sparse` attention. Use `original_full` for sequences under 1024 tokens since sparse patterns don't help much with smaller inputs (see the sketch after this list).
|
||||
- Current implementation uses 3-block window size and 2 global blocks. It only supports ITC-implementation and doesn't support `num_random_blocks=0`.
|
||||
- Sequence length must be divisible by the block size.
|
||||
|
||||
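A sketch of switching between the two attention modes; `attention_type`, `block_size`, and `num_random_blocks` are configuration arguments of the BigBird models.

```py
from transformers import BigBirdModel

# full attention, the better choice for sequences under 1024 tokens
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", attention_type="original_full")

# block sparse attention for long inputs; the sequence length must be divisible by block_size
model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base", attention_type="block_sparse", block_size=64, num_random_blocks=3
)
```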
## BigBirdConfig
|
||||
|
||||
@ -156,3 +117,4 @@ print(f"The predicted token is: {predicted_token}")
|
||||
|
||||
[[autodoc]] BigBirdForQuestionAnswering
|
||||
- forward
|
||||
|
||||
|
@ -13,26 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-05-07.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-05-07 and contributed by [vasudevgupta](https://huggingface.co/vasudevgupta).*
|
||||
|
||||
# BigBirdPegasus
|
||||
|
||||
[BigBirdPegasus](https://huggingface.co/papers/2007.14062) is an encoder-decoder (sequence-to-sequence) transformer model for long-input summarization. It extends the [BigBird](./big_bird) architecture with an additional pretraining objective borrowed from [Pegasus](./pegasus) called gap sequence generation (GSG). Whole sentences are masked and the model has to fill in the gaps in the document. BigBirdPegasus's ability to keep track of long contexts makes it effective at summarizing lengthy inputs, surpassing the performance of base Pegasus models.
|
||||
|
||||
You can find all the original BigBirdPegasus checkpoints under the [Google](https://huggingface.co/google/models?search=bigbird-pegasus) organization.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta).
|
||||
>
|
||||
> Click on the BigBirdPegasus models in the right sidebar for more examples of how to apply BigBirdPegasus to different language tasks.
|
||||
|
||||
The example below demonstrates how to summarize text with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BigBird: Transformers for Longer Sequences](https://huggingface.co/papers/2007.14062) introduces a sparse-attention mechanism that reduces the quadratic dependency on sequence length to linear, enabling handling of much longer sequences compared to models like BERT. BigBird combines sparse, global, and random attention to approximate full attention efficiently. This allows it to process sequences up to 8 times longer on similar hardware, improving performance on long document NLP tasks such as question answering and summarization. The model is also a universal approximator of sequence functions and Turing complete, preserving the capabilities of full attention models. Additionally, BigBird explores applications in genomics data.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,16 +26,8 @@ The example below demonstrates how to summarize text with [`Pipeline`], [`AutoMo
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="summarization",
|
||||
model="google/bigbird-pegasus-large-arxiv",
|
||||
dtype=torch.float32,
|
||||
device=0
|
||||
)
|
||||
pipeline("""Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
|
||||
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
|
||||
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle.""")
|
||||
pipeline = pipeline(task="summarization", model="google/bigbird-pegasus-large-arxiv", dtype="auto")
|
||||
pipeline("Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet. Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems. These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -58,82 +35,31 @@ This energy reserve allows them to grow, develop leaves, produce flowers, bear f
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("google/bigbird-pegasus-large-arxiv", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
|
||||
|
||||
input_text = """Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
text="""
|
||||
Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
|
||||
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
|
||||
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers">
|
||||
|
||||
```bash
|
||||
echo -e "Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet. Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts." | transformers run --task summarization --model google/bigbird-pegasus-large-arxiv --device 0
|
||||
"""
|
||||
inputs = tokenizer(text, return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import BitsAndBytesConfig, AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_compute_dtype=torch.bfloat16,
|
||||
bnb_4bit_quant_type="nf4"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv"
|
||||
)
|
||||
|
||||
input_text = """Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
|
||||
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
|
||||
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- BigBirdPegasus also uses the [`PegasusTokenizer`].
|
||||
- Inputs should be padded on the right because BigBird uses absolute position embeddings.
|
||||
- BigBirdPegasus supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
|
||||
- The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.
|
||||
- The sequence length must be divisible by the block size.
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Understanding BigBird's Block Sparse Attention](https://huggingface.co/blog/big-bird) blog post for more details about how BigBird's attention works.
|
||||
- BigBirdPegasus uses [`PegasusTokenizer`] (see the sketch after this list).
|
||||
- Pad inputs on the right. BigBird uses absolute position embeddings.
|
||||
- BigBirdPegasus supports `original_full` and `block_sparse` attention. Use `original_full` for sequences under 1024 tokens since sparse patterns don't help much with smaller inputs.
|
||||
- Current implementation uses 3-block window size and 2 global blocks. It only supports ITC-implementation and doesn't support `num_random_blocks=0`.
|
||||
- Sequence length must be divisible by the block size.
|
||||
|
||||
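A short sketch tying the notes above together. It assumes `attention_type` is exposed through the config the same way as in BigBird and that [`AutoTokenizer`] resolves to a Pegasus tokenizer class for this checkpoint.

```py
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
print(type(tokenizer).__name__)  # a Pegasus tokenizer class

# fall back to full attention for short inputs
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/bigbird-pegasus-large-arxiv", attention_type="original_full"
)
```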
## BigBirdPegasusConfig
|
||||
|
||||
@ -164,3 +90,4 @@ Read the [Understanding BigBird's Block Sparse Attention](https://huggingface.co
|
||||
|
||||
[[autodoc]] BigBirdPegasusForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -13,26 +13,17 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-10-19 and added to Hugging Face Transformers on 2022-12-05.*
|
||||
*This model was released on 2022-10-19 and added to Hugging Face Transformers on 2022-12-05 and contributed by [kamalkraj](https://huggingface.co/kamalkraj).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# BioGPT
|
||||
|
||||
[BioGPT](https://huggingface.co/papers/2210.10341) is a generative Transformer model based on [GPT-2](./gpt2) and pretrained on 15 million PubMed abstracts. It is designed for biomedical language tasks.
|
||||
|
||||
You can find all the original BioGPT checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=biogpt) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the BioGPT models in the right sidebar for more examples of how to apply BioGPT to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate biomedical text with [`Pipeline`], [`AutoModel`], and also from the command line.
|
||||
[BioGPT](https://huggingface.co/papers/2210.10341) is a domain-specific generative Transformer language model designed for biomedical text generation and mining. Trained on 15M PubMed abstracts, BioGPT excels in various biomedical NLP tasks, outperforming previous models. It achieves notable F1 scores of 44.98%, 38.42%, and 40.76% on BC5CDR, KD-DTI, and DDI end-to-end relation extraction tasks, respectively, and sets a new record with 78.2% accuracy on PubMedQA. Additionally, BioGPT demonstrates superior text generation capabilities, producing fluent descriptions for biomedical terms.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,14 +32,8 @@ The example below demonstrates how to generate biomedical text with [`Pipeline`]
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
generator = pipeline(
|
||||
task="text-generation",
|
||||
model="microsoft/biogpt",
|
||||
dtype=torch.float16,
|
||||
device=0,
|
||||
)
|
||||
result = generator("Ibuprofen is best used for", truncation=True, max_length=50, do_sample=True)[0]["generated_text"]
|
||||
print(result)
|
||||
pipeline = pipeline(task="text-generation", model="microsoft/biogpt", dtype="auto")
|
||||
pipeline("Ibuprofen is best used for ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -58,77 +43,21 @@ print(result)
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"microsoft/biogpt",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
input_text = "Ibuprofen is best used for"
|
||||
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
|
||||
with torch.no_grad():
|
||||
generated_ids = model.generate(**inputs, max_length=50)
|
||||
|
||||
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
|
||||
print(output)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Ibuprofen is best used for" | transformers run --task text-generation --model microsoft/biogpt --device 0
|
||||
inputs = tokenizer("Ibuprofen is best used for ", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bit precision.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
|
||||
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.bfloat16,
|
||||
bnb_4bit_use_double_quant=True
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"microsoft/BioGPT-Large",
|
||||
quantization_config=bnb_config,
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto"
|
||||
)
|
||||
|
||||
input_text = "Ibuprofen is best used for"
|
||||
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
with torch.no_grad():
|
||||
generated_ids = model.generate(**inputs, max_length=50)
|
||||
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
|
||||
print(output)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Pad inputs on the right because BioGPT uses absolute position embeddings.
|
||||
- BioGPT can reuse previously computed key-value attention pairs. Access this feature with the [past_key_values](https://huggingface.co/docs/transformers/main/en/model_doc/biogpt#transformers.BioGptModel.forward.past_key_values) parameter in [`BioGPTModel.forward`].
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"microsoft/biogpt",
|
||||
attn_implementation="eager"
|
||||
)
|
||||
- Pad inputs on the right. BioGPT uses absolute position embeddings.
|
||||
- BioGPT reuses previously computed key-value attention pairs. Access this feature with the `past_key_values` parameter in [`BioGptModel.forward`] (see the sketch after this list).
|
||||
|
||||
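A sketch of the cache reuse mentioned above: the first forward pass returns `past_key_values`, and a later pass consumes it so earlier tokens aren't re-attended from scratch.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

inputs = tokenizer("Ibuprofen is best used for", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, use_cache=True)
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    # feed only the new token plus the cache instead of the whole sequence
    out = model(input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)
```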
## BioGptConfig
|
||||
|
||||
@ -148,7 +77,7 @@ print(output)
|
||||
|
||||
[[autodoc]] BioGptForCausalLM
|
||||
- forward
|
||||
|
||||
|
||||
## BioGptForTokenClassification
|
||||
|
||||
[[autodoc]] BioGptForTokenClassification
|
||||
@ -158,3 +87,4 @@ print(output)
|
||||
|
||||
[[autodoc]] BioGptForSequenceClassification
|
||||
- forward
|
||||
|
||||
|
@ -13,43 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-12-24 and added to Hugging Face Transformers on 2022-12-07.*
|
||||
*This model was released on 2019-12-24 and added to Hugging Face Transformers on 2022-12-07 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# Big Transfer (BiT)
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Big Transfer (BiT): General Visual Representation Learning](https://huggingface.co/papers/1912.11370) proposes a method for scaling up pre-training of ResNetv2 architectures. This approach, called Big Transfer (BiT), combines specific components and uses a simple heuristic for transfer learning, achieving strong performance across over 20 datasets. BiT demonstrates robustness across various data regimes, from 1 example per class to 1M total examples. It achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT reaches 76.8% on ILSVRC-2012 with 10 examples per class and 97.0% on CIFAR-10 with 10 examples per class. The paper includes a detailed analysis of the key components contributing to high transfer performance.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The BiT model was proposed in [Big Transfer (BiT): General Visual Representation Learning](https://huggingface.co/papers/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby.
|
||||
BiT is a simple recipe for scaling up pre-training of [ResNet](resnet)-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="image-classification", model="google/bit-50", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
*Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/google-research/big_transfer).
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
|
||||
## Usage tips
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
- BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494),
|
||||
image_processor = AutoImageProcessor.from_pretrained("google/bit-50")
|
||||
model = AutoModelForImageClassification.from_pretrained("google/bit-50", dtype="auto")
|
||||
|
||||
2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant
|
||||
impact on transfer learning (an illustrative sketch of both tweaks appears after the examples below).
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
## Resources
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BiT.
|
||||
predicted_label = logits.argmax(-1).item()
|
||||
print(model.config.id2label[predicted_label])
|
||||
```
|
||||
|
||||
<PipelineTag pipeline="image-classification"/>
|
||||
|
||||
- [`BitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
- See also: [Image classification task guide](../tasks/image_classification)
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
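The usage tips above mention two tweaks to ResNetv2: group normalization instead of batch normalization, and weight standardization in the convolutions. Here is an illustrative PyTorch sketch of both, not the exact BiT implementation.

```py
import torch
from torch import nn
from torch.nn import functional as F

class WSConv2d(nn.Conv2d):
    """Conv layer whose weights are standardized before every forward pass."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-6
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride, self.padding, self.dilation, self.groups)

block = nn.Sequential(
    WSConv2d(3, 8, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=4, num_channels=8),
    nn.ReLU(),
)
print(block(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 8, 32, 32])
```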
## BitConfig
|
||||
|
||||
@ -74,3 +80,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] BitForImageClassification
|
||||
- forward
|
||||
|
||||
|
@ -17,6 +17,14 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# BitNet
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="text-generation", model="microsoft/BitNet-b1.58-3B", dtype="auto")
|
||||
pipeline("The future of artificial intelligence is")
|
||||
```
|
||||
|
||||
## Overview
|
||||
|
||||
Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
|
||||
@ -38,22 +46,22 @@ Several versions of the model weights are available on Hugging Face:
|
||||
### Model Details
|
||||
|
||||
* **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
|
||||
* Uses Rotary Position Embeddings (RoPE).
|
||||
* Uses squared ReLU (ReLU²) activation in FFN layers.
|
||||
* Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
|
||||
* No bias terms in linear or normalization layers.
|
||||
* **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8). A sketch of both rules follows this list.
|
||||
* Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
|
||||
* Activations are quantized to 8-bit integers using absmax quantization (per-token).
|
||||
* **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
|
||||
* **Parameters:** ~2 Billion
|
||||
* **Training Tokens:** 4 Trillion
|
||||
* **Context Length:** Maximum sequence length of **4096 tokens**.
|
||||
* *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
|
||||
* **Training Stages:**
|
||||
1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
|
||||
2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
|
||||
3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
|
||||
* **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).
|
||||
|
||||
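An illustrative sketch of the two quantization rules listed above (absmean ternary weights, per-token absmax int8 activations). This is fake quantization for intuition, not the actual BitNet training kernel.

```py
import torch

def absmean_ternary(weight: torch.Tensor) -> torch.Tensor:
    # scale by the mean absolute value, then round and clip to {-1, 0, +1}
    scale = weight.abs().mean().clamp(min=1e-5)
    return (weight / scale).round().clamp(-1, 1) * scale

def absmax_int8(activations: torch.Tensor) -> torch.Tensor:
    # per-token scaling so every row uses the full int8 range
    scale = activations.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127
    return (activations / scale).round().clamp(-128, 127) * scale

w, x = torch.randn(4, 4), torch.randn(2, 4)
print(absmean_ternary(w))
print(absmax_int8(x))
```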
## Usage tips
|
||||
|
@ -13,53 +13,44 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-04-28 and added to Hugging Face Transformers on 2021-01-05.*
|
||||
*This model was released on 2020-04-28 and added to Hugging Face Transformers on 2021-01-05 and contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).*
|
||||
|
||||
# Blenderbot Small
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Blender](https://huggingface.co/papers/2004.13637) focuses on building open-domain chatbots by emphasizing the importance of various conversational skills beyond just scaling model parameters and data size. The model variants include 90M, 2.7B, and 9.4B parameters, demonstrating that with the right training data and generation strategies, large-scale models can learn to provide engaging talking points, listen, display knowledge, empathy, and personality, while maintaining a consistent persona. Human evaluations indicate that the best models outperform existing approaches in terms of engagingness and humanness in multi-turn dialogues. The paper also analyzes failure cases to highlight the limitations of the work.
|
||||
|
||||
Note that [`BlenderbotSmallModel`] and
|
||||
[`BlenderbotSmallForConditionalGeneration`] are only used in combination with the checkpoint
|
||||
[facebook/blenderbot-90M](https://huggingface.co/facebook/blenderbot-90M). Larger Blenderbot checkpoints should
|
||||
instead be used with [`BlenderbotModel`] and
|
||||
[`BlenderbotForConditionalGeneration`].
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
## Overview
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://huggingface.co/papers/2004.13637) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
|
||||
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
|
||||
pipeline = pipeline(task="text-generation", model="facebook/blenderbot_small-90M", dtype="auto")
|
||||
pipeline("Plants create energy through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
The abstract of the paper is the following:
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
|
||||
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
|
||||
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
|
||||
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
|
||||
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
|
||||
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
|
||||
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
|
||||
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
|
||||
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
|
||||
failure cases of our models.*
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The authors' code can be
|
||||
found [here](https://github.com/facebookresearch/ParlAI).
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot_small-90M", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot_small-90M")
|
||||
|
||||
inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model.generate(**inputs)
|
||||
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Usage tips
|
||||
|
||||
Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
||||
the left.
|
||||
|
||||
## Resources
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
- [Translation task guide](../tasks/translation)
|
||||
- [Summarization task guide](../tasks/summarization)
|
||||
- Pad inputs on the right. Blenderbot Small uses absolute position embeddings.
|
||||
|
||||
## BlenderbotSmallConfig
|
||||
|
||||
@ -91,3 +82,4 @@ the left.
|
||||
|
||||
[[autodoc]] BlenderbotSmallForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -13,69 +13,46 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-04-28 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2020-04-28 and added to Hugging Face Transformers on 2020-11-16 and contributed by [sshleifer](https://huggingface.co/sshleifer).*
|
||||
|
||||
# Blenderbot
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Blender](https://huggingface.co/papers/2004.13637) focuses on building open-domain chatbots by emphasizing the importance of various conversational skills beyond just scaling model parameters and data size. The model variants include 90M, 2.7B, and 9.4B parameters, demonstrating that with the right training data and generation strategies, large-scale models can learn to provide engaging talking points, listen, display knowledge, empathy, and personality, while maintaining a consistent persona. Human evaluations indicate that the best models outperform existing approaches in terms of engagingness and humanness in multi-turn dialogues. The paper also analyzes failure cases to highlight the limitations of the work.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://huggingface.co/papers/2004.13637) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
|
||||
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract of the paper is the following:
|
||||
|
||||
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
|
||||
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
|
||||
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
|
||||
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
|
||||
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
|
||||
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
|
||||
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
|
||||
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
|
||||
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
|
||||
failure cases of our models.*
|
||||
|
||||
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/facebookresearch/ParlAI).
|
||||
|
||||
## Usage tips and example
|
||||
|
||||
Blenderbot is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather than the left.
|
||||
|
||||
An example:
|
||||
|
||||
```python
|
||||
>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
|
||||
|
||||
>>> mname = "facebook/blenderbot-400M-distill"
|
||||
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
|
||||
>>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
|
||||
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
|
||||
>>> inputs = tokenizer([UTTERANCE], return_tensors="pt")
|
||||
>>> reply_ids = model.generate(**inputs)
|
||||
>>> print(tokenizer.batch_decode(reply_ids))
|
||||
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
|
||||
pipeline = pipeline(task="text-generation", model="facebook/blenderbot-400M-distill", dtype="auto")
|
||||
pipeline("Plants create energy through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
## Implementation Notes
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
- Blenderbot uses a standard [seq2seq model transformer](https://huggingface.co/papers/1706.03762) based architecture.
|
||||
- Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot).
|
||||
- This is the *default* Blenderbot model class. However, some smaller checkpoints, such as `facebook/blenderbot_small_90M`, have a different architecture and should be used with [BlenderbotSmall](blenderbot-small); see the sketch below.
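
A minimal sketch of that split, assuming `facebook/blenderbot_small-90M` as the Hub name of the 90M checkpoint:

```py
from transformers import BlenderbotForConditionalGeneration, BlenderbotSmallForConditionalGeneration

# Larger checkpoints load with the default Blenderbot classes.
model = BlenderbotForConditionalGeneration.from_pretrained("facebook/blenderbot-400M-distill")

# The 90M checkpoint has a different architecture and needs the BlenderbotSmall classes.
small_model = BlenderbotSmallForConditionalGeneration.from_pretrained("facebook/blenderbot_small-90M")
```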
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
## Resources
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
- [Translation task guide](../tasks/translation)
|
||||
- [Summarization task guide](../tasks/summarization)
|
||||
inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model.generate(**inputs)
|
||||
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Pad inputs on the right. Blenderbot uses absolute position embeddings.
|
||||
- Blenderbot uses a standard seq2seq transformer architecture.
|
||||
- This is the default Blenderbot model class. Smaller checkpoints like `facebook/blenderbot_small_90M` have different architectures and need [`BlenderbotSmall`].
|
||||
|
||||
## BlenderbotConfig
|
||||
|
||||
@ -109,3 +86,4 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an
|
||||
|
||||
[[autodoc]] BlenderbotForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -13,49 +13,48 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2023-01-30 and added to Hugging Face Transformers on 2023-02-09.*
|
||||
*This model was released on 2023-01-30 and added to Hugging Face Transformers on 2023-02-09 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# BLIP-2
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[BLIP-2](https://huggingface.co/papers/2301.12597) bootstraps vision-language pre-training using frozen image encoders and large language models. It employs a lightweight, 12-layer Transformer encoder to bridge the modality gap, achieving state-of-the-art results on various vision-language tasks. Specifically, BLIP-2 surpasses Flamingo by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. The model also demonstrates strong zero-shot image-to-text generation capabilities following natural language instructions.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The BLIP-2 model was proposed in [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://huggingface.co/papers/2301.12597) by
|
||||
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer
|
||||
encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon [Flamingo](https://huggingface.co/papers/2204.14198), an 80 billion parameter model, by 8.7%
|
||||
on zero-shot VQAv2 with 54x fewer trainable parameters.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="visual-question-answering", model="Salesforce/blip2-opt-2.7b", dtype="auto")
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
pipeline(question="What is shown in this image?", image=url)
|
||||
```
|
||||
|
||||
*The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/blip2_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
```py
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
|
||||
|
||||
<small> BLIP-2 architecture. Taken from the <a href="https://huggingface.co/papers/2301.12597">original paper.</a> </small>
|
||||
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
|
||||
model = AutoModelForVisualQuestionAnswering.from_pretrained("Salesforce/blip2-opt-2.7b", dtype="auto")
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/salesforce/LAVIS/tree/5ee63d688ba4cebff63acee04adaef2dee9af207).
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
## Usage tips
|
||||
question = "Question: What is shown in this image? Answer:"
|
||||
inputs = processor(images=image, text=question, return_tensors="pt")
|
||||
|
||||
- BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, it's recommended to use the [`generate`] method.
|
||||
- Use [`Blip2Processor`] to prepare images for the model and to decode the predicted token IDs back to text.
|
||||
output = model.generate(**inputs)
|
||||
print(processor.batch_decode(output, skip_special_tokens=True)[0])
|
||||
```
|
||||
|
||||
> [!NOTE]
|
||||
> BLIP models released after v4.46 raise warnings about adding `processor.num_query_tokens = {{num_query_tokens}}` and expanding the model embeddings layer to add the special `<image>` token. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if you don't. Adding these attributes means BLIP adds the number of query tokens required per image and expands the text with as many `<image>` placeholders as there are query tokens. This is usually around 500 tokens per image, so make sure the text is not truncated, otherwise merging the embeddings will fail.
> The attributes can be obtained from the model config as `model.config.num_query_tokens`, and the model embeddings can be expanded by following [this link](https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042).
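
The snippet below is a hedged sketch of that processor update. It reads the value from the model config; pushing the updated processor back to the Hub is a separate step, and the repository name is a placeholder.

```py
from transformers import Blip2ForConditionalGeneration, Blip2Processor

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

# Copy the query token count from the model config onto the processor.
processor.num_query_tokens = model.config.num_query_tokens

# If you own the checkpoint, push the updated processor (placeholder repository name).
# processor.push_to_hub("your-username/blip2-opt-2.7b")
```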
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLIP-2.
|
||||
|
||||
- Demo notebooks for BLIP-2 for image captioning, visual question answering (VQA) and chat-like conversations can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BLIP-2).
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Blip2Config
|
||||
|
||||
@ -109,3 +108,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
## Blip2VisionModelWithProjection
|
||||
|
||||
[[autodoc]] Blip2VisionModelWithProjection
|
||||
|
||||
|
@ -13,77 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-01-28 and added to Hugging Face Transformers on 2022-12-21.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2022-01-28 and added to Hugging Face Transformers on 2022-12-21 and contributed by [ybelkada](https://huggingface.co/ybelkada).*
|
||||
|
||||
# BLIP
|
||||
|
||||
[BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for *both* understanding and generation tasks. Most existing pretrained models are only good at one or the other. It uses a captioner to generate captions and a filter to remove the noisy ones, which increases training data quality and makes better use of messy web data.
|
||||
|
||||
You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
|
||||
>
|
||||
> Click on the BLIP models in the right sidebar for more examples of how to apply BLIP to different vision language tasks.
|
||||
|
||||
The example below demonstrates how to perform visual question answering with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://huggingface.co/papers/2201.12086) proposes a new VLP framework that excels in both vision-language understanding and generation tasks. BLIP enhances the use of noisy web data through a bootstrapping process involving synthetic caption generation and noise filtering. This approach leads to state-of-the-art results in image-text retrieval, image captioning, and visual question answering, with notable improvements in recall@1, CIDEr, and VQA scores. Additionally, BLIP demonstrates strong generalization to video-language tasks in a zero-shot setting.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="visual-question-answering",
|
||||
model="Salesforce/blip-vqa-base",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline = pipeline(task="visual-question-answering", model="Salesforce/blip-vqa-base", dtype="auto")
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
pipeline(question="What is the weather in this image?", image=url)
|
||||
pipeline(question="What is shown in this image?", image=url)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
|
||||
|
||||
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
|
||||
model = AutoModelForVisualQuestionAnswering.from_pretrained(
|
||||
"Salesforce/blip-vqa-base",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
model = AutoModelForVisualQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base", dtype="auto")
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
question = "What is the weather in this image?"
|
||||
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)
|
||||
question = "What is shown in this image?"
|
||||
inputs = processor(images=image, text=question, return_tensors="pt")
|
||||
|
||||
output = model.generate(**inputs)
|
||||
processor.batch_decode(output, skip_special_tokens=True)[0]
|
||||
print(processor.batch_decode(output, skip_special_tokens=True)[0])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Resources
|
||||
|
||||
Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) to learn how to fine-tune BLIP for image captioning on a custom dataset.
|
||||
|
||||
## BlipConfig
|
||||
|
||||
[[autodoc]] BlipConfig
|
||||
@ -124,11 +96,6 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam
|
||||
[[autodoc]] BlipTextModel
|
||||
- forward
|
||||
|
||||
## BlipTextLMHeadModel
|
||||
|
||||
[[autodoc]] BlipTextLMHeadModel
|
||||
- forward
|
||||
|
||||
## BlipVisionModel
|
||||
|
||||
[[autodoc]] BlipVisionModel
|
||||
@ -148,3 +115,9 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam
|
||||
|
||||
[[autodoc]] BlipForQuestionAnswering
|
||||
- forward
|
||||
|
||||
## BlipTextLMHeadModel
|
||||
|
||||
[[autodoc]] BlipTextLMHeadModel
|
||||
- forward
|
||||
|
||||
|
@ -17,46 +17,36 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# BLOOM
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[BLOOM](https://huggingface.co/papers/2211.05100) is a 176-billion parameter open-access large language model built collaboratively by hundreds of researchers to promote wider accessibility of LLM technology. It is a decoder-only Transformer trained on the ROOTS corpus, which includes text from hundreds of sources across 46 natural and 13 programming languages. BLOOM demonstrates competitive performance across diverse benchmarks, with further gains achieved through multitask prompted finetuning. The model and code are publicly released under the Responsible AI License to support open research and applications.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The [BLOOM](https://huggingface.co/papers/2211.05100) model has been proposed with its various versions through the [BigScience Workshop](https://bigscience.huggingface.co/). BigScience is inspired by other open science initiatives where researchers have pooled their time and resources to collectively achieve a higher impact.
|
||||
The architecture of BLOOM is essentially similar to GPT3 (auto-regressive model for next token prediction), but has been trained on 46 different languages and 13 programming languages.
|
||||
Several smaller versions of the models have been trained on the same dataset. BLOOM is available in the following versions:
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
- [bloom-560m](https://huggingface.co/bigscience/bloom-560m)
|
||||
- [bloom-1b1](https://huggingface.co/bigscience/bloom-1b1)
|
||||
- [bloom-1b7](https://huggingface.co/bigscience/bloom-1b7)
|
||||
- [bloom-3b](https://huggingface.co/bigscience/bloom-3b)
|
||||
- [bloom-7b1](https://huggingface.co/bigscience/bloom-7b1)
|
||||
- [bloom](https://huggingface.co/bigscience/bloom) (176B parameters)
|
||||
pipeline = pipeline(task="text-generation", model="bigscience/bloom-560m", dtype="auto")
|
||||
pipeline("Plants create energy through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
## Resources
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLOOM. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
<PipelineTag pipeline="text-generation"/>
|
||||
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
|
||||
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
|
||||
|
||||
- [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
|
||||
inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
See also:
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
- [Question answering task guide](../tasks/question_answering)
|
||||
|
||||
⚡️ Inference
|
||||
|
||||
- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
|
||||
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).
|
||||
|
||||
⚙️ Training
|
||||
|
||||
- A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BloomConfig
|
||||
|
||||
@ -92,3 +82,4 @@ See also:
|
||||
|
||||
[[autodoc]] BloomForQuestionAnswering
|
||||
- forward
|
||||
|
||||
|
@ -13,13 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2025-09-19.*
|
||||
|
||||
*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2025-10-07 and contributed by [itazap](https://huggingface.co/itazap).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC
|
||||
">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -27,62 +25,36 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Byte Latent Transformer (BLT)
|
||||
|
||||
## Overview
|
||||
[Byte Latent Transformer](https://huggingface.co/papers/2412.09871) is a byte-level LLM architecture that matches tokenization-based LLM performance at scale. It encodes bytes into dynamically sized patches based on entropy, optimizing compute and model capacity where data complexity is higher. This approach improves inference efficiency and robustness, with the first flop-controlled scaling study up to 8B parameters and 4T training bytes. BLT demonstrates better scaling than tokenization-based models by dynamically selecting long patches for predictable data, enhancing reasoning and long-tail generalization.
|
||||
|
||||
The BLT model was proposed in [Byte Latent Transformer: Patches Scale Better Than Tokens](https://huggingface.co/papers/2412.09871) by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer.
|
||||
BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference
|
||||
efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating
|
||||
more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.*
|
||||
|
||||
## Usage Tips
|
||||
|
||||
- **Dual Model Architecture**: BLT consists of two separate trained models:
|
||||
- **Patcher (Entropy Model)**: A smaller transformer model that predicts byte-level entropy to determine patch boundaries and segment input.
|
||||
- **Main Transformer Model**: The primary model that processes the patches through a Local Encoder, Global Transformer, and Local Decoder.
|
||||
|
||||
- **Dynamic Patching**: The model uses entropy-based dynamic patching where:
|
||||
- High-entropy regions (complex data) get shorter patches with more computational attention
|
||||
- Low-entropy regions (predictable data) get longer patches for efficiency
|
||||
- This allows the model to allocate compute resources where they're most needed
|
||||
|
||||
- **Local Encoder**: Processes byte sequences with cross-attention to patch embeddings
|
||||
- **Global Transformer**: Processes patch-level representations with full attention across patches
|
||||
- **Local Decoder**: Generates output with cross-attention back to the original byte sequence
|
||||
|
||||
- **Byte-Level Tokenizer**: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID. There is no need for a vocabulary; see the byte-conversion sketch below.
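
The sketch below illustrates the byte conversion in plain Python. It shows the UTF-8 mapping only, not BLT's exact ID assignment, which may reserve extra IDs for special tokens.

```py
# Text maps directly to UTF-8 bytes, so no vocabulary file is needed.
text = "café"
byte_values = list(text.encode("utf-8"))
print(byte_values)  # [99, 97, 102, 195, 169] - four characters become five bytes
```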
|
||||
|
||||
The model can be loaded via:
|
||||
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
from transformers import pipeline
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"itazap/blt-1b-hf",
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
prompt = "my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a fixed number of new tokens.
generated_ids = model.generate(
    **inputs, max_new_tokens=50, do_sample=False, use_cache=False
|
||||
)
|
||||
|
||||
print(tokenizer.decode(generated_ids[0]))
|
||||
pipeline = pipeline(task="text-generation", model="itazap/blt-1b-hf", dtype="auto")
|
||||
pipeline("Plants generate energy through a process known as ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
This model was contributed by [itazap](https://huggingface.co/itazap).
The original code can be found [here](https://github.com/facebookresearch/blt).
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("itazap/blt-1b-hf", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")
|
||||
|
||||
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors='pt', return_token_type_ids=False)
|
||||
outputs = model.generate(**inputs, max_new_tokens=64)
|
||||
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BltConfig
|
||||
|
||||
@ -95,3 +67,4 @@ The original code can be found [here](<https://github.com/facebookresearch/blt>)
|
||||
|
||||
[[autodoc]] BltForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -13,48 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-10-20 and added to Hugging Face Transformers on 2023-06-20.*
|
||||
*This model was released on 2020-10-20 and added to Hugging Face Transformers on 2023-06-20 and contributed by [stefan-it](https://huggingface.co/stefan-it).*
|
||||
|
||||
> [!WARNING]
|
||||
> This model is in maintenance mode only; we do not accept any new PRs changing its code.
>
> If you run into any issues running this model, reinstall the last version that supported it, v4.30.0, by running `pip install -U transformers==4.30.0`.
|
||||
|
||||
# BORT
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[BORT](https://huggingface.co/papers/2010.10499) extracts an optimal subset of architectural parameters from BERT, significantly reducing its size to 5.5% of BERT-large's effective size and 16% of its net size. BORT can be pretrained in 288 GPU hours, which is 1.2% of the time required for RoBERTa-large and 33% of BERT-large. It is 7.9x faster on a CPU and outperforms other compressed and some non-compressed variants, achieving performance improvements of 0.3% to 31% on various NLU benchmarks.
|
||||
|
||||
<Tip warning={true}>
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
This model is in maintenance mode only; we do not accept any new PRs changing its code.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0.
|
||||
You can do so by running the following command: `pip install -U transformers==4.30.0`.
|
||||
pipeline = pipeline(task="fill-mask", model="amazon/bort", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
</Tip>
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
## Overview
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://huggingface.co/papers/2010.10499) by
|
||||
Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for the BERT, which the
|
||||
authors refer to as "Bort".
|
||||
model = AutoModelForMaskedLM.from_pretrained("amazon/bort", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("amazon/bort")
|
||||
|
||||
The abstract from the paper is the following:
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
*We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
|
||||
applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
|
||||
"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
|
||||
original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which
|
||||
is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large
|
||||
(Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same
|
||||
hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the
|
||||
architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
|
||||
absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*
|
||||
|
||||
This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Usage tips
|
||||
|
||||
- BORT's model architecture is based on BERT; refer to [BERT's documentation page](bert) for the model's API reference as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer; refer to [RoBERTa's documentation page](roberta) for the tokenizer's API reference as well as usage examples.
|
||||
- BORT requires a specific fine-tuning algorithm called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology), which isn't open-sourced yet. Implementing it to make BORT fine-tuning work would be very useful for the community.
|
||||
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer. Check RoBERTa's documentation for API reference and usage examples. A minimal sketch of this pairing follows.
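
A hedged sketch of that pairing, assuming the `amazon/bort` checkpoint ships RoBERTa-style tokenizer files:

```py
from transformers import BertModel, RobertaTokenizer

# BORT follows the BERT architecture but pairs it with the RoBERTa tokenizer.
tokenizer = RobertaTokenizer.from_pretrained("amazon/bort")
model = BertModel.from_pretrained("amazon/bort")

inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```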
|
||||
|
@ -13,124 +13,44 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-06-17 and added to Hugging Face Transformers on 2023-01-25.*
|
||||
*This model was released on 2022-06-17 and added to Hugging Face Transformers on 2023-01-25 and contributed by [anahita-b](https://huggingface.co/anahita-b), [Tile](https://huggingface.co/Tile), and [shaoyent](https://huggingface.co/shaoyent).*
|
||||
|
||||
# BridgeTower
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[BridgeTower](https://huggingface.co/papers/2206.08657) introduces bridge layers connecting the top layers of uni-modal encoders to each layer of the cross-modal encoder, enabling effective bottom-up cross-modal alignment and fusion. Pre-trained with only 4M images, BRIDGETOWER achieves state-of-the-art performance on various vision-language tasks, outperforming previous models with similar pre-training data and minimal additional parameters and computational costs. When scaled, it surpasses models trained on much larger datasets.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="BridgeTowerForContrastiveLearning">
|
||||
|
||||
The BridgeTower model was proposed in [BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning](https://huggingface.co/papers/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, and Nan Duan. The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder, achieving remarkable performance on various downstream tasks with almost negligible additional parameters and computational costs.
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, BridgeTowerForContrastiveLearning
|
||||
|
||||
This paper has been accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference.
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
texts = ["An image of a cat walking in the snow", "A football player scoring a goal"]
|
||||
|
||||
The abstract from the paper is the following:
|
||||
processor = AutoProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
|
||||
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc", dtype="auto")
|
||||
|
||||
*Vision-Language (VL) models with the TWO-TOWER architecture have dominated visual-language representation learning in recent years.
|
||||
Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder.
|
||||
Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BRIDGETOWER, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the crossmodal encoder.
|
||||
This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BRIDGETOWER achieves state-of-the-art performance on various downstream vision-language tasks.
|
||||
In particular, on the VQAv2 test-std set, BRIDGETOWER achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs.
|
||||
Notably, when further scaling the model, BRIDGETOWER achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.*
|
||||
scores = dict()
|
||||
for text in texts:
|
||||
# prepare inputs
|
||||
encoding = processor(image, text, return_tensors="pt")
|
||||
outputs = model(**encoding)
|
||||
# Get similarity score by computing cosine similarity
|
||||
score = torch.cosine_similarity(outputs.image_embeds, outputs.text_embeds, dim=1).item()
|
||||
scores[text] = score
|
||||
print(f"Text: '{text}' - Score: {score:.4f}")
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/bridgetower_architecture%20.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> BridgeTower architecture. Taken from the <a href="https://huggingface.co/papers/2206.08657">original paper.</a> </small>
|
||||
|
||||
This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower).
|
||||
|
||||
## Usage tips and examples
|
||||
|
||||
BridgeTower consists of a visual encoder, a textual encoder, and a cross-modal encoder with multiple lightweight bridge layers.
|
||||
The goal of this approach was to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder.
|
||||
In principle, one can apply any visual, textual or cross-modal encoder in the proposed architecture.
|
||||
|
||||
The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImageProcessor`] into a single instance to both
|
||||
encode the text and prepare the images respectively.
|
||||
|
||||
The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`].
|
||||
|
||||
```python
|
||||
>>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
|
||||
>>> import requests
|
||||
>>> from PIL import Image
|
||||
|
||||
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
|
||||
|
||||
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
|
||||
>>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
|
||||
|
||||
>>> # forward pass
|
||||
>>> scores = dict()
|
||||
>>> for text in texts:
|
||||
... # prepare inputs
|
||||
... encoding = processor(image, text, return_tensors="pt")
|
||||
... outputs = model(**encoding)
|
||||
... scores[text] = outputs
|
||||
best_text = max(scores, key=scores.get)
|
||||
print(f"\nBest matching text: '{best_text}' with score: {scores[best_text]:.4f}")
|
||||
```
|
||||
|
||||
The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].
|
||||
|
||||
```python
|
||||
>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
|
||||
>>> import requests
|
||||
>>> from PIL import Image
|
||||
|
||||
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
|
||||
|
||||
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
>>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
|
||||
>>> # forward pass
|
||||
>>> scores = dict()
|
||||
>>> for text in texts:
|
||||
... # prepare inputs
|
||||
... encoding = processor(image, text, return_tensors="pt")
|
||||
... outputs = model(**encoding)
|
||||
... scores[text] = outputs.logits[0, 1].item()
|
||||
```
|
||||
|
||||
The following example shows how to run masked language modeling using [`BridgeTowerProcessor`] and [`BridgeTowerForMaskedLM`].
|
||||
|
||||
```python
|
||||
>>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
|
||||
>>> from PIL import Image
|
||||
>>> import requests
|
||||
|
||||
>>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
|
||||
>>> text = "a <mask> looking out of the window"
|
||||
|
||||
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
>>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
|
||||
>>> # prepare inputs
|
||||
>>> encoding = processor(image, text, return_tensors="pt")
|
||||
|
||||
>>> # forward pass
|
||||
>>> outputs = model(**encoding)
|
||||
|
||||
>>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
|
||||
|
||||
>>> print(results)
|
||||
.a cat looking out of the window.
|
||||
```
|
||||
|
||||
Tips:
|
||||
|
||||
- This implementation of BridgeTower uses [`RobertaTokenizer`] to generate text embeddings and OpenAI's CLIP/ViT model to compute visual embeddings.
|
||||
- Checkpoints for pre-trained [bridgeTower-base](https://huggingface.co/BridgeTower/bridgetower-base) and [bridgetower masked language modeling and image text matching](https://huggingface.co/BridgeTower/bridgetower-base-itm-mlm) are released.
|
||||
- Refer to [Table 5](https://huggingface.co/papers/2206.08657) for BridgeTower's performance on image retrieval and other downstream tasks.
|
||||
- The PyTorch version of this model is only available in torch 1.10 and higher.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BridgeTowerConfig
|
||||
|
||||
@ -178,3 +98,4 @@ Tips:
|
||||
|
||||
[[autodoc]] BridgeTowerForImageAndTextRetrieval
|
||||
- forward
|
||||
|
||||
|
@ -9,83 +9,38 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
*This model was released on 2021-08-10 and added to Hugging Face Transformers on 2023-09-15.*
|
||||
*This model was released on 2021-08-10 and added to Hugging Face Transformers on 2023-09-15 and contributed by [jinho8345](https://huggingface.co/jinho8345).*
|
||||
|
||||
# BROS
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[BROS](https://huggingface.co/papers/2108.04539) is a pre-trained language model designed for key information extraction (KIE) from document images by focusing on the spatial relationships of text rather than visual features. It encodes the relative 2D positions of text elements and uses an area-masking pre-training strategy to learn spatial-textual dependencies from unlabeled documents. Unlike vision-text models, BROS effectively integrates text and layout information alone, achieving competitive or superior results on major KIE benchmarks (FUNSD, SROIE*, CORD, SciTSR). The model also addresses two key challenges in KIE: handling incorrect text order and learning efficiently with limited labeled data.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="BrosForTokenClassification">
|
||||
|
||||
The BROS model was proposed in [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://huggingface.co/papers/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park.
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoProcessor, AutoModelForTokenClassification
|
||||
|
||||
BROS stands for *BERT Relying On Spatiality*. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encodes relative spatial information instead of using absolute spatial information.
|
||||
processor = AutoProcessor.from_pretrained("jinho8345/bros-base-uncased")
|
||||
model = AutoModelForTokenClassification.from_pretrained("jinho8345/bros-base-uncased", dtype="auto")
|
||||
|
||||
It is pre-trained with two objectives: a token-masked language modeling objective (TMLM) used in BERT, and a novel area-masked language modeling objective (AMLM).
|
||||
In TMLM, tokens are randomly masked, and the model predicts the masked tokens using spatial information and other unmasked tokens.
|
||||
AMLM is a 2D version of TMLM. It randomly masks text tokens and predicts with the same information as TMLM, but it masks text blocks (areas).
|
||||
text = "Plants create energy through a process known as photosynthesis."
|
||||
encoding = processor.tokenizer(text, add_special_tokens=False, return_tensors="pt")
|
||||
bbox = torch.tensor([[[0, 0, 1, 1]]]).repeat(1, encoding["input_ids"].shape[-1], 1)
|
||||
encoding["bbox"] = bbox
|
||||
|
||||
`BrosForTokenClassification` has a simple linear layer on top of BrosModel. It predicts the label of each token.
|
||||
`BrosSpadeEEForTokenClassification` has an `initial_token_classifier` and `subsequent_token_classifier` on top of BrosModel. `initial_token_classifier` is used to predict the first token of each entity, and `subsequent_token_classifier` is used to predict the next token within an entity. `BrosSpadeELForTokenClassification` has an `entity_linker` on top of BrosModel. `entity_linker` is used to predict the relation between two entities.
|
||||
outputs = model(**encoding)
|
||||
predictions = torch.argmax(outputs.logits, dim=-1)
|
||||
tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
|
||||
|
||||
`BrosForTokenClassification` and `BrosSpadeEEForTokenClassification` essentially perform the same job. However, `BrosForTokenClassification` assumes input tokens are perfectly serialized (which is a very challenging task since they exist in a 2D space), while `BrosSpadeEEForTokenClassification` allows for more flexibility in handling serialization errors as it predicts next connection tokens from one token.
|
||||
|
||||
`BrosSpadeELForTokenClassification` performs the intra-entity linking task. It predicts the relation from one token (of one entity) to another token (of another entity) if these two entities share some relation.
|
||||
|
||||
BROS achieves comparable or better results on Key Information Extraction (KIE) benchmarks such as FUNSD, SROIE, CORD, and SciTSR, without relying on explicit visual features.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Key information extraction (KIE) from document images requires understanding the contextual and spatial semantics of texts in two-dimensional (2D) space. Many recent studies try to solve the task by developing pre-trained language models focusing on combining visual features from document images with texts and their layout. On the other hand, this paper tackles the problem by going back to the basic: effective combination of text and layout. Specifically, we propose a pre-trained language model, named BROS (BERT Relying On Spatiality), that encodes relative positions of texts in 2D space and learns from unlabeled documents with area-masking strategy. With this optimized training scheme for understanding texts in 2D space, BROS shows comparable or better performance compared to previous methods on four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without relying on visual features. This paper also reveals two real-world challenges in KIE tasks-(1) minimizing the error from incorrect text ordering and (2) efficient learning from fewer downstream examples-and demonstrates the superiority of BROS over previous methods.*
|
||||
|
||||
This model was contributed by [jinho8345](https://huggingface.co/jinho8345). The original code can be found [here](https://github.com/clovaai/bros).
|
||||
|
||||
## Usage tips and examples
|
||||
|
||||
- [`~transformers.BrosModel.forward`] requires `input_ids` and `bbox` (bounding box). Each bounding box should be in (x0, y0, x1, y1) format (top-left corner, bottom-right corner). Obtaining bounding boxes depends on an external OCR system. The `x` coordinates should be normalized by document image width, and the `y` coordinates should be normalized by document image height.
|
||||
|
||||
```python
|
||||
def expand_and_normalize_bbox(bboxes, doc_width, doc_height):
|
||||
# here, bboxes are numpy array
|
||||
|
||||
# Normalize bbox -> 0 ~ 1
|
||||
    bboxes[:, [0, 2]] = bboxes[:, [0, 2]] / doc_width
    bboxes[:, [1, 3]] = bboxes[:, [1, 3]] / doc_height
|
||||
print("Token predictions:")
|
||||
for token, pred in zip(tokens, predictions[0]):
|
||||
print(f"'{token}' -> Class {pred.item()}")
|
||||
```
|
||||
|
||||
- [`~transformers.BrosForTokenClassification.forward`, `~transformers.BrosSpadeEEForTokenClassification.forward`, `~transformers.BrosSpadeELForTokenClassification.forward`] require not only `input_ids` and `bbox` but also `box_first_token_mask` for loss calculation. It is a mask that filters out non-first tokens of each box. Obtain this mask by saving the start token indices of bounding boxes when creating `input_ids` from words, and build `box_first_token_mask` with the following code:
|
||||
|
||||
```python
|
||||
import itertools

import numpy as np

def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
|
||||
|
||||
box_first_token_mask = np.zeros(max_seq_length, dtype=np.bool_)
|
||||
|
||||
# encode(tokenize) each word from words (list[str])
|
||||
input_ids_list: list[list[int]] = [tokenizer.encode(e, add_special_tokens=False) for e in words]
|
||||
|
||||
# get the length of each box
|
||||
tokens_length_list: list[int] = [len(l) for l in input_ids_list]
|
||||
|
||||
box_end_token_indices = np.array(list(itertools.accumulate(tokens_length_list)))
|
||||
box_start_token_indices = box_end_token_indices - np.array(tokens_length_list)
|
||||
|
||||
# filter out the indices that are out of max_seq_length
|
||||
box_end_token_indices = box_end_token_indices[box_end_token_indices < max_seq_length - 1]
|
||||
if len(box_start_token_indices) > len(box_end_token_indices):
|
||||
box_start_token_indices = box_start_token_indices[: len(box_end_token_indices)]
|
||||
|
||||
# set box_start_token_indices to True
|
||||
box_first_token_mask[box_start_token_indices] = True
|
||||
|
||||
return box_first_token_mask
|
||||
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
- Demo scripts can be found [here](https://github.com/clovaai/bros).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BrosConfig
|
||||
|
||||
@ -115,3 +70,4 @@ def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
|
||||
|
||||
[[autodoc]] BrosSpadeELForTokenClassification
|
||||
- forward
|
||||
|
||||
|
@ -13,127 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-05-28 and added to Hugging Face Transformers on 2021-06-01.*
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2021-05-28 and added to Hugging Face Transformers on 2021-06-01 and contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).*
|
||||
|
||||
# ByT5
|
||||
|
||||
[ByT5](https://huggingface.co/papers/2105.13626) is a tokenizer-free version of the [T5](./t5) model designed to work directly on raw UTF-8 bytes. This means it can process any language, is more robust to noise like typos, and is simpler to use because it doesn't require a preprocessing pipeline.
|
||||
|
||||
You can find all the original ByT5 checkpoints under the [Google](https://huggingface.co/google?search_models=byt5) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the [T5](./t5) docs for more examples of how to apply ByT5 to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`] and from the command line.
|
||||
[ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://huggingface.co/papers/2105.13626) explores the use of standard Transformer architectures to process byte sequences directly, eliminating the need for tokenization. This approach offers benefits such as language-agnostic processing, robustness to noise, and reduced preprocessing complexity. The study demonstrates that byte-level models can compete with token-level models in terms of parameter count, training computational cost, and inference speed. Additionally, byte-level models show superior performance on tasks sensitive to spelling and pronunciation. The paper introduces a new set of pre-trained byte-level Transformer models based on the T5 architecture.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text2text-generation",
|
||||
model="google/byt5-small",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline("translate English to French: The weather is nice today")
|
||||
pipeline = pipeline(task="text2text-generation", model="google/byt5-small", dtype="auto")
|
||||
pipeline("translate English to French: Plants generate energy through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/byt5-small"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/byt5-small",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
|
||||
|
||||
input_ids = tokenizer("summarize: Photosynthesis is the process by which plants, algae, and some bacteria convert light energy into chemical energy.", return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
inputs = tokenizer("translate English to French: Plants generate energy through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers">
|
||||
|
||||
```bash
|
||||
echo -e "translate English to French: Life is beautiful." | transformers run --task text2text-generation --model google/byt5-small --device 0
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
</hfoptions>
|
||||
|
||||
## Quantization
|
||||
## Usage tips
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
|
||||
```python
|
||||
# pip install torchao
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/byt5-xl",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/byt5-xl")
|
||||
input_ids = tokenizer("translate English to French: The weather is nice today.", return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- It is recommended to use the tokenizer for batched inference and training.
|
||||
- The example below shows how to use the model without a tokenizer.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForSeq2SeqLM
|
||||
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")
|
||||
|
||||
num_special_tokens = 3
|
||||
|
||||
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens
|
||||
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens
|
||||
loss = model(input_ids, labels=labels).loss
|
||||
loss.item()
|
||||
```
|
||||
|
||||
- ByT5 uses the top byte values (258, 257, etc.) for masking instead of sentinel tokens like `<extra_id_0>`.
|
||||
|
||||
```python
|
||||
# Example: character-level denoising with mask tokens
|
||||
input_ids = tokenizer("The dog chases a ball in the park.").input_ids
|
||||
masked_input = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
|
||||
output = model.generate(masked_input, max_length=100)
|
||||
```
|
||||
- Use the tokenizer for batched inference and training, as shown in the sketch after this list.
|
||||
- ByT5 uses top byte values (258, 257, etc.) for masking instead of sentinel tokens like `<extra_id_0>`.
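
A minimal sketch of batched tokenization with the small checkpoint; padding and truncation keep every byte sequence in the batch the same length.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

batch = tokenizer(
    ["Life is like a box of chocolates.", "Today is Monday."],
    padding="longest",
    truncation=True,
    return_tensors="pt",
)
print(batch.input_ids.shape)  # both byte sequences padded to the longest one
```
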
|
||||
|
||||
## ByT5Tokenizer
|
||||
|
||||
[[autodoc]] ByT5Tokenizer
|
||||
|
||||
See [`ByT5Tokenizer`] for all details.
|
||||
|
||||
|
@ -13,108 +13,50 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16 and contributed by [almanach](https://huggingface.co/almanach).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# CamemBERT
|
||||
|
||||
[CamemBERT](https://huggingface.co/papers/1911.03894) is a language model based on [RoBERTa](./roberta), but trained specifically on French text from the OSCAR dataset, making it more effective for French language tasks.
|
||||
|
||||
What sets CamemBERT apart is that it learned from a large, high-quality collection of French data rather than a mix of many languages. This helps it capture French better than many multilingual models.
|
||||
|
||||
Common applications of CamemBERT include masked language modeling (fill-mask prediction), text classification (sentiment analysis), token classification (entity recognition), and sentence pair classification (entailment tasks).
|
||||
|
||||
You can find all the original CamemBERT checkpoints under the [ALMAnaCH](https://huggingface.co/almanach/models?search=camembert) organization.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by the [ALMAnaCH (Inria)](https://huggingface.co/almanach) team.
|
||||
>
|
||||
> Click on the CamemBERT models in the right sidebar for more examples of how to apply CamemBERT to different NLP tasks.
|
||||
|
||||
The examples below demonstrate how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[CamemBERT: a Tasty French Language Model](https://huggingface.co/papers/1911.03894) is a French version of the BERT model, trained on 138GB of French text. It addresses the limitation of existing models that are either English-centric or multilingual, offering improved performance in French-specific tasks such as part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. The pretrained CamemBERT model is released to encourage further research and applications in French NLP.
|
||||
|
||||
<hfoptions id="usage">
|
||||
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline("fill-mask", model="camembert-base", dtype=torch.float16, device=0)
|
||||
pipeline("Le camembert est un délicieux fromage <mask>.")
|
||||
pipeline = pipeline(task="fill-mask", model="almanach/camembert-base", dtype="auto")
|
||||
pipeline("Les plantes créent <mask> grâce à un processus appelé photosynthèse.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForMaskedLM
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
|
||||
model = AutoModelForMaskedLM.from_pretrained("camembert-base", dtype="auto", device_map="auto", attn_implementation="sdpa")
|
||||
inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForMaskedLM.from_pretrained("almanach/camembert-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-base")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
inputs = tokenizer("Les plantes créent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Le camembert est un délicieux fromage <mask>." | transformers run --task fill-mask --model camembert-base --device 0
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing weights in lower precision. Refer to the [Quantization](../quantization/overview) overview for available options.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) quantization to quantize the weights to 8-bits.
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForMaskedLM, BitsAndBytesConfig
|
||||
import torch
|
||||
|
||||
quant_config = BitsAndBytesConfig(load_in_8bit=True)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"almanach/camembert-large",
|
||||
quantization_config=quant_config,
|
||||
device_map="auto"
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-large")
|
||||
|
||||
inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to(model.device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
## CamembertConfig
|
||||
|
||||
[[autodoc]] CamembertConfig
|
||||
@ -158,3 +100,4 @@ print(f"The predicted token is: {predicted_token}")
|
||||
## CamembertForQuestionAnswering
|
||||
|
||||
[[autodoc]] CamembertForQuestionAnswering
|
||||
|
||||
|
@ -13,24 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-03-11 and added to Hugging Face Transformers on 2021-06-30.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2021-03-11 and added to Hugging Face Transformers on 2021-06-30 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# CANINE
|
||||
|
||||
[CANINE](https://huggingface.co/papers/2103.06874) is a tokenization-free Transformer. It skips the usual step of splitting text into subwords or wordpieces and processes text character by character. That means it works directly with raw Unicode, making it especially useful for languages with complex or inconsistent tokenization rules and even noisy inputs like typos. Since working with characters means handling longer sequences, CANINE downsamples the input early on so the deep transformer stack doesn't process every character individually. This keeps inference fast and efficient.
|
||||
|
||||
You can find all the original CANINE checkpoints under the [Google](https://huggingface.co/google?search_models=canine) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the CANINE models in the right sidebar for more examples of how to apply CANINE to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate embeddings with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://huggingface.co/papers/2103.06874) presents CANINE, a neural encoder that processes text directly at the Unicode character level without explicit tokenization or vocabulary. It addresses the challenges of varying language suitability and vocabulary limitations by using a downsampling strategy to manage longer sequences and a deep Transformer stack to capture context. CANINE achieves a 2.8 F1 score improvement on TyDi QA compared to a similar mBERT model, despite having 28% fewer parameters.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -39,13 +26,8 @@ The example below demonstrates how to generate embeddings with [`Pipeline`], [`A
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="feature-extraction",
|
||||
model="google/canine-c",
|
||||
device=0,
|
||||
)
|
||||
|
||||
pipeline("Plant create energy through a process known as photosynthesis.")
|
||||
pipeline = pipeline(task="text-classification", model="google/canine-s", dtype="auto")
|
||||
pipeline("Plants are amazing because they can create energy from the sun.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -53,41 +35,25 @@ pipeline("Plant create energy through a process known as photosynthesis.")
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModel
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
|
||||
model = AutoModel.from_pretrained("google/canine-c")
|
||||
model = AutoModelForSequenceClassification.from_pretrained("google/canine-s", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/canine-s")
|
||||
|
||||
text = "Plant create energy through a process known as photosynthesis."
|
||||
input_ids = torch.tensor([[ord(char) for char in text]])
|
||||
|
||||
outputs = model(input_ids)
|
||||
pooled_output = outputs.pooler_output
|
||||
sequence_output = outputs.last_hidden_state
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plant create energy through a process known as photosynthesis." | transformers run --task feature-extraction --model google/canine-c --device 0
|
||||
inputs = tokenizer("Plants are amazing because they can create energy from the sun.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
predicted_class_id = outputs.logits.argmax(dim=-1).item()
|
||||
label = model.config.id2label[predicted_class_id]
|
||||
print(f"Predicted label: {label}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- CANINE skips tokenization entirely; it works directly on raw characters, not subwords. Use it with or without a tokenizer. For batched inference and training, use the tokenizer to pad and truncate all sequences to the same length.
|
||||
|
||||
```py
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/canine-c")
|
||||
inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
|
||||
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
|
||||
```
|
||||
|
||||
- CANINE is primarily designed to be fine-tuned on a downstream task. The pretrained model can be used for either masked language modeling or next sentence prediction.
|
||||
- CANINE skips tokenization entirely. It works directly on raw characters, not subwords. Use it with or without a tokenizer. For batched inference and training, use the tokenizer to pad and truncate all sequences to the same length. The sketch after this list shows tokenizer-free input.
|
||||
- CANINE is designed for fine-tuning on downstream tasks. The pretrained model handles masked language modeling or next sentence prediction.
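
A minimal tokenizer-free sketch, assuming the `google/canine-c` checkpoint from above; it feeds raw Unicode code points straight to the encoder.

```py
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("google/canine-c")

text = "Plants create energy through a process known as photosynthesis."
# CANINE consumes Unicode code points directly, one per character
input_ids = torch.tensor([[ord(char) for char in text]])

outputs = model(input_ids)
print(outputs.last_hidden_state.shape)  # one hidden state per character
```
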
|
||||
|
||||
## CanineConfig
|
||||
|
||||
@ -128,3 +94,4 @@ echo -e "Plant create energy through a process known as photosynthesis." | trans
|
||||
|
||||
[[autodoc]] CanineForQuestionAnswering
|
||||
- forward
|
||||
|
||||
|
@ -13,163 +13,54 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-05-16 and added to Hugging Face Transformers on 2024-07-17.*
|
||||
*This model was released on 2024-05-16 and added to Hugging Face Transformers on 2024-07-17 and contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Chameleon
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://huggingface.co/papers/2405.09818v1) is a Vision-Language Model that uses vector quantization to tokenize images, enabling it to generate multimodal output. It handles images and texts in any sequence, including interleaved formats, and produces textual responses. Chameleon demonstrates superior performance in image captioning, outperforms Llama-2 in text-only tasks, and is competitive with Mixtral 8x7B and Gemini-Pro. It also performs non-trivial image generation and matches or exceeds the performance of larger models like Gemini Pro and GPT-4V in long-form mixed-modal generation tasks.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://huggingface.co/papers/2405.09818) by the Meta AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including an interleaved format, and generates textual responses. The image generation module has not been released yet.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training
|
||||
approach from inception, an alignment recipe, and an architectural parameterization tailored for the
|
||||
early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range
|
||||
of tasks, including visual question answering, image captioning, text generation, image generation, and
|
||||
long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including
|
||||
state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while
|
||||
being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image
|
||||
generation, all in a single model. It also matches or exceeds the performance of much larger models,
|
||||
including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal
|
||||
generation evaluation, where either the prompt or outputs contain mixed sequences of both images and
|
||||
text. Chameleon marks a significant step forward in unified modeling of full multimodal documents*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/chameleon_arch.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image generation using an auto-regressive transformer. Taken from the <a href="https://huggingface.co/papers/2405.09818">original paper.</a> </small>
|
||||
|
||||
This model was contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
|
||||
The original code can be found [here](https://github.com/facebookresearch/chameleon).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Use `padding_side="left"` when computing batched generation because it leads to more accurate results. Set `processor.tokenizer.padding_side = "left"` before generating (see the sketch after this list).
|
||||
|
||||
- Chameleon was tuned for safety alignment. If the model refuses to answer, ask a more concrete question instead of an open-ended one.
|
||||
|
||||
- Chameleon generates in chat format, which means the generated text is always the "assistant's turn". Enable text completion generation by passing `return_for_text_completion=True` when calling the processor.
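
The sketch below shows the left-padding setup for batched, text-only generation. It's a minimal example that assumes access to the gated checkpoint; the prompts are illustrative.

```py
import torch
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto"
)

# left padding gives more accurate results for batched generation
processor.tokenizer.padding_side = "left"

prompts = ["Describe what a vector quantizer does.", "Write one sentence about autumn."]
inputs = processor(text=prompts, padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(output, skip_special_tokens=True))
```
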
|
||||
|
||||
> [!NOTE]
|
||||
> The Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. Instead of adding a new token, it reuses one of the reserved tokens: `<reserved08707>`. Add `<image>` to your prompt where the image should be embedded for correct generation.
|
||||
|
||||
## Usage example
|
||||
|
||||
### Single image inference
|
||||
|
||||
Chameleon is a gated model, so make sure you have access and are logged in to the Hugging Face Hub with a token.
|
||||
Here's how to load the model and perform inference in half-precision (`torch.bfloat16`):
|
||||
|
||||
```python
|
||||
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="image-to-text", model="facebook/chameleon-7b", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", text="What is shown in this image? <image>"
|
||||
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="ChameleonForConditionalGeneration">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from PIL import Image
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, ChameleonForConditionalGeneration
|
||||
|
||||
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")
|
||||
processor = AutoProcessor.from_pretrained("facebook/chameleon-7b")
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype="auto")
|
||||
|
||||
# prepare image and text prompt
|
||||
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
prompt = "What do you see in this image?<image>"
|
||||
prompt = "What is shown in this image?<image>"
|
||||
|
||||
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
|
||||
|
||||
# autoregressively complete prompt
|
||||
inputs = processor(images=image, text=prompt, return_tensors="pt").to(torch.bfloat16)
|
||||
output = model.generate(**inputs, max_new_tokens=50)
|
||||
print(processor.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
### Multi image inference
|
||||
|
||||
Chameleon can perform inference with multiple images as input, where the images belong either to the same prompt or to different prompts (in batched inference). Here's how:
|
||||
|
||||
```python
|
||||
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
|
||||
import torch
|
||||
from PIL import Image
|
||||
import requests
|
||||
|
||||
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
|
||||
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")
|
||||
|
||||
# Get three different images
|
||||
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
|
||||
image_stop = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image_cats = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
|
||||
image_snowman = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
# Prepare a batched prompt, where the first one is a multi-image prompt and the second is not
|
||||
prompts = [
|
||||
"What do these images have in common?<image><image>",
|
||||
"<image>What is shown in this image?"
|
||||
]
|
||||
|
||||
# We can simply feed images in the order they have to be used in the text prompt
|
||||
# Each "<image>" token uses one image leaving the next for the subsequent "<image>" tokens
|
||||
inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(device=model.device, dtype=torch.bfloat16)
|
||||
|
||||
# Generate
|
||||
generate_ids = model.generate(**inputs, max_new_tokens=50)
|
||||
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
|
||||
```
|
||||
|
||||
## Model optimization
|
||||
|
||||
### Quantization using Bitsandbytes
|
||||
|
||||
Load the model in 8-bit or 4-bit precision to greatly reduce memory requirements while maintaining the performance of the original model. First install bitsandbytes with `pip install bitsandbytes` and make sure you have access to a GPU or accelerator supported by the library.
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Change the snippet above to:
|
||||
|
||||
```python
|
||||
import torch
from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig
|
||||
|
||||
# specify how to quantize the model
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.bfloat16,
|
||||
)
|
||||
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", quantization_config=quantization_config, device_map="auto")
|
||||
```
|
||||
|
||||
### Use Flash Attention 2 and SDPA to further speed up generation
|
||||
|
||||
The model supports both Flash Attention 2 and PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) for faster generation. SDPA is the default option when you load the model. To switch to Flash Attention 2, first install flash-attn. Refer to the [original repository](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then change the snippet above to:
|
||||
|
||||
```python
|
||||
import torch
from transformers import ChameleonForConditionalGeneration
|
||||
|
||||
model_id = "facebook/chameleon-7b"
|
||||
model = ChameleonForConditionalGeneration.from_pretrained(
|
||||
model_id,
|
||||
dtype=torch.bfloat16,
|
||||
attn_implementation="flash_attention_2"
|
||||
).to(0)
|
||||
```
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ChameleonConfig
|
||||
|
||||
@ -207,3 +98,4 @@ model = ChameleonForConditionalGeneration.from_pretrained(
|
||||
|
||||
[[autodoc]] ChameleonForConditionalGeneration
|
||||
- forward
|
||||
|
||||
|
@ -13,65 +13,41 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-11-02 and added to Hugging Face Transformers on 2022-12-01.*
|
||||
*This model was released on 2022-11-02 and added to Hugging Face Transformers on 2022-12-01 and contributed by [OFA-Sys](https://huggingface.co/OFA-Sys).*
|
||||
|
||||
# Chinese-CLIP
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Chinese-CLIP](https://huggingface.co/papers/2211.01335) constructs a large-scale dataset of Chinese image-text pairs and pretrains models of varying sizes, from 77 to 958 million parameters. It employs a two-stage pretraining method, initially freezing the image encoder before optimizing all parameters. Experiments show superior performance on MUGE, Flickr30K-CN, and COCO-CN for zero-shot learning and finetuning, and competitive results in zero-shot image classification on the ELEVATER benchmark.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="ChineseCLIPModel">
|
||||
|
||||
The Chinese-CLIP model was proposed in [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://huggingface.co/papers/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
|
||||
Chinese-CLIP is an implementation of CLIP (Radford et al., 2021) trained on a large-scale dataset of Chinese image-text pairs. It performs cross-modal retrieval and also serves as a vision backbone for tasks like zero-shot image classification and open-domain object detection. The original Chinese-CLIP code is available in the [OFA-Sys repository](https://github.com/OFA-Sys/Chinese-CLIP).
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, ChineseCLIPModel
|
||||
|
||||
The abstract from the paper is the following:
|
||||
model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
|
||||
|
||||
*The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). Our codes, pretrained models, and demos have been released.*
|
||||
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
# Squirtle, Bulbasaur, Charmander, Pikachu in English
|
||||
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
|
||||
|
||||
The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys).
|
||||
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image
|
||||
probs = logits_per_image.softmax(dim=1)
|
||||
|
||||
## Usage example
|
||||
|
||||
The code snippet below shows how to compute image & text features and similarities:
|
||||
|
||||
```python
|
||||
>>> from PIL import Image
|
||||
>>> import requests
|
||||
>>> from transformers import ChineseCLIPProcessor, ChineseCLIPModel
|
||||
|
||||
>>> model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
|
||||
>>> processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
|
||||
|
||||
>>> url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
>>> # Squirtle, Bulbasaur, Charmander, Pikachu in English
|
||||
>>> texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
|
||||
|
||||
>>> # compute image feature
|
||||
>>> inputs = processor(images=image, return_tensors="pt")
|
||||
>>> image_features = model.get_image_features(**inputs)
|
||||
>>> image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True) # normalize
|
||||
|
||||
>>> # compute text features
|
||||
>>> inputs = processor(text=texts, padding=True, return_tensors="pt")
|
||||
>>> text_features = model.get_text_features(**inputs)
|
||||
>>> text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True) # normalize
|
||||
|
||||
>>> # compute image-text similarity scores
|
||||
>>> inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
|
||||
>>> outputs = model(**inputs)
|
||||
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
||||
>>> probs = logits_per_image.softmax(dim=1) # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]
|
||||
print("Text-image similarity probabilities:")
|
||||
for i, (text, prob) in enumerate(zip(texts, probs[0])):
|
||||
print(f"'{text}' -> {prob.item():.4f} ({prob.item()*100:.1f}%)")
|
||||
```
|
||||
|
||||
Currently, the following pretrained Chinese-CLIP models are available on the 🤗 Hub:
|
||||
|
||||
- [OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16)
|
||||
- [OFA-Sys/chinese-clip-vit-large-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14)
|
||||
- [OFA-Sys/chinese-clip-vit-large-patch14-336px](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px)
|
||||
- [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14)
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ChineseCLIPConfig
|
||||
|
||||
@ -115,3 +91,4 @@ Currently, following scales of pretrained Chinese-CLIP models are available on
|
||||
|
||||
[[autodoc]] ChineseCLIPVisionModel
|
||||
- forward
|
||||
|
||||
|
@ -13,48 +13,35 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-02-16.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-02-16 and contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).*
|
||||
|
||||
# CLAP
|
||||
|
||||
[CLAP (Contrastive Language-Audio Pretraining)](https://huggingface.co/papers/2211.06687) is a multimodal model that combines audio data with natural language descriptions through contrastive learning.
|
||||
|
||||
It incorporates feature fusion and keyword-to-caption augmentation to process variable-length audio inputs and to improve performance. CLAP doesn't require task-specific training data and can learn meaningful audio representations through natural language.
|
||||
|
||||
You can find all the original CLAP checkpoints under the [CLAP](https://huggingface.co/collections/laion/clap-contrastive-language-audio-pretraining-65415c0b18373b607262a490) collection.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).
|
||||
>
|
||||
> Click on the CLAP models in the right sidebar for more examples of how to apply CLAP to different audio retrieval and classification tasks.
|
||||
|
||||
The example below demonstrates how to extract text embeddings with the [`AutoModel`] class.
|
||||
[CLAP](https://huggingface.co/papers/2211.06687) is a neural network trained on a large dataset of audio-text pairs to develop a multimodal representation. It uses a SWINTransformer for audio feature extraction from log-Mel spectrograms and a RoBERTa model for text feature extraction. Both feature sets are projected into a shared latent space, where their similarity is measured using a dot product. The model incorporates feature fusion and keyword-to-caption augmentation to handle variable audio lengths and improve performance. Evaluations across text-to-audio retrieval, zero-shot audio classification, and supervised audio classification show that CLAP achieves superior results in text-to-audio retrieval and state-of-the-art performance in zero-shot audio classification, comparable to non-zero-shot models.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="AutoModel">
|
||||
<hfoption id="ClapModel">
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
```py
|
||||
from datasets import load_dataset
|
||||
from transformers import AutoProcessor, ClapModel
|
||||
|
||||
model = AutoModel.from_pretrained("laion/clap-htsat-unfused", dtype=torch.float16, device_map="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
|
||||
dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example")
|
||||
audio_sample = dataset["train"]["audio"][0]["array"]
|
||||
|
||||
texts = ["the sound of a cat", "the sound of a dog", "music playing"]
|
||||
model = ClapModel.from_pretrained("laion/clap-htsat-unfused", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("laion/clap-htsat-unfused")
|
||||
|
||||
inputs = tokenizer(texts, padding=True, return_tensors="pt").to(model.device)
|
||||
input_text = ["Sound of a dog", "Sound of vacuum cleaner"]
|
||||
|
||||
with torch.no_grad():
|
||||
text_features = model.get_text_features(**inputs)
|
||||
inputs = processor(text=input_text, audios=audio_sample, return_tensors="pt", padding=True)
|
||||
|
||||
print(f"Text embeddings shape: {text_features.shape}")
|
||||
print(f"Text embeddings: {text_features}")
|
||||
outputs = model(**inputs)
|
||||
logits_per_audio = outputs.logits_per_audio
|
||||
probs = logits_per_audio.softmax(dim=-1)
|
||||
|
||||
for i, prob in enumerate(probs[0]):
|
||||
print(f"{input_text[i]}: {prob.item():.3f}")
|
||||
```
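
For retrieval-style use, the sketch below extracts text and audio embeddings in the shared latent space. It's a minimal sketch assuming the same `laion/clap-htsat-unfused` checkpoint and the ESC-50 dog example used above.

```py
import torch
from datasets import load_dataset
from transformers import AutoProcessor, ClapModel

model = ClapModel.from_pretrained("laion/clap-htsat-unfused", dtype="auto")
processor = AutoProcessor.from_pretrained("laion/clap-htsat-unfused")

dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example")
audio_sample = dataset["train"]["audio"][0]["array"]

text_inputs = processor(text=["Sound of a dog", "Sound of vacuum cleaner"], return_tensors="pt", padding=True)
audio_inputs = processor(audios=audio_sample, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    audio_embeds = model.get_audio_features(**audio_inputs)

# both embeddings live in the shared projection space
print(text_embeds.shape, audio_embeds.shape)
```
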
|
||||
|
||||
</hfoption>
|
||||
@ -106,3 +93,4 @@ print(f"Text embeddings: {text_features}")
|
||||
|
||||
[[autodoc]] ClapAudioModelWithProjection
|
||||
- forward
|
||||
|
||||
|
@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-02-26 and added to Hugging Face Transformers on 2021-05-12.*
|
||||
*This model was released on 2021-02-26 and added to Hugging Face Transformers on 2021-05-12 and contributed by [valhalla](https://huggingface.co/valhalla).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -25,14 +24,7 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# CLIP
|
||||
|
||||
[CLIP](https://huggingface.co/papers/2103.00020) is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining on this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and text encoder to get visual features and text features. Both features are projected to a latent space with the same number of dimensions, and their dot product gives a similarity score.
|
||||
|
||||
You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai?search_models=clip) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks.
|
||||
|
||||
The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[CLIP](https://huggingface.co/papers/2103.00020) is a neural network trained on 400 million (image, text) pairs from the internet. It learns to predict which caption corresponds to which image, enabling zero-shot transfer to various computer vision tasks. Benchmarked on over 30 datasets, CLIP demonstrates competitive performance without task-specific training, matching ResNet-50's accuracy on ImageNet zero-shot without using its training examples.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,49 +33,49 @@ The example below demonstrates how to calculate similarity scores between multip
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
clip = pipeline(
|
||||
task="zero-shot-image-classification",
|
||||
model="openai/clip-vit-base-patch32",
|
||||
dtype=torch.bfloat16,
|
||||
device=0
|
||||
)
|
||||
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
|
||||
clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
|
||||
pipeline = pipeline(task="zero-shot-image-classification", model="openai/clip-vit-base-patch32", dtype="auto")
|
||||
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
import requests
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, AutoModel
|
||||
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
|
||||
|
||||
model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", dtype=torch.bfloat16, attn_implementation="sdpa")
|
||||
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
||||
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-base-patch32", dtype="auto")
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = requests.get(url, stream=True)
|
||||
inputs = Image.open(image.raw).convert("RGB")
|
||||
|
||||
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
|
||||
image_inputs = processor(images=inputs, return_tensors="pt").to(model.device)
|
||||
with torch.no_grad():
|
||||
image_embeds = model.get_image_features(**image_inputs)
|
||||
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image
|
||||
probs = logits_per_image.softmax(dim=1)
|
||||
most_likely_idx = probs.argmax(dim=1).item()
|
||||
most_likely_label = labels[most_likely_idx]
|
||||
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
|
||||
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
|
||||
text_inputs = processor(text=candidate_labels, padding=True, return_tensors="pt").to(model.device)
|
||||
with torch.no_grad():
|
||||
text_embeds = model.get_text_features(**text_inputs)
|
||||
|
||||
image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
|
||||
text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
|
||||
|
||||
logits = (image_embeds @ text_embeds.T) * 100.0
|
||||
probs = logits.softmax(dim=-1).cpu().squeeze()
|
||||
|
||||
for label, score in zip(candidate_labels, probs):
|
||||
print(f"{label:20s} → {score.item():.4f}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
|
||||
- Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model.
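
A minimal sketch of that preprocessing step; the COCO image URL is just an example input.

```py
import requests
from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # resized and normalized, ready for the vision encoder
```
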
|
||||
|
||||
## CLIPConfig
|
||||
|
||||
[[autodoc]] CLIPConfig
|
||||
@ -153,3 +145,4 @@ print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_
|
||||
|
||||
[[autodoc]] CLIPForImageClassification
|
||||
- forward
|
||||
|
||||
|
@ -13,61 +13,41 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-12-18 and added to Hugging Face Transformers on 2022-11-08.*
|
||||
*This model was released on 2021-12-18 and added to Hugging Face Transformers on 2022-11-08 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# CLIPSeg
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CLIPSeg](https://huggingface.co/papers/2112.10003) extends the CLIP model with a transformer-based decoder to enable zero-shot and one-shot image segmentation using arbitrary text or image prompts. This unified model can handle referring expression segmentation, zero-shot segmentation, and one-shot segmentation tasks. Trained on an extended PhraseCut dataset, CLIPSeg generates binary segmentation maps based on free-text or image queries, demonstrating adaptability to various binary segmentation tasks involving affordances or properties.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="CLIPSegModel">
|
||||
|
||||
The CLIPSeg model was proposed in [Image Segmentation Using Text and Image Prompts](https://huggingface.co/papers/2112.10003) by Timo Lüddecke
|
||||
and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero-shot and one-shot image segmentation.
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoProcessor, CLIPSegModel
|
||||
from transformers.image_utils import load_image
|
||||
|
||||
The abstract from the paper is the following:
|
||||
processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
|
||||
model = CLIPSegModel.from_pretrained("CIDAS/clipseg-rd64-refined", dtype="auto")
|
||||
|
||||
*Image segmentation is usually addressed by training a
|
||||
model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive
|
||||
as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system
|
||||
that can generate image segmentations based on arbitrary
|
||||
prompts at test time. A prompt can be either a text or an
|
||||
image. This approach enables us to create a unified model
|
||||
(trained once) for three common segmentation tasks, which
|
||||
come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation.
|
||||
We build upon the CLIP model as a backbone which we extend with a transformer-based decoder that enables dense
|
||||
prediction. After training on an extended version of the
|
||||
PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on
|
||||
an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail.
|
||||
This novel hybrid input allows for dynamic adaptation not
|
||||
only to the three segmentation tasks mentioned above, but
|
||||
to any binary segmentation task where a text or image query
|
||||
can be formulated. Finally, we find our system to adapt well
|
||||
to generalized queries involving affordances or properties*
|
||||
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
texts = ["a photo of a cat", "a photo of a dog"]
|
||||
inputs = processor(
|
||||
text=texts, images=image, return_tensors="pt", padding=True
|
||||
)
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/clipseg_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
with torch.inference_mode():
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image
|
||||
probs = logits_per_image.softmax(dim=1)
|
||||
|
||||
<small> CLIPSeg overview. Taken from the <a href="https://huggingface.co/papers/2112.10003">original paper.</a> </small>
|
||||
print("Text-image similarity probabilities:")
|
||||
for i, (text, prob) in enumerate(zip(texts, probs[0])):
|
||||
print(f"'{text}' -> {prob.item():.4f} ({prob.item()*100:.1f}%)")
|
||||
```
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/timojl/clipseg).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- [`CLIPSegForImageSegmentation`] adds a decoder on top of [`CLIPSegModel`]. The latter is identical to [`CLIPModel`].
|
||||
- [`CLIPSegForImageSegmentation`] can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text
|
||||
(provided to the model as `input_ids`) or an image (provided to the model as `conditional_pixel_values`). One can also provide custom
|
||||
conditional embeddings (provided to the model as `conditional_embeddings`).
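
A minimal sketch of text-prompted segmentation with the refined checkpoint; the image and prompts are illustrative.

```py
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, CLIPSegForImageSegmentation

processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompts = ["a cat", "a remote control"]

# one (text, image) pair per prompt
inputs = processor(text=prompts, images=[image] * len(prompts), padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

masks = torch.sigmoid(outputs.logits)  # one low-resolution mask per prompt
print(masks.shape)
```
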
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIPSeg. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
<PipelineTag pipeline="image-segmentation"/>
|
||||
|
||||
- A notebook that illustrates [zero-shot image segmentation with CLIPSeg](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/CLIPSeg/Zero_shot_image_segmentation_with_CLIPSeg.ipynb).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## CLIPSegConfig
|
||||
|
||||
@ -106,3 +86,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] CLIPSegForImageSegmentation
|
||||
- forward
|
||||
|
||||
|
@ -13,63 +13,36 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2023-05-12 and added to Hugging Face Transformers on 2023-11-10.*
|
||||
*This model was released on 2023-05-12 and added to Hugging Face Transformers on 2023-11-10 and contributed by [susnato](https://huggingface.co/susnato).*
|
||||
|
||||
# CLVP
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CLVP](https://huggingface.co/papers/2305.07243) applies advancements from image generation, specifically autoregressive transformers and DDPMs, to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="ClvpModelForConditionalGeneration">
|
||||
|
||||
The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in [Better speech synthesis through scaling](https://huggingface.co/papers/2305.07243) by James Betker.
|
||||
```py
|
||||
import datasets
|
||||
import torch
|
||||
from transformers import AutoProcessor, ClvpModelForConditionalGeneration
|
||||
|
||||
The abstract from the paper is the following:
|
||||
text = "Plants create energy through a process known as photosynthesis."
|
||||
|
||||
*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.*
|
||||
ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050))
|
||||
sample = ds[0]["audio"]
|
||||
|
||||
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
|
||||
The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
|
||||
processor = AutoProcessor.from_pretrained("susnato/clvp_dev")
|
||||
model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev", dtype="auto")
|
||||
|
||||
## Usage tips
|
||||
|
||||
1. CLVP is an integral part of the Tortoise TTS model.
|
||||
2. CLVP compares different generated speech candidates with the provided text, and the best speech tokens are forwarded to the diffusion model.
|
||||
3. Use the [`ClvpModelForConditionalGeneration.generate()`] method for Tortoise usage.
|
||||
4. The CLVP model expects audio sampled at 22.05 kHz, unlike other audio models which expect 16 kHz.
|
||||
|
||||
## Brief explanation
|
||||
|
||||
- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.
|
||||
- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio.
|
||||
- The [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates.
|
||||
- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]) which converts them into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space.
|
||||
- At the end, we compare each speech vector with the text vector to see which speech vector is most similar to the text vector.
|
||||
- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> import datasets
|
||||
>>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration
|
||||
|
||||
>>> # Define the Text and Load the Audio (We are taking an audio example from HuggingFace Hub using `datasets` library).
|
||||
>>> text = "This is an example text."
|
||||
|
||||
>>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
>>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050))
|
||||
>>> sample = ds[0]["audio"]
|
||||
|
||||
>>> # Define processor and model.
|
||||
>>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev")
|
||||
>>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev")
|
||||
|
||||
>>> # Generate processor output and model output.
|
||||
>>> processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt")
|
||||
>>> generated_output = model.generate(**processor_output)
|
||||
>>> # Or run a plain forward pass instead of generation.
>>> outputs = model(**processor_output)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ClvpConfig
|
||||
|
||||
[[autodoc]] ClvpConfig
|
||||
@ -122,3 +95,4 @@ Example :
|
||||
## ClvpDecoder
|
||||
|
||||
[[autodoc]] ClvpDecoder
|
||||
|
||||
|
@ -13,24 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2023-08-24 and added to Hugging Face Transformers on 2023-08-25.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2023-04-27 and added to Hugging Face Transformers on 2023-08-25 and contributed by [ArthurZ](https://huggingface.co/ArthurZ).*
|
||||
|
||||
# CodeLlama
|
||||
|
||||
[Code Llama](https://huggingface.co/papers/2308.12950) is a specialized family of large language models based on [Llama 2](./llama2) for coding tasks. It comes in different flavors, a general code model, a Python-specific model, and an instruction-following variant, all available in 7B, 13B, 34B, and 70B parameter sizes. Code Llama models can generate, explain, and even fill in missing parts of your code (called "infilling"). They can also handle very long contexts with stable generation up to 100k tokens, even though they were trained on sequences of 16K tokens.
|
||||
|
||||
You can find all the original Code Llama checkpoints under the [Code Llama](https://huggingface.co/collections/meta-llama/code-llama-family-661da32d0a9d678b6f55b933) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Code Llama models in the right sidebar for more examples of how to apply Code Llama to different coding tasks.
|
||||
|
||||
The example below demonstrates how to generate code with [`Pipeline`] or [`AutoModel`], and from the command line.
|
||||
[CodeLlama](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) is a family of large language models for code, built on Llama 2, offering state-of-the-art performance among open models. It includes foundation models, Python specializations, and instruction-following models in 7B, 13B, and 34B parameter sizes. These models support infilling, handle large input contexts, and perform zero-shot instruction following for programming tasks. Trained on sequences of 16k tokens, they show improvements with inputs up to 100k tokens. The 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama achieves top scores on HumanEval and MBPP benchmarks, with Code Llama - Python 7B outperforming Llama 2 70B on these tasks. All models outperform other publicly available models on MultiPL-E. Code Llama is released under a permissive license for both research and commercial use.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -39,20 +26,8 @@ The example below demonstrates how to generate code with [`Pipeline`], or the [`
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(
|
||||
"text-generation",
|
||||
model="meta-llama/CodeLlama-7b-hf",
|
||||
dtype=torch.float16,
|
||||
device_map=0
|
||||
)
|
||||
|
||||
# basic code generation
|
||||
result = pipe("# Function to calculate the factorial of a number\ndef factorial(n):", max_new_tokens=256)
|
||||
print(result[0]['generated_text'])
|
||||
|
||||
# infilling
|
||||
infill_result = pipe("def remove_non_ascii(s: str) -> str:\n \"\"\" <FILL_ME>\n return result", max_new_tokens=200)
|
||||
print(infill_result[0]['generated_text'])
|
||||
pipeline = pipeline(task="text-generation", model="meta-llama/CodeLlama-7b-hf", dtype="auto")
|
||||
pipeline("def fibonacci(n):")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -62,107 +37,24 @@ print(infill_result[0]['generated_text'])
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-hf", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/CodeLlama-7b-hf",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
# basic code generation
|
||||
prompt = "# Function to calculate the factorial of a number\ndef factorial(n):"
|
||||
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(
|
||||
**input_ids,
|
||||
max_new_tokens=256,
|
||||
cache_implementation="static"
|
||||
)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
|
||||
# infilling
|
||||
infill_prompt = "def remove_non_ascii(s: str) -> str:\n \"\"\" <FILL_ME>\n return result"
|
||||
input_ids = tokenizer(infill_prompt, return_tensors="pt").to(model.device)
|
||||
|
||||
filled_output = model.generate(**input_ids, max_new_tokens=200)
|
||||
filled_text = tokenizer.decode(filled_output[0], skip_special_tokens=True)
|
||||
print(filled_text)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "# Function to calculate the factorial of a number\ndef factorial(n):" | transformers run --task text-generation --model meta-llama/CodeLlama-7b-hf --device 0
|
||||
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.
|
||||
|
||||
```py
|
||||
# pip install bitsandbytes
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, CodeLlamaTokenizer, BitsAndBytesConfig
|
||||
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
|
||||
tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-34b-hf")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/CodeLlama-34b-hf",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=bnb_config
|
||||
)
|
||||
|
||||
prompt = "# Write a Python function to check if a string is a palindrome\ndef is_palindrome(s):"
|
||||
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids, max_new_tokens=200, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
|
||||
|
||||
```py
|
||||
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
|
||||
|
||||
visualizer = AttentionMaskVisualizer("meta-llama/CodeLlama-7b-hf")
|
||||
visualizer("""def func(a, b):
|
||||
return a + b""")
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/codellama-attn-mask.png"/>
|
||||
</div>
|
||||
|
||||
## Notes
|
||||
|
||||
- Infilling is only available in the 7B and 13B base models, and not in the Python, Instruct, 34B, or 70B models.
|
||||
- Use the `<FILL_ME>` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself.
|
||||
|
||||
```py
|
||||
from transformers import LlamaForCausalLM, CodeLlamaTokenizer
|
||||
|
||||
tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")
|
||||
model = LlamaForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-hf")
|
||||
PROMPT = '''def remove_non_ascii(s: str) -> str:
|
||||
""" <FILL_ME>
|
||||
return result
|
||||
'''
|
||||
input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"]
|
||||
generated_ids = model.generate(input_ids, max_new_tokens=128)
|
||||
|
||||
filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0]
|
||||
print(PROMPT.replace("<FILL_ME>", filling))
|
||||
```
|
||||
|
||||
- Use `bfloat16` for further training or fine-tuning and `float16` for inference.
|
||||
- The `BOS` character is not used for infilling when encoding the prefix or suffix, but only at the beginning of each prompt.
|
||||
- The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of the word (for example, “Banana”), the tokenizer doesn’t prepend the prefix space to the string.
|
||||
|
||||
|
||||
## CodeLlamaTokenizer
|
||||
|
||||
@ -180,3 +72,4 @@ visualizer("""def func(a, b):
|
||||
- create_token_type_ids_from_sequences
|
||||
- update_post_processor
|
||||
- save_vocabulary
|
||||
|
||||
|
@ -13,61 +13,40 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-03-25 and added to Hugging Face Transformers on 2022-06-24.*
|
||||
*This model was released on 2022-03-25 and added to Hugging Face Transformers on 2022-06-24 and contributed by [rooa](https://huggingface.co/rooa).*
|
||||
|
||||
# CodeGen
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CodeGen](https://huggingface.co/papers/2203.13474) is an autoregressive language model designed for program synthesis through a conversational paradigm. Trained on diverse datasets including The Pile, BigQuery, and BigPython, CodeGen addresses challenges in program synthesis by treating it as a sequence prediction problem where specifications are expressed in natural language. The model demonstrates conversational capabilities and outperforms OpenAI's Codex on the HumanEval benchmark. A multi-turn programming benchmark (MTPB) was developed to evaluate the model's conversational program synthesis abilities.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The CodeGen model was proposed in [A Conversational Paradigm for Program Synthesis](https://huggingface.co/papers/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
CodeGen is an autoregressive language model for program synthesis trained sequentially on [The Pile](https://pile.eleuther.ai/), BigQuery, and BigPython.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI's Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: [this https URL](https://github.com/salesforce/codegen).*
|
||||
|
||||
This model was contributed by [Hiroaki Hayashi](https://huggingface.co/rooa).
|
||||
The original code can be found [here](https://github.com/salesforce/codegen).
|
||||
|
||||
## Checkpoint Naming
|
||||
|
||||
* CodeGen model [checkpoints](https://huggingface.co/models?other=codegen) are available on different pre-training data with variable sizes.
|
||||
* The format is: `Salesforce/codegen-{size}-{data}`, where
|
||||
* `size`: `350M`, `2B`, `6B`, `16B`
|
||||
* `data`:
|
||||
* `nl`: Pre-trained on the Pile
|
||||
* `multi`: Initialized with `nl`, then further pre-trained on multiple programming languages data
|
||||
* `mono`: Initialized with `multi`, then further pre-trained on Python data
|
||||
* For example, `Salesforce/codegen-350M-mono` offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python. The sketch after this list shows how the pieces combine.
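A minimal sketch of how those pieces combine into a checkpoint id (the `size` and `data` values are the ones listed above):

```py
# Build a CodeGen checkpoint id following the Salesforce/codegen-{size}-{data} scheme.
size = "2B"     # one of: 350M, 2B, 6B, 16B
data = "multi"  # one of: nl, multi, mono
checkpoint = f"Salesforce/codegen-{size}-{data}"
print(checkpoint)  # Salesforce/codegen-2B-multi
```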
|
||||
|
||||
## Usage example
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
>>> checkpoint = "Salesforce/codegen-350M-mono"
|
||||
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
|
||||
|
||||
>>> text = "def hello_world():"
|
||||
|
||||
>>> completion = model.generate(**tokenizer(text, return_tensors="pt"))
|
||||
|
||||
>>> print(tokenizer.decode(completion[0]))
|
||||
def hello_world():
|
||||
print("Hello World")
|
||||
|
||||
hello_world()
|
||||
pipeline = pipeline(task="text-generation", model="Salesforce/codegen-350M-mono", dtype="auto")
|
||||
pipeline("def fibonacci(n):")
|
||||
```
|
||||
|
||||
## Resources
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
|
||||
|
||||
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## CodeGenConfig
|
||||
|
||||
@ -93,3 +72,4 @@ hello_world()
|
||||
|
||||
[[autodoc]] CodeGenForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -1,4 +1,5 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
@ -8,122 +9,57 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-03-12 and added to Hugging Face Transformers on 2024-03-15.*
|
||||
*This model was released on 2024-03-12 and added to Hugging Face Transformers on 2024-03-15 and contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [ahmetustun](https://huggingface.co/ahmetustun).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Cohere
|
||||
# Command-R
|
||||
|
||||
Cohere [Command-R](https://cohere.com/blog/command-r) is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens.
|
||||
|
||||
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
|
||||
[Command-R](https://huggingface.co/papers/2310.06664) is a language model engineered for high-throughput, low-latency retrieval-augmented generation (RAG) and tool use at enterprise scale. It supports a 128,000-token context window, enabling it to reason over very long documents or dialogues, and integrates with external APIs/tools to automate multi-step tasks. The model is optimized for production usage (with strong performance per compute), and fine-tuning of Command R is emphasized as a cost-efficient way to specialize it further.
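As a rough sketch of the grounded generation (RAG) workflow the model is trained for, the snippet below passes documents through the chat template. The `documents=` argument and the named `"rag"` chat template follow the Cohere model card; treat them as assumptions and fall back to the default template if your tokenizer doesn't ship them. The document contents here are illustrative placeholders.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")

conversation = [{"role": "user", "content": "What does Command-R support?"}]
# Placeholder in-memory documents for grounding; swap in your own retrieval results.
documents = [
    {"title": "Context window", "text": "Command-R supports a context length of 128K tokens."},
    {"title": "Tool use", "text": "Command-R supports single-step and multi-step tool use."},
]

# `chat_template="rag"` and `documents=` are assumptions based on the Cohere model card.
prompt = tokenizer.apply_chat_template(
    conversation,
    documents=documents,
    chat_template="rag",
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```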
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="CohereLabs/c4ai-command-r-v01", dtype="auto")
pipeline("Plants create energy through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("CohereLabs/c4ai-command-r-v01", dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")

# format message with the Command-R chat template
messages = [{"role": "user", "content": "How do plants make energy?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

</hfoption>
<hfoption id="transformers CLI">

```bash
# pip install -U flash-attn --no-build-isolation
transformers chat CohereLabs/c4ai-command-r-v01 --dtype auto --attn_implementation flash_attention_2
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
|
||||
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
|
||||
model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", dtype=torch.float16, device_map="auto", quantization_config=bnb_config, attn_implementation="sdpa")
|
||||
|
||||
# format message with the Command-R chat template
|
||||
messages = [{"role": "user", "content": "How do plants make energy?"}]
|
||||
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
output = model.generate(
|
||||
input_ids,
|
||||
max_new_tokens=100,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
cache_implementation="static",
|
||||
)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
|
||||
|
||||
```py
|
||||
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
|
||||
|
||||
visualizer = AttentionMaskVisualizer("CohereForAI/c4ai-command-r-v01")
|
||||
visualizer("Plants create energy through a process known as")
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/cohere-attn-mask.png"/>
|
||||
</div>
|
||||
|
||||
## Notes
|
||||
|
||||
- Don't use the `dtype` parameter in [`~AutoModel.from_pretrained`] with FlashAttention-2. It only supports `fp16` or `bf16`. Use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set `fp16` or `bf16` to `True` with [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
|
||||
|
||||
## CohereConfig
|
||||
|
||||
@ -147,3 +83,4 @@ visualizer("Plants create energy through a process known as")
|
||||
|
||||
[[autodoc]] CohereForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -1,4 +1,5 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
@ -8,121 +9,49 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2024-12-13.*
|
||||
*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2024-12-13.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Cohere 2
|
||||
# Command R7B
|
||||
|
||||
[Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k. The model features three layers with sliding window attention and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
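A quick way to see that layout is to inspect the model's configuration. The attribute names below are assumptions, so check what your `transformers` version actually exposes:

```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
# `sliding_window` (local attention width) and `sliding_window_pattern` (how often a
# global-attention layer appears) are assumed attribute names; None means not present.
print(getattr(config, "sliding_window", None))
print(getattr(config, "sliding_window_pattern", None))
print(config.num_hidden_layers)
```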
|
||||
|
||||
This model is optimized for speed, cost-performance, and compute resources.
|
||||
|
||||
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class, and from the command line.
|
||||
[Command R7B](https://cohere.com/blog/command-r7b) is Cohere's smallest model in the R series, optimized for speed, efficiency, and high-quality outputs on commodity GPUs and edge devices. It has 7 billion parameters and is fine-tuned for retrieval-augmented generation (RAG), enabling strong grounding in enterprise data while maintaining low latency. The model is designed to balance cost and performance, making it accessible for real-world applications like search, summarization, and knowledge management. R7B continues the R-series focus on practical deployment, emphasizing scalability and adaptability for business use cases.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="CohereLabs/c4ai-command-r7b-12-2024", dtype="auto")

messages = [
    {"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"},
]
pipeline(messages)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
# pip install -U flash-attn --no-build-isolation
|
||||
transformers chat CohereLabs/c4ai-command-r7b-12-2024 --dtype auto --attn_implementation flash_attention_2
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview.md) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to 4-bits.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
|
||||
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"CohereLabs/c4ai-command-r7b-12-2024",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
quantization_config=bnb_config,
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
# format message with the Command-R chat template
|
||||
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
|
||||
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
output = model.generate(
|
||||
input_ids,
|
||||
max_new_tokens=100,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
cache_implementation="static",
|
||||
)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Cohere2Config
|
||||
@ -138,3 +67,4 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
|
||||
[[autodoc]] Cohere2ForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -15,103 +15,65 @@ rendered properly in your Markdown viewer.
|
||||
-->
|
||||
*This model was released on 2025-07-31 and added to Hugging Face Transformers on 2025-07-31.*
|
||||
|
||||
# Command A Vision
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Overview
|
||||
# Command A Vision
|
||||
|
||||
Command A Vision ([blog post](https://cohere.com/blog/command-a-vision)) is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
|
||||
|
||||
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
|
||||
|
||||
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
|
||||
|
||||
## Usage tips
|
||||
|
||||
The model and image processor can be loaded as follows:
|
||||
[Command A Vision](https://cohere.com/blog/command-a-vision) is a state-of-the-art multimodal generative model optimized for enterprise use, excelling in both visual and text-based tasks. It outperforms leading models like GPT-4.1 and Llama 4 Maverick on benchmarks involving charts, diagrams, documents, and real-world imagery. The model features advanced document OCR with structured JSON outputs, strong scene understanding, and multilingual reasoning across industries such as finance, healthcare, and manufacturing. Designed for secure, efficient deployment, it runs on as little as one H100 or two A100 GPUs, enabling scalable on-premise or private enterprise applications.
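As a sketch of the structured-output claim, the pipeline below asks the model to answer as JSON. The prompt wording is ours and the exact schema the model returns isn't guaranteed:

```py
from transformers import pipeline

pipeline = pipeline(task="image-text-to-text", model="CohereLabs/command-a-vision-07-2025", dtype="auto")
messages = [
    {"role": "user",
     "content": [
        {"type": "image", "url": "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg"},
        {"type": "text", "text": "Describe this image as JSON with the keys 'objects' and 'scene'."},
    ]},
]
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
```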
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="image-text-to-text", model="CohereLabs/command-a-vision-07-2025", dtype="auto")
|
||||
messages = [
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]},
|
||||
]
|
||||
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-vision-07-2025"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto", dtype="auto")

# Format message with the Command-A-Vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg",
            },
            {"type": "text", "text": "what is in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

print(
    processor.tokenizer.decode(
        gen_tokens[0][inputs.input_ids.shape[1] :], skip_special_tokens=True
    )
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(model="CohereLabs/command-a-vision-07-2025", task="image-text-to-text", device_map="auto")
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo=",
|
||||
},
|
||||
{"type": "text", "text": "Where was this taken ?"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
|
||||
print(outputs)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
@ -1,4 +1,5 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
@ -8,50 +9,28 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
-->
|
||||
*This model was released on 2024-06-27 and added to Hugging Face Transformers on 2024-12-17.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
-->
|
||||
*This model was released on 2024-06-27 and added to Hugging Face Transformers on 2024-12-17 and contributed by [tonywu71](https://huggingface.co/tonywu71) and [yonigozlan](https://huggingface.co/yonigozlan).*
|
||||
|
||||
# ColPali
|
||||
|
||||
[ColPali](https://huggingface.co/papers/2407.01449) is a model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColPali treats each page as an image. It uses [Paligemma-3B](./paligemma) to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
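The late interaction step is easy to picture in isolation. The helper below is a minimal sketch of ColBERT-style MaxSim scoring, roughly what [`~ColPaliProcessor.score_retrieval`] applies across batches of embeddings; the function name and shapes are ours:

```py
import torch

def late_interaction_score(query_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim), image_emb: (num_image_patches, dim)
    sim = query_emb @ image_emb.T       # similarity of every query token to every image patch
    return sim.max(dim=1).values.sum()  # best patch per query token, summed over tokens
```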
|
||||
|
||||
This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) (ILLUIN Technology) and [@yonigozlan](https://huggingface.co/yonigozlan) (HuggingFace).
|
||||
|
||||
You can find all the original ColPali checkpoints under Vidore's [Hf-native ColVision Models](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the ColPali models in the right sidebar for more examples of how to use ColPali for image retrieval.
|
||||
[ColPali](https://huggingface.co/papers/2407.01449) is a retrieval model designed for visually rich documents that processes document pages as images rather than relying solely on text. It builds on recent vision-language models to generate high-quality contextualized embeddings that capture both textual and visual information. Using a late interaction matching mechanism, ColPali achieves faster and more accurate document retrieval compared to existing systems. The model is evaluated on the new Visual Document Retrieval Benchmark (ViDoRe), which spans diverse domains, languages, and retrieval settings.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="image retrieval">
|
||||
<hfoption id="ColPaliForRetrieval">
|
||||
|
||||
```py
import requests
import torch
from PIL import Image

from transformers import ColPaliForRetrieval, ColPaliProcessor

# Load the model and the processor
model_name = "vidore/colpali-v1.3-hf"
model = ColPaliForRetrieval.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", "xpu", or "mps" for Apple Silicon
)
processor = ColPaliProcessor.from_pretrained(model_name)
|
||||
|
||||
# The document page screenshots from your corpus
|
||||
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
|
||||
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
|
||||
|
||||
@ -60,103 +39,26 @@ images = [
|
||||
Image.open(requests.get(url2, stream=True).raw),
|
||||
]
|
||||
|
||||
# The queries you want to retrieve documents for
|
||||
queries = [
|
||||
"When was the United States Declaration of Independence proclaimed?",
|
||||
"Who printed the edition of Romeo and Juliet?",
|
||||
]
|
||||
|
||||
# Process the inputs
|
||||
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)
|
||||
|
||||
# Forward pass
|
||||
with torch.no_grad():
|
||||
image_embeddings = model(**inputs_images).embeddings
|
||||
query_embeddings = model(**inputs_text).embeddings
|
||||
|
||||
# Score the queries against the images
|
||||
scores = processor.score_retrieval(query_embeddings, image_embeddings)
|
||||
|
||||
print("Retrieval scores (query x image):")
|
||||
print(scores)
|
||||
```
|
||||
|
||||
If you have issues loading the images with PIL, use the following code to create dummy images:
|
||||
|
||||
```python
|
||||
images = [
|
||||
Image.new("RGB", (128, 128), color="white"),
|
||||
Image.new("RGB", (64, 32), color="black"),
|
||||
]
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4.
|
||||
|
||||
```python
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
|
||||
from transformers import BitsAndBytesConfig, ColPaliForRetrieval, ColPaliProcessor
|
||||
|
||||
|
||||
model_name = "vidore/colpali-v1.3-hf"
|
||||
|
||||
# 4-bit quantization configuration
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_use_double_quant=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.float16,
|
||||
)
|
||||
|
||||
model = ColPaliForRetrieval.from_pretrained(
|
||||
model_name,
|
||||
quantization_config=bnb_config,
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
processor = ColPaliProcessor.from_pretrained(model_name)
|
||||
|
||||
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
|
||||
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
|
||||
|
||||
images = [
|
||||
Image.open(requests.get(url1, stream=True).raw),
|
||||
Image.open(requests.get(url2, stream=True).raw),
|
||||
]
|
||||
|
||||
queries = [
|
||||
"When was the United States Declaration of Independence proclaimed?",
|
||||
"Who printed the edition of Romeo and Juliet?",
|
||||
]
|
||||
|
||||
# Process the inputs
|
||||
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
|
||||
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)
|
||||
|
||||
# Forward pass
|
||||
with torch.no_grad():
|
||||
image_embeddings = model(**inputs_images).embeddings
|
||||
query_embeddings = model(**inputs_text).embeddings
|
||||
|
||||
# Score the queries against the images
|
||||
scores = processor.score_retrieval(query_embeddings, image_embeddings)
|
||||
|
||||
print("Retrieval scores (query x image):")
|
||||
print(scores)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- [`~ColPaliProcessor.score_retrieval`] returns a 2D tensor where the first dimension is the number of queries and the second dimension is the number of images. A higher score indicates more similarity between the query and image. The snippet below shows how to pick the top match per query.
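A small illustration with the variable names from the example above:

```py
# scores has shape (num_queries, num_images); higher means more similar.
best_image_per_query = scores.argmax(dim=1)
print(best_image_per_query)  # index of the best-matching page for each query
```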
|
||||
|
||||
## ColPaliConfig
|
||||
|
||||
[[autodoc]] ColPaliConfig
|
||||
|
@ -13,49 +13,24 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-06-27 and added to Hugging Face Transformers on 2025-06-02.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2024-06-27 and added to Hugging Face Transformers on 2025-06-02 and contributed by [tonywu71](https://huggingface.co/tonywu71) and [yonigozlan](https://huggingface.co/yonigozlan).*
|
||||
|
||||
# ColQwen2
|
||||
|
||||
[ColQwen2](https://huggingface.co/papers/2407.01449) is a variant of the [ColPali](./colpali) model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the [Qwen2-VL](./qwen2_vl) backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
|
||||
|
||||
This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) (ILLUIN Technology) and [@yonigozlan](https://huggingface.co/yonigozlan) (HuggingFace).
|
||||
|
||||
You can find all the original ColPali checkpoints under Vidore's [Hf-native ColVision Models](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the ColQwen2 models in the right sidebar for more examples of how to use ColQwen2 for image retrieval.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="image retrieval">
|
||||
<hfoption id="ColQwen2ForRetrieval">
|
||||
|
||||
```python
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available

# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"
model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", "xpu" or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
processor = ColQwen2Processor.from_pretrained(model_name)
|
||||
|
||||
# The document page screenshots from your corpus
|
||||
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
|
||||
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
|
||||
|
||||
@ -64,106 +39,26 @@ images = [
|
||||
Image.open(requests.get(url2, stream=True).raw),
|
||||
]
|
||||
|
||||
# The queries you want to retrieve documents for
|
||||
queries = [
|
||||
"When was the United States Declaration of Independence proclaimed?",
|
||||
"Who printed the edition of Romeo and Juliet?",
|
||||
]
|
||||
|
||||
# Process the inputs
|
||||
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)
|
||||
|
||||
# Forward pass
|
||||
with torch.no_grad():
|
||||
image_embeddings = model(**inputs_images).embeddings
|
||||
query_embeddings = model(**inputs_text).embeddings
|
||||
|
||||
# Score the queries against the images
|
||||
scores = processor.score_retrieval(query_embeddings, image_embeddings)
|
||||
|
||||
print("Retrieval scores (query x image):")
|
||||
print(scores)
|
||||
```
|
||||
|
||||
If you have issues loading the images with PIL, use the following code to create dummy images:
|
||||
|
||||
```python
|
||||
images = [
|
||||
Image.new("RGB", (128, 128), color="white"),
|
||||
Image.new("RGB", (64, 32), color="black"),
|
||||
]
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4.
|
||||
|
||||
```python
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
|
||||
from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor
|
||||
from accelerate import Accelerator
|
||||
|
||||
model_name = "vidore/colqwen2-v1.0-hf"
|
||||
device = Accelerator().device
|
||||
|
||||
# 4-bit quantization configuration
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_use_double_quant=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.float16,
|
||||
)
|
||||
|
||||
model = ColQwen2ForRetrieval.from_pretrained(
|
||||
model_name,
|
||||
quantization_config=bnb_config,
|
||||
device_map=device,
|
||||
).eval()
|
||||
|
||||
processor = ColQwen2Processor.from_pretrained(model_name)
|
||||
|
||||
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
|
||||
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
|
||||
|
||||
images = [
|
||||
Image.open(requests.get(url1, stream=True).raw),
|
||||
Image.open(requests.get(url2, stream=True).raw),
|
||||
]
|
||||
|
||||
queries = [
|
||||
"When was the United States Declaration of Independence proclaimed?",
|
||||
"Who printed the edition of Romeo and Juliet?",
|
||||
]
|
||||
|
||||
# Process the inputs
|
||||
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
|
||||
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)
|
||||
|
||||
# Forward pass
|
||||
with torch.no_grad():
|
||||
image_embeddings = model(**inputs_images).embeddings
|
||||
query_embeddings = model(**inputs_text).embeddings
|
||||
|
||||
# Score the queries against the images
|
||||
scores = processor.score_retrieval(query_embeddings, image_embeddings)
|
||||
|
||||
print("Retrieval scores (query x image):")
|
||||
print(scores)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- [`~ColQwen2Processor.score_retrieval`] returns a 2D tensor where the first dimension is the number of queries and the second dimension is the number of images. A higher score indicates more similarity between the query and image.
|
||||
- Unlike ColPali, ColQwen2 supports arbitrary image resolutions and aspect ratios, which means images are not resized into fixed-size squares. This preserves more of the original input signal.
|
||||
- Larger input images generate longer multi-vector embeddings, allowing users to adjust image resolution to balance performance and memory usage. A sketch of capping the resolution follows this list.
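A minimal sketch of capping the resolution through the processor. The `max_pixels` attribute comes from the underlying Qwen2-VL image processor, so treat it as an assumption and inspect your processor before relying on it:

```py
from transformers import ColQwen2Processor

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0-hf")
# Guarded because the attribute name is an assumption about the wrapped image processor.
if hasattr(processor.image_processor, "max_pixels"):
    processor.image_processor.max_pixels = 768 * 768  # lower cap -> shorter embeddings, less memory
```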
|
||||
|
||||
## ColQwen2Config
|
||||
|
||||
[[autodoc]] ColQwen2Config
|
||||
|
@@ -13,33 +13,54 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-08-13 and added to Hugging Face Transformers on 2022-09-22.*
|
||||
*This model was released on 2021-08-13 and added to Hugging Face Transformers on 2022-09-22 and contributed by [DepuMeng](https://huggingface.co/DepuMeng).*
|
||||
|
||||
# Conditional DETR
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Conditional DETR](https://huggingface.co/papers/2108.06152) addresses slow training convergence in DETR by introducing a conditional cross-attention mechanism. This mechanism allows the decoder to learn a conditional spatial query, enabling each cross-attention head to focus on distinct regions such as object extremities or internal regions. This approach reduces reliance on high-quality content embeddings, simplifying training and achieving up to 10× faster convergence for stronger backbones.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The Conditional DETR model was proposed in [Conditional DETR for Fast Training Convergence](https://huggingface.co/papers/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. Conditional DETR presents a conditional cross-attention mechanism for fast DETR training. Conditional DETR converges 6.7× to 10× faster than DETR.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="object-detection", model="microsoft/conditional-detr-resnet-50", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
*The recently-developed DETR approach applies the transformer encoder and decoder architecture to object detection and achieves promising performance. In this paper, we handle the critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by that the cross-attention in DETR relies highly on the content embeddings for localizing the four extremities and predicting the box, which increases the need for high-quality content embeddings and thus the training difficulty. Our approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention. The benefit is that through the conditional spatial query, each cross-attention head is able to attend to a band containing a distinct region, e.g., one object extremity or a region inside the object box. This narrows down the spatial range for localizing the distinct regions for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training. Empirical results show that conditional DETR converges 6.7× faster for the backbones R50 and R101 and 10× faster for stronger backbones DC5-R50 and DC5-R101. Code is available at https://github.com/Atten4Vis/ConditionalDETR.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/conditional_detr_curve.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForObjectDetection
|
||||
|
||||
<small> Conditional DETR shows much faster convergence compared to the original DETR. Taken from the <a href="https://huggingface.co/papers/2108.06152">original paper</a>.</small>
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The original code can be found [here](https://github.com/Atten4Vis/ConditionalDETR).
|
||||
image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")
|
||||
model = AutoModelForObjectDetection.from_pretrained("microsoft/conditional-detr-resnet-50", dtype="auto")
|
||||
|
||||
## Resources
|
||||
inputs = image_processor(images=image, return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
target_sizes = torch.tensor([image.size[::-1]])
|
||||
results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
|
||||
0
|
||||
]
|
||||
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
|
||||
box = [round(i, 2) for i in box.tolist()]
|
||||
print(
|
||||
f"Detected {model.config.id2label[label.item()]} with confidence "
|
||||
f"{round(score.item(), 3)} at location {box}"
|
||||
)
|
||||
```
|
||||
|
||||
- Scripts for finetuning [`ConditionalDetrForObjectDetection`] with [`Trainer`] or [Accelerate](https://huggingface.co/docs/accelerate/index) can be found [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/object-detection).
|
||||
- See also: [Object detection task guide](../tasks/object_detection).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ConditionalDetrConfig
|
||||
|
||||
@@ -49,6 +70,10 @@ This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The o
|
||||
|
||||
[[autodoc]] ConditionalDetrImageProcessor
|
||||
- preprocess
|
||||
- post_process_object_detection
|
||||
- post_process_instance_segmentation
|
||||
- post_process_semantic_segmentation
|
||||
- post_process_panoptic_segmentation
|
||||
|
||||
## ConditionalDetrImageProcessorFast
|
||||
|
||||
@@ -73,3 +98,4 @@ This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The o
|
||||
|
||||
[[autodoc]] ConditionalDetrForSegmentation
|
||||
- forward
|
||||
|
||||
|
@@ -13,48 +13,46 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-08-06 and added to Hugging Face Transformers on 2021-01-27.*
|
||||
*This model was released on 2020-08-06 and added to Hugging Face Transformers on 2021-01-27 and contributed by [abhishek](https://huggingface.co/abhishek).*
|
||||
|
||||
# ConvBERT
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://huggingface.co/papers/2008.02496) proposes a novel span-based dynamic convolution to enhance BERT by replacing some self-attention heads with convolution heads, forming a mixed attention block. This design improves efficiency in learning both global and local contexts. ConvBERT outperforms BERT and its variants in various tasks, achieving an 86.4 GLUE score with less training cost and fewer parameters.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The ConvBERT model was proposed in [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://huggingface.co/papers/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
|
||||
Yan.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="fill-mask", model="YituTech/conv-bert-base", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
*Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
|
||||
natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
|
||||
large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
|
||||
generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
|
||||
which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
|
||||
replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
|
||||
rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
|
||||
learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
|
||||
ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
|
||||
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
|
||||
using less than 1/4 training cost. Code and pre-trained models will be released.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
This model was contributed by [abhishek](https://huggingface.co/abhishek). The original implementation can be found
|
||||
here: https://github.com/yitu-opensource/ConvBert
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
## Usage tips
|
||||
model = AutoModelForMaskedLM.from_pretrained("YituTech/conv-bert-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("YituTech/conv-bert-base")
|
||||
|
||||
ConvBERT training tips are similar to those of BERT. For usage tips refer to [BERT documentation](bert).
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
- [Question answering task guide](../tasks/question_answering)
|
||||
- [Masked language modeling task guide](../tasks/masked_language_modeling)
|
||||
- [Multiple choice task guide](../tasks/multiple_choice)
|
||||
|
||||
## ConvBertConfig
|
||||
|
||||
[[autodoc]] ConvBertConfig
|
||||
@@ -100,3 +98,4 @@ ConvBERT training tips are similar to those of BERT. For usage tips refer to [BE
|
||||
|
||||
[[autodoc]] ConvBertForQuestionAnswering
|
||||
- forward
|
||||
|
||||
|
@@ -13,47 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-01-10 and added to Hugging Face Transformers on 2022-02-07.*
|
||||
*This model was released on 2022-01-10 and added to Hugging Face Transformers on 2022-02-07 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# ConvNeXT
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[ConvNeXT](https://huggingface.co/papers/2201.03545) reexamines the design spaces of ConvNets and explores the potential of pure ConvNet architectures inspired by Vision Transformers. By modernizing a standard ResNet, the model identifies key components that enhance performance. ConvNeXT achieves competitive accuracy and scalability, reaching 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while retaining the simplicity and efficiency of traditional ConvNets.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The ConvNeXT model was proposed in [A ConvNet for the 2020s](https://huggingface.co/papers/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
|
||||
ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="image-classification", model="facebook/convnext-tiny-224", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
*The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model.
|
||||
A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers
|
||||
(e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide
|
||||
variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive
|
||||
biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design
|
||||
of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models
|
||||
dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy
|
||||
and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnext_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
|
||||
<small> ConvNeXT architecture. Taken from the <a href="https://huggingface.co/papers/2201.03545">original paper</a>.</small>
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/ConvNeXt).
|
||||
image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
|
||||
model = AutoModelForImageClassification.from_pretrained("facebook/convnext-tiny-224", dtype="auto")
|
||||
|
||||
## Resources
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ConvNeXT.
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
<PipelineTag pipeline="image-classification"/>
|
||||
predicted_label = logits.argmax(-1).item()
|
||||
print(model.config.id2label[predicted_label])
|
||||
```
|
||||
|
||||
- [`ConvNextForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
- See also: [Image classification task guide](../tasks/image_classification)
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ConvNextConfig
|
||||
|
||||
@@ -78,3 +80,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] ConvNextForImageClassification
|
||||
- forward
|
||||
|
||||
|
@@ -13,39 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2023-01-02 and added to Hugging Face Transformers on 2023-03-14.*
|
||||
*This model was released on 2023-01-02 and added to Hugging Face Transformers on 2023-03-14 and contributed by [adirik](https://huggingface.co/adirik).*
|
||||
|
||||
# ConvNeXt V2
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[ConvNeXt V2](https://huggingface.co/papers/2301.00808) is a fully convolutional model inspired by Vision Transformers and built upon ConvNeXt. It integrates a novel Global Response Normalization (GRN) layer to enhance inter-channel feature competition and a fully convolutional masked autoencoder framework. This co-design improves performance on various recognition tasks, including ImageNet classification, COCO detection, and ADE20K segmentation. Pre-trained ConvNeXt V2 models range from an efficient 3.7M-parameter Atto model achieving 76.7% top-1 accuracy on ImageNet to a 650M Huge model with 88.9% accuracy using only public training data.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The ConvNeXt V2 model was proposed in [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://huggingface.co/papers/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie.
|
||||
ConvNeXt V2 is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, and a successor of [ConvNeXT](convnext).
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="image-classification", model="facebook/convnextv2-tiny-1k-224", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
*Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnextv2_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
|
||||
<small> ConvNeXt V2 architecture. Taken from the <a href="https://huggingface.co/papers/2301.00808">original paper</a>.</small>
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/facebookresearch/ConvNeXt-V2).
|
||||
image_processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-1k-224")
|
||||
model = AutoModelForImageClassification.from_pretrained("facebook/convnextv2-tiny-1k-224", dtype="auto")
|
||||
|
||||
## Resources
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ConvNeXt V2.
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
<PipelineTag pipeline="image-classification"/>
|
||||
predicted_label = logits.argmax(-1).item()
|
||||
print(model.config.id2label[predicted_label])
|
||||
```
|
||||
|
||||
- [`ConvNextV2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ConvNextV2Config
|
||||
|
||||
@@ -60,3 +70,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] ConvNextV2ForImageClassification
|
||||
- forward
|
||||
|
||||
|
@@ -13,41 +13,41 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-12-01 and added to Hugging Face Transformers on 2021-04-10.*
|
||||
*This model was released on 2020-12-01 and added to Hugging Face Transformers on 2021-04-10 and contributed by [canwenxu](https://huggingface.co/canwenxu).*
|
||||
|
||||
# CPM
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface.co/papers/2012.00413) is the largest Chinese pre-trained language model with 2.6 billion parameters and 100GB of Chinese training data. It facilitates various downstream NLP tasks including conversation, essay generation, cloze test, and language understanding. Extensive experiments show that CPM performs strongly in few-shot and zero-shot learning settings. Its architecture mirrors GPT-2, with the primary difference being the tokenization method.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface.co/papers/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
|
||||
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
|
||||
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="text-generation", model="TsinghuaAI/CPM-Generate", dtype="auto")
|
||||
pipeline("植物通过光合作用产生能量。")
|
||||
```
|
||||
|
||||
*Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3,
|
||||
with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even
|
||||
zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus
|
||||
of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the
|
||||
Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best
|
||||
of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained
|
||||
language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation,
|
||||
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
|
||||
NLP tasks in the settings of few-shot (even zero-shot) learning.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
|
||||
here: https://github.com/TsinghuaAI/CPM-Generate
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
<Tip>
|
||||
model = AutoModelForCausalLM.from_pretrained("TsinghuaAI/CPM-Generate", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
|
||||
|
||||
CPM's architecture is the same as GPT-2, except for tokenization method. Refer to [GPT-2 documentation](gpt2) for
|
||||
API reference information.
|
||||
inputs = tokenizer("植物通过光合作用产生能量。", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50, do_sample=True)
|
||||
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
||||
print(generated_text)
|
||||
```
|
||||
|
||||
</Tip>
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## CpmTokenizer
|
||||
|
||||
@@ -56,3 +56,4 @@ API reference information.
|
||||
## CpmTokenizerFast
|
||||
|
||||
[[autodoc]] CpmTokenizerFast
|
||||
|
||||
|
@@ -13,23 +13,41 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-09-16 and added to Hugging Face Transformers on 2023-04-12.*
|
||||
*This model was released on 2022-09-16 and added to Hugging Face Transformers on 2023-04-12 and contributed by [openbmb](https://huggingface.co/openbmb).*
|
||||
|
||||
# CPMAnt
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CPM-Ant](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live) is developed from CPM-Live, an open-source framework for training and serving large language models. It supports distributed training across multiple GPUs and nodes with model, data, and pipeline parallelism, enabling efficient scaling to billions of parameters. The framework provides features like dynamic micro-batching, mixed precision training, and checkpointing for fault tolerance. It also includes APIs for interactive inference, making it practical for both research and real-world deployment of large Transformer-based models.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
CPM-Ant is an open-source Chinese pre-trained language model (PLM) with 10B parameters. It is also the first milestone of the live training process of CPM-Live. The training process is cost-effective and environment-friendly. CPM-Ant also achieves promising results with delta tuning on the CUGE benchmark. Besides the full model, we also provide various compressed versions to meet the requirements of different hardware configurations. [See more](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live)
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The original code can be found [here](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live).
|
||||
pipeline = pipeline(task="text-generation", model="openbmb/cpm-ant-10b", dtype="auto")
|
||||
pipeline("植物通过光合作用产生能量。")
|
||||
```
|
||||
|
||||
## Resources
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
- A tutorial on [CPM-Live](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live).
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("openbmb/cpm-ant-10b", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("openbmb/cpm-ant-10b")
|
||||
|
||||
inputs = tokenizer("植物通过光合作用产生能量。", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50, do_sample=True)
|
||||
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
||||
print(generated_text)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## CpmAntConfig
|
||||
|
||||
@@ -45,8 +63,8 @@ This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The ori
|
||||
|
||||
[[autodoc]] CpmAntModel
|
||||
- all
|
||||
|
||||
|
||||
## CpmAntForCausalLM
|
||||
|
||||
[[autodoc]] CpmAntForCausalLM
|
||||
- all
|
||||
- all
|
@@ -13,342 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2025-02-27 and added to Hugging Face Transformers on 2025-05-07.*
|
||||
*This model was released on 2025-02-27 and added to Hugging Face Transformers on 2025-05-07 and contributed by [eustlb](https://huggingface.co/eustlb).*
|
||||
|
||||
# Csm
|
||||
# CSM
|
||||
|
||||
## Overview
|
||||
[CSM](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) is an end-to-end multimodal transformer system that generates contextually appropriate, high-fidelity speech by interleaving text and audio tokens. It operates directly on Residual Vector Quantization (RVQ) audio tokens and splits processing into two transformers: a large multimodal backbone that predicts the zeroth codebook and a lightweight audio decoder that handles the remaining codebooks for real-time generation. This structure allows CSM to capture conversational context while maintaining low latency. To train efficiently, it uses a compute amortization technique that trains the audio decoder on only a small random subset of frames, preserving quality while dramatically reducing memory and compute costs.
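
The snippet below is a conceptual sketch of that amortization idea, not the actual CSM training code; the frame count and sampling ratio are arbitrary illustrations.

```python
import torch

num_frames = 512  # audio frames in a training batch
sampled = torch.randperm(num_frames)[: num_frames // 16]  # small random subset of frames

# The backbone still processes every frame; only the frames in `sampled`
# would contribute to the audio decoder's loss.
print(f"Training the audio decoder on {sampled.numel()} of {num_frames} frames")
```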
|
||||
|
||||
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model [released by Sesame](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice). It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
**Model Architecture:**
|
||||
CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model [Mimi](./mimi), introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The original csm-1b checkpoint is available under the [Sesame](https://huggingface.co/sesame/csm-1b) organization on Hugging Face.
|
||||
pipeline = pipeline(task="text-to-audio", model="sesame/csm-1b", dtype="auto")
|
||||
output = pipeline("Plants create energy through a process known as photosynthesis.")
|
||||
audio = output["audio"]
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/csm_architecture.png"/>
|
||||
</div>
|
||||
</hfoption>
|
||||
<hfoption id="CsmForConditionalGeneration">
|
||||
|
||||
## Usage Tips
|
||||
|
||||
### Without Conversational Context
|
||||
|
||||
CSM can be used to simply generate speech from a text prompt:
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import CsmForConditionalGeneration, AutoProcessor
|
||||
from accelerate import Accelerator
|
||||
|
||||
model_id = "sesame/csm-1b"
|
||||
device = Accelerator().device
|
||||
processor = AutoProcessor.from_pretrained("sesame/csm-1b")
|
||||
model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b", dtype="auto")
|
||||
|
||||
# load the model and the processor
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
|
||||
|
||||
# prepare the inputs
|
||||
text = "[0]The past is just a story we tell ourselves." # `[0]` for speaker id 0
|
||||
inputs = processor(text, add_special_tokens=True).to(device)
|
||||
|
||||
# another equivalent way to prepare the inputs
|
||||
conversation = [
|
||||
{"role": "0", "content": [{"type": "text", "text": "The past is just a story we tell ourselves."}]},
|
||||
{"role": "0", "content": [{"type": "text", "text": "Plants generate energy through a process known as photosynthesis."}]},
|
||||
]
|
||||
inputs = processor.apply_chat_template(
|
||||
conversation,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
).to(model.device)
|
||||
)
|
||||
|
||||
# infer the model
|
||||
audio = model.generate(**inputs, output_audio=True)
|
||||
processor.save_audio(audio, "example_without_context.wav")
|
||||
```
|
||||
|
||||
### With Conversational Context
|
||||
|
||||
CSM can be used to generate speech given a conversation, allowing consistency in the voices and content-aware generation:
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import CsmForConditionalGeneration, AutoProcessor
|
||||
from accelerate import Accelerator
|
||||
from datasets import load_dataset, Audio
|
||||
|
||||
model_id = "sesame/csm-1b"
|
||||
device = Accelerator().device
|
||||
|
||||
# load the model and the processor
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
|
||||
|
||||
# prepare the inputs
|
||||
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
|
||||
# ensure the audio is 24kHz
|
||||
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
|
||||
conversation = []
|
||||
|
||||
# 1. context
|
||||
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
|
||||
conversation.append(
|
||||
{
|
||||
"role": f"{speaker_id}",
|
||||
"content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
|
||||
}
|
||||
)
|
||||
|
||||
# 2. text prompt
|
||||
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
conversation,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
).to(model.device)
|
||||
|
||||
# infer the model
|
||||
audio = model.generate(**inputs, output_audio=True)
|
||||
processor.save_audio(audio, "example_with_context.wav")
|
||||
```
|
||||
|
||||
### Batched Inference
|
||||
|
||||
CSM supports batched inference!
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import CsmForConditionalGeneration, AutoProcessor
|
||||
from accelerate import Accelerator
|
||||
from datasets import load_dataset, Audio
|
||||
|
||||
model_id = "sesame/csm-1b"
|
||||
device = Accelerator().device
|
||||
|
||||
# load the model and the processor
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
|
||||
|
||||
# prepare the inputs
|
||||
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
|
||||
# ensure the audio is 24kHz
|
||||
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
|
||||
# here a batch with two prompts
|
||||
conversation = [
|
||||
[
|
||||
{
|
||||
"role": f"{ds[0]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[0]["text"]},
|
||||
{"type": "audio", "path": ds[0]["audio"]["array"]},
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": f"{ds[1]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[1]["text"]},
|
||||
],
|
||||
},
|
||||
],
|
||||
[
|
||||
{
|
||||
"role": f"{ds[0]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[0]["text"]},
|
||||
],
|
||||
}
|
||||
],
|
||||
]
|
||||
inputs = processor.apply_chat_template(
|
||||
conversation,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
).to(model.device)
|
||||
|
||||
audio = model.generate(**inputs, output_audio=True)
|
||||
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
|
||||
```
|
||||
|
||||
### Making The Model Go Brrr
|
||||
|
||||
CSM supports full-graph compilation with CUDA graphs!
|
||||
|
||||
```python
|
||||
import torch
|
||||
import copy
|
||||
from transformers import CsmForConditionalGeneration, AutoProcessor
|
||||
from datasets import load_dataset
|
||||
|
||||
model_id = "sesame/csm-1b"
|
||||
device = "cuda"
|
||||
|
||||
# set logs to ensure no recompilation and graph breaks
|
||||
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)
|
||||
|
||||
# load the model and the processor
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
|
||||
|
||||
# use static cache, automatically enabling torch compile with fullgraph and reduce-overhead
|
||||
model.generation_config.max_length = 250 # big enough to avoid recompilation
|
||||
model.generation_config.max_new_tokens = None # would take precedence over max_length
|
||||
model.generation_config.cache_implementation = "static"
|
||||
model.depth_decoder.generation_config.cache_implementation = "static"
|
||||
|
||||
# generation kwargs
|
||||
gen_kwargs = {
|
||||
"do_sample": False,
|
||||
"depth_decoder_do_sample": False,
|
||||
"temperature": 1.0,
|
||||
"depth_decoder_temperature": 1.0,
|
||||
}
|
||||
|
||||
# Define a timing context manager
|
||||
class TimerContext:
|
||||
def __init__(self, name="Execution"):
|
||||
self.name = name
|
||||
self.start_event = None
|
||||
self.end_event = None
|
||||
|
||||
def __enter__(self):
|
||||
# Use CUDA events for more accurate GPU timing
|
||||
self.start_event = torch.cuda.Event(enable_timing=True)
|
||||
self.end_event = torch.cuda.Event(enable_timing=True)
|
||||
self.start_event.record()
|
||||
return self
|
||||
|
||||
def __exit__(self, *args):
|
||||
self.end_event.record()
|
||||
torch.cuda.synchronize()
|
||||
elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
|
||||
print(f"{self.name} time: {elapsed_time:.4f} seconds")
|
||||
|
||||
# prepare the inputs
|
||||
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
|
||||
|
||||
conversation = [
|
||||
{
|
||||
"role": f"{ds[0]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[0]["text"]},
|
||||
{"type": "audio", "path": ds[0]["audio"]["array"]},
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": f"{ds[1]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[1]["text"]},
|
||||
{"type": "audio", "path": ds[1]["audio"]["array"]},
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": f"{ds[2]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[2]["text"]},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
padded_inputs_1 = processor.apply_chat_template(
|
||||
conversation,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
).to(model.device)
|
||||
|
||||
print("\n" + "="*50)
|
||||
print("First generation - compiling and recording CUDA graphs...")
|
||||
with TimerContext("First generation"):
|
||||
_ = model.generate(**padded_inputs_1, **gen_kwargs)
|
||||
print("="*50)
|
||||
|
||||
print("\n" + "="*50)
|
||||
print("Second generation - fast !!!")
|
||||
with TimerContext("Second generation"):
|
||||
_ = model.generate(**padded_inputs_1, **gen_kwargs)
|
||||
print("="*50)
|
||||
|
||||
# now with different inputs
|
||||
conversation = [
|
||||
{
|
||||
"role": f"{ds[0]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[2]["text"]},
|
||||
{"type": "audio", "path": ds[2]["audio"]["array"]},
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": f"{ds[1]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[3]["text"]},
|
||||
{"type": "audio", "path": ds[3]["audio"]["array"]},
|
||||
],
|
||||
},
|
||||
{
|
||||
"role": f"{ds[2]['speaker_id']}",
|
||||
"content": [
|
||||
{"type": "text", "text": ds[4]["text"]},
|
||||
],
|
||||
},
|
||||
]
|
||||
padded_inputs_2 = processor.apply_chat_template(
|
||||
conversation,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
).to(model.device)
|
||||
|
||||
print("\n" + "="*50)
|
||||
print("Generation with other inputs!")
|
||||
with TimerContext("Generation with different inputs"):
|
||||
_ = model.generate(**padded_inputs_2, **gen_kwargs)
|
||||
print("="*50)
|
||||
```
|
||||
|
||||
### Training
|
||||
|
||||
CSM Transformers integration supports training!
|
||||
|
||||
```python
|
||||
from transformers import CsmForConditionalGeneration, AutoProcessor
|
||||
from accelerate import Accelerator
|
||||
from datasets import load_dataset, Audio
|
||||
|
||||
model_id = "sesame/csm-1b"
|
||||
device = Accelerator().device
|
||||
|
||||
# load the model and the processor
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
|
||||
model.train()
|
||||
model.codec_model.eval()
|
||||
|
||||
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
|
||||
# ensure the audio is 24kHz
|
||||
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
|
||||
conversation = []
|
||||
|
||||
# context
|
||||
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
|
||||
conversation.append(
|
||||
{
|
||||
"role": f"{speaker_id}",
|
||||
"content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
|
||||
}
|
||||
)
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
conversation,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
output_labels=True,
|
||||
).to(model.device)
|
||||
|
||||
out = model(**inputs)
|
||||
out.loss.backward()
|
||||
```
|
||||
|
||||
This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
|
||||
The original code can be found [here](https://github.com/SesameAILabs/csm).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## CsmConfig
|
||||
|
||||
@@ -360,10 +67,6 @@ The original code can be found [here](https://github.com/SesameAILabs/csm).
|
||||
|
||||
## CsmProcessor
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/fig1.jpg"/>
|
||||
</div>
|
||||
|
||||
[[autodoc]] CsmProcessor
|
||||
- __call__
|
||||
|
||||
|
@@ -13,52 +13,47 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-09-11 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2019-09-11 and added to Hugging Face Transformers on 2020-11-16 and contributed by [keskarnitishr](https://huggingface.co/keskarnitishr).*
|
||||
|
||||
# CTRL
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CTRL](https://huggingface.co/papers/1909.05858) is a 1.63 billion-parameter conditional transformer language model designed to generate text based on control codes. These codes guide the style, content, and task-specific behavior of the generated text, leveraging unsupervised learning while offering explicit control. The model can also predict the most likely data sources for a given sequence, enabling model-based source attribution.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://huggingface.co/papers/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
|
||||
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
|
||||
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="text-classification", model="salesforce/ctrl", dtype="auto")
|
||||
pipeline("Plants are amazing because they can create energy from the sun.")
|
||||
```
|
||||
|
||||
*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
|
||||
aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
|
||||
trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
|
||||
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
|
||||
providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
|
||||
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
|
||||
via model-based source attribution.*
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
|
||||
[here](https://github.com/salesforce/ctrl).
|
||||
```py
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
|
||||
model = AutoModelForSequenceClassification.from_pretrained("Salesforce/ctrl", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
|
||||
|
||||
inputs = tokenizer("Plants are amazing because they can create energy from the sun.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
predicted_class_id = outputs.logits.argmax(dim=-1).item()
|
||||
label = model.config.id2label[predicted_class_id]
|
||||
print(f"Predicted label: {label}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Usage tips
|
||||
|
||||
- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
|
||||
or links to generate coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for
|
||||
more information.
|
||||
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
||||
the left.
|
||||
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
|
||||
token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be
|
||||
observed in the *run_generation.py* example script.
|
||||
- The PyTorch models can take the `past_key_values` as input, which is the previously computed key/value attention pairs.
|
||||
Using the `past_key_values` value prevents the model from re-computing
|
||||
pre-computed values in the context of text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward)
|
||||
method for more information on the usage of this argument.
|
||||
|
||||
## Resources
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
- CTRL uses control codes to generate text. Start generations with specific words, sentences, or links to generate coherent text. Check the original implementation for details.
- Pad inputs on the right. CTRL uses absolute position embeddings.
- PyTorch models accept `past_key_values` as input. These are previously computed key/value attention pairs. Using `past_key_values` prevents re-computing pre-computed values during text generation, as in the sketch after this list. See the [`~CTRLModel.forward`] method for usage details.

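A minimal generation sketch tying these tips together; it assumes the `Links` control code from the training corpus and reuses cached key/value pairs while decoding.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
model = AutoModelForCausalLM.from_pretrained("Salesforce/ctrl", dtype="auto")

# start the prompt with a control code so CTRL conditions style and content on it
inputs = tokenizer("Links Plants create energy through a process known as photosynthesis.", return_tensors="pt")

# use_cache=True feeds past_key_values back into the model instead of recomputing them at each step
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
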
## CTRLConfig
|
||||
|
||||
@@ -83,3 +78,4 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
|
||||
|
||||
[[autodoc]] CTRLForSequenceClassification
|
||||
- forward
|
||||
|
||||
|
@@ -13,26 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-03-29 and added to Hugging Face Transformers on 2022-05-18.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2021-03-29 and added to Hugging Face Transformers on 2022-05-18 and contributed by [anugunj](https://huggingface.co/anugunj).*
|
||||
|
||||
# Convolutional Vision Transformer (CvT)
|
||||
|
||||
[Convolutional Vision Transformer (CvT)](https://huggingface.co/papers/2103.15808) is a model that combines the strengths of convolutional neural networks (CNNs) and Vision transformers for the computer vision tasks. It introduces convolutional layers into the vision transformer architecture, allowing it to capture local patterns in images while maintaining the global context provided by self-attention mechanisms.
|
||||
|
||||
You can find all the CvT checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=cvt) organization.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [anugunj](https://huggingface.co/anugunj).
|
||||
>
|
||||
> Click on the CvT models in the right sidebar for more examples of how to apply CvT to different computer vision tasks.
|
||||
|
||||
The example below demonstrates how to classify an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Convolutional vision Transformer (CvT)](https://huggingface.co/papers/2103.15808) enhances Vision Transformer (ViT) through the integration of convolutions, combining the strengths of both architectures. Key modifications include a hierarchical Transformer with a convolutional token embedding and a convolutional Transformer block with a convolutional projection. These enhancements introduce CNN properties like shift, scale, and distortion invariance while retaining Transformer benefits such as dynamic attention and global context. CvT achieves state-of-the-art performance on ImageNet-1k with fewer parameters and lower FLOPs, even when pretrained on larger datasets like ImageNet-22k. Notably, positional encoding can be omitted in CvT, simplifying the design for high-resolution vision tasks.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@@ -41,51 +26,37 @@ The example below demonstrates how to classify an image with [`Pipeline`] or the
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="image-classification",
|
||||
model="microsoft/cvt-13",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline = pipeline(task="image-classification", model="microsoft/cvt-13", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoModelForImageClassification, AutoImageProcessor
|
||||
|
||||
image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
|
||||
model = AutoModelForImageClassification.from_pretrained(
|
||||
"microsoft/cvt-13",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
inputs = image_processor(image, return_tensors="pt").to(model.device)
|
||||
|
||||
image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
|
||||
model = AutoModelForImageClassification.from_pretrained("microsoft/cvt-13", dtype="auto")
|
||||
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
predicted_class_id = logits.argmax(dim=-1).item()
|
||||
logits = model(**inputs).logits
|
||||
|
||||
class_labels = model.config.id2label
|
||||
predicted_class_label = class_labels[predicted_class_id]
|
||||
print(f"The predicted class label is: {predicted_class_label}")
|
||||
predicted_label = logits.argmax(-1).item()
|
||||
print(model.config.id2label[predicted_label])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Resources
|
||||
|
||||
Refer to this set of ViT [notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) for examples of inference and fine-tuning on custom datasets. Replace [`ViTFeatureExtractor`] and [`ViTForImageClassification`] in these notebooks with [`AutoImageProcessor`] and [`CvtForImageClassification`].
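
As a minimal sketch of that swap (training setup omitted), load the CvT counterparts in place of the ViT classes used in those notebooks:

```py
from transformers import AutoImageProcessor, CvtForImageClassification

# drop-in replacements for ViTFeatureExtractor / ViTForImageClassification
image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")
```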
|
||||
|
||||
## CvtConfig
|
||||
|
||||
[[autodoc]] CvtConfig
|
||||
@@ -99,3 +70,4 @@ Refer to this set of ViT [notebooks](https://github.com/NielsRogge/Transformers-
|
||||
|
||||
[[autodoc]] CvtForImageClassification
|
||||
- forward
|
||||
|
||||
|
@@ -15,7 +15,6 @@ limitations under the License.
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-10-09.*
|
||||
|
||||
# Code World Model (CWM)
|
||||
|
||||
|
@@ -13,55 +13,56 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-10-17 and added to Hugging Face Transformers on 2025-04-29.*
|
||||
|
||||
*This model was released on 2024-10-17 and added to Hugging Face Transformers on 2025-04-29 and contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).*
|
||||
|
||||
# D-FINE
|
||||
|
||||
## Overview
|
||||
[D-FINE](https://huggingface.co/papers/2410.13842) redefines bounding box regression in DETR models through Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR iteratively refines probability distributions for enhanced localization accuracy, while GO-LSD optimizes localization knowledge transfer and simplifies residual predictions. D-FINE includes lightweight optimizations for speed and accuracy, achieving 54.0% / 55.8% AP on COCO at 124 / 78 FPS on an NVIDIA T4 GPU. Pretrained on Objects365, D-FINE-L / X reaches 57.1% / 59.3% AP, outperforming existing real-time detectors. The method improves various DETR models by up to 5.3% AP with minimal additional parameters and training costs.
|
||||
|
||||
The D-FINE model was proposed in [D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement](https://huggingface.co/papers/2410.13842) by
|
||||
Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The abstract from the paper is the following:
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
*We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
|
||||
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.*
|
||||
|
||||
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
|
||||
The original code can be found [here](https://github.com/Peterande/D-FINE).
|
||||
|
||||
## Usage tips
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
>>> from transformers.image_utils import load_image
|
||||
>>> from transformers import DFineForObjectDetection, AutoImageProcessor
|
||||
|
||||
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
|
||||
>>> image = load_image(url)
|
||||
|
||||
>>> image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine_x_coco")
|
||||
>>> model = DFineForObjectDetection.from_pretrained("ustc-community/dfine_x_coco")
|
||||
|
||||
>>> inputs = image_processor(images=image, return_tensors="pt")
|
||||
|
||||
>>> with torch.no_grad():
|
||||
... outputs = model(**inputs)
|
||||
|
||||
>>> results = image_processor.post_process_object_detection(outputs, target_sizes=[(image.height, image.width)], threshold=0.5)
|
||||
|
||||
>>> for result in results:
|
||||
... for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
|
||||
... score, label = score.item(), label_id.item()
|
||||
... box = [round(i, 2) for i in box.tolist()]
|
||||
... print(f"{model.config.id2label[label]}: {score:.2f} {box}")
|
||||
cat: 0.96 [344.49, 23.4, 639.84, 374.27]
|
||||
cat: 0.96 [11.71, 53.52, 316.64, 472.33]
|
||||
remote: 0.95 [40.46, 73.7, 175.62, 117.57]
|
||||
sofa: 0.92 [0.59, 1.88, 640.25, 474.74]
|
||||
remote: 0.89 [333.48, 77.04, 370.77, 187.3]
|
||||
pipeline = pipeline(task="object-detection", model="ustc-community/dfine-xlarge-coco", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForObjectDetection
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine-xlarge-coco")
|
||||
model = AutoModelForObjectDetection.from_pretrained("ustc-community/dfine-xlarge-coco", dtype="auto")
|
||||
|
||||
inputs = image_processor(images=image, return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
target_sizes = torch.tensor([image.size[::-1]])
|
||||
results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
|
||||
0
|
||||
]
|
||||
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
|
||||
box = [round(i, 2) for i in box.tolist()]
|
||||
print(
|
||||
f"Detected {model.config.id2label[label.item()]} with confidence "
|
||||
f"{round(score.item(), 3)} at location {box}"
|
||||
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## DFineConfig
|
||||
|
||||
[[autodoc]] DFineConfig
|
||||
@ -75,3 +76,4 @@ remote: 0.89 [333.48, 77.04, 370.77, 187.3]
|
||||
|
||||
[[autodoc]] DFineForObjectDetection
|
||||
- forward
|
||||
|
||||
|
@ -13,106 +13,54 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-01-28 and added to Hugging Face Transformers on 2025-02-04.*
|
||||
*This model was released on 2022-01-28 and added to Hugging Face Transformers on 2025-02-04 and contributed by [davidhajdu](https://huggingface.co/davidhajdu).*
|
||||
|
||||
# DAB-DETR
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[DAB-DETR](https://huggingface.co/papers/2201.12329) introduces a novel query formulation using dynamic anchor boxes for DETR. This approach directly employs box coordinates as queries in Transformer decoders, updating them iteratively. By leveraging explicit positional priors and box dimensions, it enhances query-to-feature similarity and accelerates training convergence. This method achieves top performance on the MS-COCO benchmark, reaching 45.7% AP with a ResNet-50-DC5 backbone after 50 epochs. Extensive experiments validate the effectiveness of this design.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The DAB-DETR model was proposed in [DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR](https://huggingface.co/papers/2201.12329) by Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang.
|
||||
DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_convergence_plot.png"
|
||||
alt="drawing" width="600"/>
|
||||
pipeline = pipeline(task="object-detection", model="IDEA-Research/dab-detr-resnet-50", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
The abstract from the paper is the following:
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
*We present in this paper a novel query formulation using dynamic anchor boxes
|
||||
for DETR (DEtection TRansformer) and offer a deeper understanding of the role
|
||||
of queries in DETR. This new formulation directly uses box coordinates as queries
|
||||
in Transformer decoders and dynamically updates them layer-by-layer. Using box
|
||||
coordinates not only helps using explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR,
|
||||
but also allows us to modulate the positional attention map using the box width
|
||||
and height information. Such a design makes it clear that queries in DETR can be
|
||||
implemented as performing soft ROI pooling layer-by-layer in a cascade manner.
|
||||
As a result, it leads to the best performance on MS-COCO benchmark among
|
||||
the DETR-like detection models under the same setting, e.g., AP 45.7% using
|
||||
ResNet50-DC5 as backbone trained in 50 epochs. We also conducted extensive
|
||||
experiments to confirm our analysis and verify the effectiveness of our methods.*
|
||||
|
||||
This model was contributed by [davidhajdu](https://huggingface.co/davidhajdu).
|
||||
The original code can be found [here](https://github.com/IDEA-Research/DAB-DETR).
|
||||
|
||||
## How to Get Started with the Model
|
||||
|
||||
Use the code below to get started with the model.
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
|
||||
from PIL import Image
|
||||
from transformers import AutoModelForObjectDetection, AutoImageProcessor
|
||||
from transformers import AutoImageProcessor, AutoModelForObjectDetection
|
||||
|
||||
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50")
|
||||
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50")
|
||||
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50", dtype="auto")
|
||||
|
||||
inputs = image_processor(images=image, return_tensors="pt")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)
|
||||
|
||||
for result in results:
|
||||
for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
|
||||
score, label = score.item(), label_id.item()
|
||||
box = [round(i, 2) for i in box.tolist()]
|
||||
print(f"{model.config.id2label[label]}: {score:.2f} {box}")
|
||||
outputs = model(**inputs)
|
||||
target_sizes = torch.tensor([image.size[::-1]])
|
||||
results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
|
||||
0
|
||||
]
|
||||
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
|
||||
box = [round(i, 2) for i in box.tolist()]
|
||||
print(
|
||||
f"Detected {model.config.id2label[label.item()]} with confidence "
|
||||
f"{round(score.item(), 3)} at location {box}"
|
||||
)
|
||||
```
|
||||
|
||||
This should output
|
||||
|
||||
```text
|
||||
cat: 0.87 [14.7, 49.39, 320.52, 469.28]
|
||||
remote: 0.86 [41.08, 72.37, 173.39, 117.2]
|
||||
cat: 0.86 [344.45, 19.43, 639.85, 367.86]
|
||||
remote: 0.61 [334.27, 75.93, 367.92, 188.81]
|
||||
couch: 0.59 [-0.04, 1.34, 639.9, 477.09]
|
||||
```
|
||||
|
||||
There are three other ways to instantiate a DAB-DETR model (depending on what you prefer):
|
||||
|
||||
Option 1: Instantiate DAB-DETR with pre-trained weights for entire model
|
||||
|
||||
```py
|
||||
>>> from transformers import DabDetrForObjectDetection
|
||||
|
||||
>>> model = DabDetrForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50")
|
||||
```
|
||||
|
||||
Option 2: Instantiate DAB-DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone
|
||||
|
||||
```py
|
||||
>>> from transformers import DabDetrConfig, DabDetrForObjectDetection
|
||||
|
||||
>>> config = DabDetrConfig()
|
||||
>>> model = DabDetrForObjectDetection(config)
|
||||
```
|
||||
|
||||
Option 3: Instantiate DAB-DETR with randomly initialized weights for backbone + Transformer
|
||||
|
||||
```py
|
||||
>>> config = DabDetrConfig(use_pretrained_backbone=False)
|
||||
>>> model = DabDetrForObjectDetection(config)
|
||||
```
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## DabDetrConfig
|
||||
|
||||
@ -127,3 +75,4 @@ Option 3: Instantiate DAB-DETR with randomly initialized weights for backbone +
|
||||
|
||||
[[autodoc]] DabDetrForObjectDetection
|
||||
- forward
|
||||
|
||||
|
@ -13,59 +13,36 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2023-06-11 and added to Hugging Face Transformers on 2024-08-19.*
|
||||
*This model was released on 2023-06-11 and added to Hugging Face Transformers on 2024-08-19 and contributed by [kamilakesbi](https://huggingface.co/kamilakesbi).*
|
||||
|
||||
# DAC
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[DAC](https://huggingface.co/papers/2306.06546) is a high-fidelity universal neural audio compression algorithm that compresses 44.1 KHz audio into tokens at 8kbps bandwidth, achieving approximately 90x compression. It combines advancements in high-fidelity audio generation with improved vector quantization techniques from the image domain, enhanced adversarial and reconstruction losses, and a single universal model for various audio domains including speech, environment, and music. This method outperforms competing audio compression algorithms and is supported by open-source code and trained model weights.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="DacModel">
|
||||
|
||||
The DAC model was proposed in [Descript Audio Codec: High-Fidelity Audio Compression with Improved RVQGAN](https://huggingface.co/papers/2306.06546) by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar.
|
||||
```py
|
||||
from datasets import load_dataset, Audio
|
||||
from transformers import DacModel, AutoProcessor
|
||||
|
||||
The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 KHz audio into tokens at just 8kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
|
||||
model = DacModel.from_pretrained("descript/dac_16khz", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("descript/dac_16khz")
|
||||
|
||||
The abstract from the paper is the following:
|
||||
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
|
||||
audio_sample = librispeech_dummy[-1]["audio"]["array"]
|
||||
inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")
|
||||
|
||||
*Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.*
|
||||
|
||||
This model was contributed by [Kamil Akesbi](https://huggingface.co/kamilakesbi).
|
||||
The original code can be found [here](https://github.com/descriptinc/descript-audio-codec/tree/main?tab=readme-ov-file).
|
||||
|
||||
## Model structure
|
||||
|
||||
The Descript Audio Codec (DAC) model is structured into three distinct stages:
|
||||
|
||||
1. Encoder Model: This stage compresses the input audio, reducing its size while retaining essential information.
|
||||
2. Residual Vector Quantizer (RVQ) Model: Working in tandem with the encoder, this model quantizes the latent codes of the audio, refining the compression and ensuring high-quality reconstruction.
|
||||
3. Decoder Model: This final stage reconstructs the audio from its compressed form, restoring it to a state that closely resembles the original input.
|
||||
|
||||
## Usage example
|
||||
|
||||
Here is a quick example of how to encode and decode an audio using this model:
|
||||
|
||||
```python
|
||||
>>> from datasets import load_dataset, Audio
|
||||
>>> from transformers import DacModel, AutoProcessor
|
||||
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
|
||||
>>> model = DacModel.from_pretrained("descript/dac_16khz")
|
||||
>>> processor = AutoProcessor.from_pretrained("descript/dac_16khz")
|
||||
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
|
||||
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"]
|
||||
>>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")
|
||||
|
||||
>>> encoder_outputs = model.encode(inputs["input_values"])
|
||||
>>> # Get the intermediate audio codes
|
||||
>>> audio_codes = encoder_outputs.audio_codes
|
||||
>>> # Reconstruct the audio from its quantized representation
|
||||
>>> audio_values = model.decode(encoder_outputs.quantized_representation)
|
||||
>>> # or the equivalent with a forward pass
|
||||
>>> audio_values = model(inputs["input_values"]).audio_values
|
||||
encoder_outputs = model.encode(inputs["input_values"])
|
||||
audio_codes = encoder_outputs.audio_codes
|
||||
audio_values = model.decode(encoder_outputs.quantized_representation)
|
||||
audio_values = model(inputs["input_values"]).audio_values
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## DacConfig
|
||||
|
||||
[[autodoc]] DacConfig
|
||||
@ -81,3 +58,4 @@ Here is a quick example of how to encode and decode an audio using this model:
|
||||
- decode
|
||||
- encode
|
||||
- forward
|
||||
|
||||
|
@ -13,115 +13,42 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-02-07 and added to Hugging Face Transformers on 2022-03-01.*
|
||||
*This model was released on 2022-02-07 and added to Hugging Face Transformers on 2022-03-01 and contributed by [edugp](https://huggingface.co/edugp) and [patrickvonplaten](https://huggingface.co/patrickvonplaten).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Data2Vec
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Data2Vec](https://huggingface.co/papers/2202.03555) presents a unified framework for self-supervised learning applicable to speech, NLP, and computer vision. It employs a self-distillation setup using a standard Transformer architecture to predict latent representations of the full input data based on a masked view. Unlike traditional methods that predict modality-specific, local targets, data2vec focuses on predicting contextualized latent representations that encapsulate information from the entire input. Experiments across speech recognition, image classification, and natural language understanding benchmarks show state-of-the-art or competitive performance.
|
||||
|
||||
## Overview
|
||||
|
||||
The Data2Vec model was proposed in [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://huggingface.co/papers/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu and Michael Auli.
|
||||
Data2Vec proposes a unified framework for self-supervised learning across different data modalities - text, audio and images.
|
||||
Importantly, predicted targets for pre-training are contextualized latent representations of the inputs, rather than modality-specific, context-independent targets.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*While the general idea of self-supervised learning is identical across modalities, the actual algorithms and
|
||||
objectives differ widely because they were developed with a single modality in mind. To get us closer to general
|
||||
self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech,
|
||||
NLP or computer vision. The core idea is to predict latent representations of the full input data based on a
|
||||
masked view of the input in a selfdistillation setup using a standard Transformer architecture.
|
||||
Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which
|
||||
are local in nature, data2vec predicts contextualized latent representations that contain information from
|
||||
the entire input. Experiments on the major benchmarks of speech recognition, image classification, and
|
||||
natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
|
||||
Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/data2vec.*
|
||||
|
||||
This model was contributed by [edugp](https://huggingface.co/edugp) and [patrickvonplaten](https://huggingface.co/patrickvonplaten).
|
||||
|
||||
The original code (for NLP and Speech) can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/data2vec).
|
||||
The original code for vision can be found [here](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method.
|
||||
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction.
|
||||
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
|
||||
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
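A minimal sketch of loading the matching preprocessors; the checkpoint names are the public Data2Vec base checkpoints and are only illustrative:

```py
from transformers import AutoProcessor, AutoTokenizer, AutoImageProcessor

# Each modality reuses the preprocessing of its parent architecture
audio_processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")   # Wav2Vec2-style
text_tokenizer = AutoTokenizer.from_pretrained("facebook/data2vec-text-base")          # RoBERTa-style
image_processor = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")  # BEiT-style
```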
|
||||
|
||||
### Using Scaled Dot Product Attention (SDPA)
|
||||
|
||||
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
|
||||
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
|
||||
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
|
||||
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
|
||||
page for more information.
|
||||
|
||||
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
|
||||
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
|
||||
|
||||
The SDPA implementation is currently available for the Data2VecAudio and Data2VecVision models.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Data2VecAudioForCTC">
|
||||
|
||||
```py
|
||||
from transformers import Data2VecVisionForImageClassification
|
||||
model = Data2VecVisionForImageClassification.from_pretrained("facebook/data2vec-vision-base", attn_implementation="sdpa", dtype=torch.float16)
|
||||
...
|
||||
import torch
|
||||
from datasets import load_dataset
|
||||
from transformers import AutoProcessor, Data2VecAudioForCTC
|
||||
|
||||
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation").sort("id")
|
||||
sampling_rate = dataset.features["audio"].sampling_rate
|
||||
|
||||
processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")
|
||||
model = AutoModelForCTC.from_pretrained("facebook/data2vec-audio-base-960h", dtype="auto")
|
||||
|
||||
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
predicted_ids = torch.argmax(logits, dim=-1)
|
||||
print(f"Transcription: {processor.batch_decode(predicted_ids)[0]}")
|
||||
```
|
||||
|
||||
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
|
||||
|
||||
For the Data2VecVision model, on a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.5.1, OS Ubuntu 20.04)
|
||||
with `float16` and `facebook/data2vec-vision-base` model, we saw the following improvements during training and
|
||||
inference:
|
||||
|
||||
#### Training
|
||||
|
||||
| num_training_steps | batch_size | image_size | is_cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|
||||
|--------------------|------------|--------------|---------|----------------------------|---------------------------|-------------|----------------------|--------------------|----------------|
|
||||
| 50 | 2 | (1048, 640) | True | 0.996 | 0.754 | 32.147 | 6722.198 | 4264.653 | 57.626 |
|
||||
|
||||
#### Inference
|
||||
|
||||
| Image batch size | Eager (s/iter) | Eager CI, % | Eager memory (MB) | SDPA (s/iter) | SDPA CI, % | SDPA memory (MB) | SDPA speedup | SDPA memory saved |
|
||||
|-------------------:|-----------------:|:--------------|--------------------:|----------------:|:-------------|-------------------:|---------------:|--------------------:|
|
||||
| 1 | 0.011 | ±0.3% | 3.76143e+08 | 0.01 | ±0.3% | 3.74397e+08 | 1.101 | 0.466 |
|
||||
| 4 | 0.014 | ±0.1% | 4.02756e+08 | 0.012 | ±0.2% | 3.91373e+08 | 1.219 | 2.909 |
|
||||
| 16 | 0.046 | ±0.3% | 4.96482e+08 | 0.035 | ±0.2% | 4.51017e+08 | 1.314 | 10.081 |
|
||||
| 32 | 0.088 | ±0.1% | 6.23903e+08 | 0.067 | ±0.1% | 5.32974e+08 | 1.33 | 17.061 |
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Data2Vec.
|
||||
|
||||
<PipelineTag pipeline="image-classification"/>
|
||||
|
||||
- [`Data2VecVisionForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
|
||||
**Data2VecText documentation resources**
|
||||
|
||||
- [Text classification task guide](../tasks/sequence_classification)
|
||||
- [Token classification task guide](../tasks/token_classification)
|
||||
- [Question answering task guide](../tasks/question_answering)
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
- [Masked language modeling task guide](../tasks/masked_language_modeling)
|
||||
- [Multiple choice task guide](../tasks/multiple_choice)
|
||||
|
||||
**Data2VecAudio documentation resources**
|
||||
|
||||
- [Audio classification task guide](../tasks/audio_classification)
|
||||
- [Automatic speech recognition task guide](../tasks/asr)
|
||||
|
||||
**Data2VecVision documentation resources**
|
||||
|
||||
- [Image classification](../tasks/image_classification)
|
||||
- [Semantic segmentation](../tasks/semantic_segmentation)
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Data2VecTextConfig
|
||||
|
||||
@ -209,3 +136,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] Data2VecVisionForSemanticSegmentation
|
||||
- forward
|
||||
|
||||
|
@ -9,105 +9,47 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
*This model was released on 2024-03-27 and added to Hugging Face Transformers on 2024-04-18.*
|
||||
*This model was released on {release_date} and added to Hugging Face Transformers on 2024-04-18 and contributed by [eitanturok](https://huggingface.co/eitanturok) and [abhi-db](https://huggingface.co/abhi-db).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# DBRX
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) is an open, general-purpose large language model introduced by Databricks that achieves state-of-the-art performance among open LLMs, surpassing GPT-3.5 and competing with Gemini 1.0 Pro. It uses a fine-grained mixture-of-experts (MoE) architecture, making inference up to 2x faster than LLaMA2-70B and about 4x more compute-efficient than Databricks' previous MPT models. DBRX excels at programming tasks, outperforming specialized models like CodeLLaMA-70B, and is strong in language understanding and math benchmarks. Both base and instruction-tuned versions are openly released on Hugging Face, with availability via APIs and integration into Databricks products.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
DBRX is a [transformer-based](https://www.isattentionallyouneed.com/) decoder-only large language model (LLM) that was trained using next-token prediction.
|
||||
It uses a *fine-grained* mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input.
|
||||
It was pre-trained on 12T tokens of text and code data.
|
||||
Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2.
|
||||
This provides 65x more possible combinations of experts and we found that this improves model quality.
|
||||
DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).
|
||||
It is a BPE based model and uses the GPT-4 tokenizer as described in the [tiktoken](https://github.com/openai/tiktoken) repository.
|
||||
We made these choices based on exhaustive evaluation and scaling experiments.
|
||||
|
||||
DBRX was pretrained on 12T tokens of carefully curated data and a maximum context length of 32K tokens.
|
||||
We estimate that this data is at least 2x better token-for-token than the data we used to pretrain the MPT family of models.
|
||||
This new dataset was developed using the full suite of Databricks tools, including Apache Spark™ and Databricks notebooks for data processing, and Unity Catalog for data management and governance.
|
||||
We used curriculum learning for pretraining, changing the data mix during training in ways we found to substantially improve model quality.
|
||||
|
||||
More detailed information about DBRX Instruct and DBRX Base can be found in our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
|
||||
|
||||
This model was contributed by [eitan-turok](https://huggingface.co/eitanturok) and [abhi-db](https://huggingface.co/abhi-db). The original code can be found [here](https://github.com/databricks/dbrx-instruct), though this may not be up to date.
|
||||
|
||||
## Usage Examples
|
||||
|
||||
The `generate()` method can be used to generate text using DBRX. You can generate using the standard attention implementation, flash-attention, and the PyTorch scaled dot product attention. The last two attention implementations give speed ups.
|
||||
|
||||
```python
|
||||
from transformers import DbrxForCausalLM, AutoTokenizer
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", token="YOUR_HF_TOKEN")
|
||||
model = DbrxForCausalLM.from_pretrained(
|
||||
"databricks/dbrx-instruct",
|
||||
device_map="auto",
|
||||
dtype=torch.bfloat16,
|
||||
token="YOUR_HF_TOKEN",
|
||||
)
|
||||
pipeline = pipeline("text-generation", model="databricks/dbrx-instruct", dtype="auto")
|
||||
pipeline("Plants create energy through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
input_text = "What does it take to build a great LLM?"
|
||||
messages = [{"role": "user", "content": input_text}]
|
||||
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
outputs = model.generate(**input_ids, max_new_tokens=200)
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct")
|
||||
|
||||
inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
If you have flash-attention installed (`pip install flash-attn`), it is possible to generate faster. (The HuggingFace documentation for flash-attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2).)
|
||||
|
||||
```python
|
||||
from transformers import DbrxForCausalLM, AutoTokenizer
|
||||
import torch
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", token="YOUR_HF_TOKEN")
|
||||
model = DbrxForCausalLM.from_pretrained(
|
||||
"databricks/dbrx-instruct",
|
||||
device_map="auto",
|
||||
dtype=torch.bfloat16,
|
||||
token="YOUR_HF_TOKEN",
|
||||
attn_implementation="flash_attention_2",
|
||||
)
|
||||
|
||||
input_text = "What does it take to build a great LLM?"
|
||||
messages = [{"role": "user", "content": input_text}]
|
||||
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
|
||||
outputs = model.generate(**input_ids, max_new_tokens=200)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
You can also generate faster using the PyTorch scaled dot product attention. (The HuggingFace documentation for scaled dot product attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).)
|
||||
|
||||
```python
|
||||
from transformers import DbrxForCausalLM, AutoTokenizer
|
||||
import torch
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", token="YOUR_HF_TOKEN")
|
||||
model = DbrxForCausalLM.from_pretrained(
|
||||
"databricks/dbrx-instruct",
|
||||
device_map="auto",
|
||||
dtype=torch.bfloat16,
|
||||
token="YOUR_HF_TOKEN",
|
||||
attn_implementation="sdpa",
|
||||
)
|
||||
|
||||
input_text = "What does it take to build a great LLM?"
|
||||
messages = [{"role": "user", "content": input_text}]
|
||||
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
|
||||
outputs = model.generate(**input_ids, max_new_tokens=200)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## DbrxConfig
|
||||
|
||||
@ -122,3 +64,4 @@ print(tokenizer.decode(outputs[0]))
|
||||
|
||||
[[autodoc]] DbrxForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -13,26 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-06-05 and added to Hugging Face Transformers on 2021-02-19.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white" >
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2020-06-05 and added to Hugging Face Transformers on 2021-02-19 and contributed by [DeBERTa](https://huggingface.co/DeBERTa).*
|
||||
|
||||
# DeBERTa-v2
|
||||
|
||||
[DeBERTa-v2](https://huggingface.co/papers/2006.03654) improves on the original [DeBERTa](./deberta) architecture by using a SentencePiece-based tokenizer and a new vocabulary size of 128K. It also adds an additional convolutional layer within the first transformer layer to better learn local dependencies of input tokens. Finally, the position projection and content projection matrices are shared in the attention layer to reduce the number of parameters.
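A hedged sketch of inspecting those architectural changes on a pretrained checkpoint; the attribute names come from the checkpoint's `config.json`, so they're read defensively with `getattr`:

```py
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("microsoft/deberta-v2-xlarge")
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

print(type(tokenizer).__name__)                   # SentencePiece-based DeBERTa-v2 tokenizer
print(config.vocab_size)                          # ~128K vocabulary, as noted above
print(getattr(config, "conv_kernel_size", None))  # convolution in the first transformer layer
print(getattr(config, "share_att_key", None))     # shared position/content projection matrices
```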
|
||||
|
||||
You can find all the original [DeBERTa-v2] checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta-v2) organization.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [Pengcheng He](https://huggingface.co/DeBERTa).
|
||||
>
|
||||
> Click on the DeBERTa-v2 models in the right sidebar for more examples of how to apply DeBERTa-v2 to different language tasks.
|
||||
|
||||
The example below demonstrates how to classify text with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[DeBERTa](https://huggingface.co/papers/2006.03654) enhances BERT and RoBERTa with disentangled attention and an improved mask decoder. Disentangled attention uses separate vectors for content and position, while the mask decoder replaces the softmax layer for better pretraining efficiency. DeBERTa v2 introduces a new vocabulary, nGiE for local dependencies, shared position and content projection matrices, bucket-encoded relative positions, and additional model sizes of 900M and 1.5B, achieving superior performance on various NLP tasks.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,14 +26,8 @@ The example below demonstrates how to classify text with [`Pipeline`] or the [`A
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text-classification",
|
||||
model="microsoft/deberta-v2-xlarge-mnli",
|
||||
device=0,
|
||||
dtype=torch.float16
|
||||
)
|
||||
result = pipeline("DeBERTa-v2 is great at understanding context!")
|
||||
print(result)
|
||||
pipeline = pipeline(task="fill-mask", model="microsoft/deberta-v2-xlarge", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -56,68 +35,22 @@ print(result)
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"microsoft/deberta-v2-xlarge-mnli"
|
||||
)
|
||||
model = AutoModelForSequenceClassification.from_pretrained(
|
||||
"microsoft/deberta-v2-xlarge-mnli",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-v2-xlarge", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
|
||||
|
||||
inputs = tokenizer("DeBERTa-v2 is great at understanding context!", return_tensors="pt").to(model.device)
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
|
||||
logits = outputs.logits
|
||||
predicted_class_id = logits.argmax().item()
|
||||
predicted_label = model.config.id2label[predicted_class_id]
|
||||
print(f"Predicted label: {predicted_label}")
|
||||
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "DeBERTa-v2 is great at understanding context!" | transformers run --task fill-mask --model microsoft/deberta-v2-xlarge-mnli --device 0
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes quantization](../quantization/bitsandbytes) to only quantize the weights to 4-bit.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
|
||||
|
||||
model_id = "microsoft/deberta-v2-xlarge-mnli"
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype="float16",
|
||||
bnb_4bit_use_double_quant=True,
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
model = AutoModelForSequenceClassification.from_pretrained(
|
||||
model_id,
|
||||
quantization_config=quantization_config,
|
||||
dtype="float16"
|
||||
)
|
||||
|
||||
inputs = tokenizer("DeBERTa-v2 is great at understanding context!", return_tensors="pt").to(model.device)
|
||||
outputs = model(**inputs)
|
||||
logits = outputs.logits
|
||||
predicted_class_id = logits.argmax().item()
|
||||
predicted_label = model.config.id2label[predicted_class_id]
|
||||
print(f"Predicted label: {predicted_label}")
|
||||
|
||||
```
|
||||
|
||||
## DebertaV2Config
|
||||
|
||||
[[autodoc]] DebertaV2Config
|
||||
@ -170,3 +103,4 @@ print(f"Predicted label: {predicted_label}")
|
||||
|
||||
[[autodoc]] DebertaV2ForMultipleChoice
|
||||
- forward
|
||||
|
||||
|
@ -13,28 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-06-05 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2020-06-05 and added to Hugging Face Transformers on 2020-11-16 and contributed by [DeBERTa](https://huggingface.co/DeBERTa).*
|
||||
|
||||
# DeBERTa
|
||||
|
||||
[DeBERTa](https://huggingface.co/papers/2006.03654) improves the pretraining efficiency of BERT and RoBERTa with two key ideas, disentangled attention and an enhanced mask decoder. Instead of mixing everything together like BERT, DeBERTa separates a word's *content* from its *position* and processes them independently. This gives it a clearer sense of what's being said and where in the sentence it's happening.
|
||||
|
||||
The enhanced mask decoder replaces the traditional softmax decoder to make better predictions.
|
||||
|
||||
Even with less training data than RoBERTa, DeBERTa manages to outperform it on several benchmarks.
|
||||
|
||||
You can find all the original DeBERTa checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the DeBERTa models in the right sidebar for more examples of how to apply DeBERTa to different language tasks.
|
||||
|
||||
The example below demonstrates how to classify text with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[DeBERTa](https://huggingface.co/papers/2006.03654) improves upon BERT and RoBERTa through disentangled attention and an enhanced mask decoder. Disentangled attention uses separate vectors for content and position, computing attention weights with disentangled matrices. The enhanced mask decoder replaces the softmax layer for predicting masked tokens during pretraining. These techniques boost pretraining efficiency and downstream task performance, with DeBERTa outperforming RoBERTa-Large on MNLI, SQuAD v2.0, and RACE using half the training data.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -43,16 +26,8 @@ The example below demonstrates how to classify text with [`Pipeline`], [`AutoMod
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
classifier = pipeline(
|
||||
task="text-classification",
|
||||
model="microsoft/deberta-base-mnli",
|
||||
device=0,
|
||||
)
|
||||
|
||||
classifier({
|
||||
"text": "A soccer game with multiple people playing.",
|
||||
"text_pair": "Some people are playing a sport."
|
||||
})
|
||||
pipeline = pipeline(task="fill-mask", model="microsoft/deberta-base", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -60,42 +35,26 @@ classifier({
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
model_name = "microsoft/deberta-base-mnli"
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base-mnli")
|
||||
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-base-mnli", device_map="auto")
|
||||
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
|
||||
|
||||
inputs = tokenizer(
|
||||
"A soccer game with multiple people playing.",
|
||||
"Some people are playing a sport.",
|
||||
return_tensors="pt"
|
||||
).to(model.device)
|
||||
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
predicted_class = logits.argmax().item()
|
||||
|
||||
labels = ["contradiction", "neutral", "entailment"]
|
||||
print(f"The predicted relation is: {labels[predicted_class]}")
|
||||
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e '{"text": "A soccer game with multiple people playing.", "text_pair": "Some people are playing a sport."}' | transformers run --task text-classification --model microsoft/deberta-base-mnli --device 0
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- DeBERTa uses **relative position embeddings**, so it does not require **right-padding** like BERT.
|
||||
- For best results, use DeBERTa on sentence-level or sentence-pair classification tasks like MNLI, RTE, or SST-2.
|
||||
- If you're using DeBERTa for token-level tasks like masked language modeling, make sure to load a checkpoint specifically pretrained or fine-tuned for token-level tasks.
|
||||
- DeBERTa uses relative position embeddings. It doesn't require right-padding like BERT.
|
||||
- Use DeBERTa on sentence-level or sentence-pair classification tasks like MNLI, RTE, or SST-2 for best results.
|
||||
- For token-level tasks like masked language modeling, load a checkpoint specifically pretrained or fine-tuned for token-level tasks.
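A minimal sketch of the sentence-pair setup, reusing the `microsoft/deberta-base-mnli` checkpoint from the earlier example on this page:

```py
from transformers import pipeline

# Sentence-pair (NLI) classification with an MNLI fine-tuned checkpoint
nli = pipeline(task="text-classification", model="microsoft/deberta-base-mnli")
print(nli({"text": "A soccer game with multiple people playing.", "text_pair": "Some people are playing a sport."}))
```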
|
||||
|
||||
## DebertaConfig
|
||||
|
||||
@ -143,3 +102,4 @@ echo -e '{"text": "A soccer game with multiple people playing.", "text_pair": "S
|
||||
|
||||
[[autodoc]] DebertaForQuestionAnswering
|
||||
- forward
|
||||
|
||||
|
@ -13,34 +13,48 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-06-02 and added to Hugging Face Transformers on 2022-03-23.*
|
||||
*This model was released on 2021-06-02 and added to Hugging Face Transformers on 2022-03-23 and contributed by [edbeeching](https://huggingface.co/edbeeching).*
|
||||
|
||||
# Decision Transformer
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Decision Transformer: Reinforcement Learning via Sequence Modeling](https://huggingface.co/papers/2106.01345) reframes reinforcement learning as a conditional sequence modeling problem, using a causally masked Transformer architecture instead of traditional value functions or policy gradients. It generates actions by autoregressively conditioning on past states, actions, and a desired return, allowing the model to produce future actions that achieve specified rewards. This approach leverages advances from language modeling, such as GPT and BERT, for scalability and simplicity. Despite its straightforward design, Decision Transformer matches or surpasses state-of-the-art model-free offline RL performance on benchmarks like Atari, OpenAI Gym, and Key-to-Door tasks.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="DecisionTransformerModel">
|
||||
|
||||
The Decision Transformer model was proposed in [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://huggingface.co/papers/2106.01345)
|
||||
by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
|
||||
```py
|
||||
import torch
|
||||
from transformers import DecisionTransformerModel
|
||||
|
||||
The abstract from the paper is the following:
|
||||
model = DecisionTransformerModel.from_pretrained("edbeeching/decision-transformer-gym-hopper-medium", dtype="auto")
|
||||
model.eval()
|
||||
|
||||
*We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
|
||||
This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances
|
||||
in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that
|
||||
casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or
|
||||
compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked
|
||||
Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our
|
||||
Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity,
|
||||
Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on
|
||||
Atari, OpenAI Gym, and Key-to-Door tasks.*
|
||||
import gym  # requires the `gym` package with the MuJoCo Hopper environment

device = torch.device("cpu")
TARGET_RETURN = 3600  # illustrative target episode return for Hopper; tune for your use case

env = gym.make("Hopper-v3")
|
||||
state_dim = env.observation_space.shape[0]
|
||||
act_dim = env.action_space.shape[0]
|
||||
|
||||
This version of the model is for tasks where the state is a vector.
|
||||
state = env.reset()
|
||||
states = torch.from_numpy(state).reshape(1, 1, state_dim).to(device=device, dtype=torch.float32)
|
||||
actions = torch.zeros((1, 1, act_dim), device=device, dtype=torch.float32)
|
||||
rewards = torch.zeros(1, 1, device=device, dtype=torch.float32)
|
||||
target_return = torch.tensor(TARGET_RETURN, dtype=torch.float32).reshape(1, 1)
|
||||
timesteps = torch.tensor(0, device=device, dtype=torch.long).reshape(1, 1)
|
||||
attention_mask = torch.zeros(1, 1, device=device, dtype=torch.float32)
|
||||
|
||||
This model was contributed by [edbeeching](https://huggingface.co/edbeeching). The original code can be found [here](https://github.com/kzl/decision-transformer).
|
||||
with torch.no_grad():
|
||||
state_preds, action_preds, return_preds = model(
|
||||
states=states,
|
||||
actions=actions,
|
||||
rewards=rewards,
|
||||
returns_to_go=target_return,
|
||||
timesteps=timesteps,
|
||||
attention_mask=attention_mask,
|
||||
return_dict=False,
|
||||
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## DecisionTransformerConfig
|
||||
|
||||
@ -55,3 +69,4 @@ This model was contributed by [edbeeching](https://huggingface.co/edbeeching). T
|
||||
|
||||
[[autodoc]] DecisionTransformerModel
|
||||
- forward
|
||||
|
||||
|
@ -13,25 +13,31 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-05-07 and added to Hugging Face Transformers on 2025-07-09.*
|
||||
|
||||
*This model was released on 2024-05-07 and added to Hugging Face Transformers on 2025-07-09 and contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).*
|
||||
|
||||
# DeepSeek-V2
|
||||
|
||||
## Overview
|
||||
|
||||
The DeepSeek-V2 model was proposed in [DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model](https://huggingface.co/papers/2405.04434) by DeepSeek-AI Team.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
|
||||
[DeepSeek-V2](https://huggingface.co/papers/2405.04434) is a Mixture-of-Experts (MoE) language model with 236B total parameters, where 21B are active per token, and supports a 128K token context length. It utilizes Multi-head Latent Attention (MLA) to compress the Key-Value (KV) cache and DeepSeekMoE for cost-effective training. Compared to DeepSeek 67B, DeepSeek-V2 offers superior performance, reduced training costs by 42.5%, decreased KV cache by 93.3%, and increased generation throughput by 5.76 times. Trained on an 8.1T token corpus and enhanced with Supervised Fine-Tuning and Reinforcement Learning, DeepSeek-V2 achieves top-tier performance with only 21B active parameters.
|
||||
|
||||
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
|
||||
The original code can be found [here](https://huggingface.co/deepseek-ai/DeepSeek-V2).
|
||||
|
||||
### Usage tips
|
||||
|
||||
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It can be used for various language tasks after being pre-trained on 8.1 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
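For chat-style usage, the instruction-tuned checkpoint works with the tokenizer's chat template. A minimal sketch; the `deepseek-ai/DeepSeek-V2-Lite-Chat` checkpoint name is an assumption, swap in the variant you actually use:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumed chat checkpoint; the base "deepseek-ai/DeepSeek-V2-Lite" works the same way without a chat template
model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain Multi-head Latent Attention in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```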
## DeepseekV2Config
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V2-Lite")
|
||||
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite")
|
||||
|
||||
inputs = tokenizer("Hello, my name is", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
|
||||
[[autodoc]] DeepseekV2Config
|
||||
|
||||
@ -49,3 +55,4 @@ The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures f
|
||||
|
||||
[[autodoc]] DeepseekV2ForSequenceClassification
|
||||
- forward
|
||||
|
||||
|
@ -13,16 +13,23 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
*This model was released on 2024-12-27 and added to Hugging Face Transformers on 2025-03-28.*
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3-Base")
|
||||
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3-Base")
|
||||
|
||||
inputs = tokenizer("Hello, my name is", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
# DeepSeek-V3
|
||||
|
||||
## Overview
|
||||
|
||||
The DeepSeek-V3 model was proposed in [DeepSeek-V3 Technical Report](https://huggingface.co/papers/2412.19437) by DeepSeek-AI Team.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
|
||||
[DeepSeek-V3](https://huggingface.co/papers/2412.19437) is a Mixture-of-Experts (MoE) language model with 671B total parameters, 37B of which are activated per token. It employs Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. DeepSeek-V3 introduces an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective to enhance performance. Pre-trained on 14.8 trillion diverse tokens, the model undergoes Supervised Fine-Tuning and Reinforcement Learning. Evaluations show that DeepSeek-V3 outperforms other open-source models and matches leading closed-source models. Training requires 2.788M H800 GPU hours and is notably stable without any irrecoverable loss spikes.
|
||||
|
||||
## Limitations and call for contribution!
|
||||
|
||||
@ -34,7 +41,6 @@ We are super happy to make this code community-powered, and would love to see ho
|
||||
- static cache is not supported (this should be just a generation config issue / config shape issues)
|
||||
|
||||
### Usage tips
|
||||
|
||||
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
|
||||
|
||||
Run the model in `FP8` automatically; 2 nodes of 8 H100 GPUs should be more than enough.
|
||||
@ -53,7 +59,6 @@ chat = [
|
||||
{"role": "user", "content": "I'd like to show off how chat templating works!"},
|
||||
]
|
||||
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", dtype=torch.bfloat16)
|
||||
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
import time
|
||||
@ -197,3 +202,4 @@ error, it means NCCL was probably not loaded.
|
||||
|
||||
[[autodoc]] DeepseekV3ForTokenClassification
|
||||
- forward
|
||||
|
||||
|
@ -17,7 +17,6 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -25,14 +24,7 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# DeepseekVL
|
||||
|
||||
[Deepseek-VL](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages [LLaMA](./llama) as its text encoder, while [SigLip](./siglip) is used for encoding images.
|
||||
|
||||
You can find all the original Deepseek-VL checkpoints under the [DeepSeek-community](https://huggingface.co/deepseek-community) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Deepseek-VL models in the right sidebar for more examples of how to apply Deepseek-VL to different vision and language tasks.
|
||||
|
||||
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Deepseek-VL](https://huggingface.co/papers/2403.05525) is an open-source vision-language model optimized for real-world multimodal understanding. It employs a hybrid vision encoder capable of efficiently processing high-resolution images (1024×1024) while minimizing computational cost, enabling rich semantic and detail capture across diverse tasks. The model is trained on a large, diverse dataset that includes real-world content like web screenshots, PDFs, charts, and OCR data, with instruction tuning guided by a taxonomy of practical user scenarios. By integrating language model pretraining from the start to balance vision–language learning, DeepSeek-VL (available in 1.3B and 7B versions) achieves state-of-the-art performance on vision-language benchmarks while retaining strong language capabilities.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,167 +33,51 @@ The example below demonstrates how to generate text based on an image with [`Pip
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(
|
||||
task="image-text-to-text",
|
||||
model="deepseek-community/deepseek-vl-1.3b-chat",
|
||||
device=0,
|
||||
dtype=torch.float16
|
||||
)
|
||||
|
||||
pipeline = pipeline(task="image-text-to-text", model="deepseek-community/deepseek-vl-1.3b-chat", dtype="auto")
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
|
||||
},
|
||||
{ "type": "text", "text": "Describe this image."},
|
||||
]
|
||||
}
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]},
|
||||
]
|
||||
|
||||
pipe(text=messages, max_new_tokens=20, return_full_text=False)
|
||||
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor
|
||||
|
||||
model = DeepseekVLForConditionalGeneration.from_pretrained(
|
||||
"deepseek-community/deepseek-vl-1.3b-chat",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")
|
||||
model = AutoModelForImageTextToText.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat", dtype="auto")
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role":"user",
|
||||
"content":[
|
||||
{
|
||||
"type":"image",
|
||||
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
},
|
||||
{
|
||||
"type":"text",
|
||||
"text":"Describe this image."
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]},
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
).to(model.device, dtype=model.dtype)
|
||||
|
||||
generated_ids = model.generate(**inputs, max_new_tokens=128)
|
||||
generated_ids_trimmed = [
|
||||
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
||||
]
|
||||
output_text = processor.batch_decode(
|
||||
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
|
||||
)
|
||||
|
||||
print(output_text)
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=300,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
print(processor.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import TorchAoConfig, DeepseekVLForConditionalGeneration, AutoProcessor
|
||||
|
||||
quantization_config = TorchAoConfig(
|
||||
"int4_weight_only",
|
||||
group_size=128
|
||||
)
|
||||
|
||||
model = DeepseekVLForConditionalGeneration.from_pretrained(
|
||||
"deepseek-community/deepseek-vl-1.3b-chat",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
### Notes
|
||||
|
||||
- Do inference with multiple images in a single conversation.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor
|
||||
|
||||
model = DeepseekVLForConditionalGeneration.from_pretrained(
|
||||
"deepseek-community/deepseek-vl-1.3b-chat",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")
|
||||
|
||||
messages = [
|
||||
[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "What’s the difference between"},
|
||||
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
|
||||
{"type": "text", "text": " and "},
|
||||
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}
|
||||
]
|
||||
}
|
||||
],
|
||||
[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
|
||||
{"type": "text", "text": "What do you see in this image?"}
|
||||
]
|
||||
}
|
||||
]
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt=True,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
).to(model.device, dtype=model.dtype)
|
||||
|
||||
generated_ids = model.generate(**inputs, max_new_tokens=128)
|
||||
generated_ids_trimmed = [
|
||||
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
||||
]
|
||||
output_text = processor.batch_decode(
|
||||
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
)
|
||||
|
||||
print(output_text)
|
||||
```
|
||||
|
||||
## DeepseekVLConfig
|
||||
|
||||
[[autodoc]] DeepseekVLConfig
|
||||
|
@ -17,21 +17,13 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# DeepseekVLHybrid
|
||||
|
||||
[Deepseek-VL-Hybrid](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages [LLaMA](./llama) as its text encoder, while [SigLip](./siglip) is used for encoding low-resolution images and [SAM (Segment Anything Model)](./sam) is incorporated to handle high-resolution image encoding, enhancing the model's ability to process fine-grained visual details. Deepseek-VL-Hybrid is a variant of Deepseek-VL that uses [SAM (Segment Anything Model)](./sam) to handle high-resolution image encoding.
|
||||
|
||||
You can find all the original Deepseek-VL-Hybrid checkpoints under the [DeepSeek-community](https://huggingface.co/deepseek-community) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Deepseek-VL-Hybrid models in the right sidebar for more examples of how to apply Deepseek-VL-Hybrid to different vision and language tasks.
|
||||
|
||||
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Deepseek-VL-Hybrid](https://huggingface.co/papers/2403.05525) is a variant of Deepseek-VL, an open-source vision-language model optimized for real-world multimodal understanding. It employs a hybrid vision encoder that efficiently processes high-resolution images (1024×1024) while minimizing computational cost, enabling rich semantic and detail capture across diverse tasks; the hybrid variant incorporates [SAM (Segment Anything Model)](./sam) to handle high-resolution image encoding. The model is trained on a large, diverse dataset that includes real-world content like web screenshots, PDFs, charts, and OCR data, with instruction tuning guided by a taxonomy of practical user scenarios. By integrating language model pretraining from the start to balance vision–language learning, Deepseek-VL achieves state-of-the-art performance on vision-language benchmarks while retaining strong language capabilities.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -40,167 +32,51 @@ The example below demonstrates how to generate text based on an image with [`Pip
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(
|
||||
task="image-text-to-text",
|
||||
model="deepseek-community/deepseek-vl-7b-chat",
|
||||
device=0,
|
||||
dtype=torch.float16
|
||||
)
|
||||
|
||||
pipeline = pipeline(task="image-text-to-text", model="deepseek-community/deepseek-vl-1.3b-chat", dtype="auto")
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
|
||||
},
|
||||
{ "type": "text", "text": "Describe this image."},
|
||||
]
|
||||
}
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]},
|
||||
]
|
||||
|
||||
pipe(text=messages, max_new_tokens=20, return_full_text=False)
|
||||
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
|
||||
"deepseek-community/deepseek-vl-7b-chat",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
|
||||
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")
|
||||
model = AutoModelForImageTextToText.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat", dtype="auto")
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role":"user",
|
||||
"content":[
|
||||
{
|
||||
"type":"image",
|
||||
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
},
|
||||
{
|
||||
"type":"text",
|
||||
"text":"Describe this image."
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]},
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
).to(model.device, dtype=model.dtype)
|
||||
|
||||
generated_ids = model.generate(**inputs, max_new_tokens=128)
|
||||
generated_ids_trimmed = [
|
||||
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
||||
]
|
||||
output_text = processor.batch_decode(
|
||||
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
|
||||
)
|
||||
|
||||
print(output_text)
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=300,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
print(processor.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import TorchAoConfig, DeepseekVLHybridForConditionalGeneration, AutoProcessor
|
||||
|
||||
quantization_config = TorchAoConfig(
|
||||
"int4_weight_only",
|
||||
group_size=128
|
||||
)
|
||||
|
||||
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
|
||||
"deepseek-community/deepseek-vl-7b-chat",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
### Notes
|
||||
|
||||
- Do inference with multiple images in a single conversation.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor
|
||||
|
||||
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
|
||||
"deepseek-community/deepseek-vl-7b-chat",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
|
||||
|
||||
messages = [
|
||||
[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "What’s the difference between"},
|
||||
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
|
||||
{"type": "text", "text": " and "},
|
||||
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}
|
||||
]
|
||||
}
|
||||
],
|
||||
[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
|
||||
{"type": "text", "text": "What do you see in this image?"}
|
||||
]
|
||||
}
|
||||
]
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt=True,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
).to(model.device, dtype=model.dtype)
|
||||
|
||||
generated_ids = model.generate(**inputs, max_new_tokens=128)
|
||||
generated_ids_trimmed = [
|
||||
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
|
||||
]
|
||||
output_text = processor.batch_decode(
|
||||
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
)
|
||||
|
||||
print(output_text)
|
||||
```
|
||||
|
||||
## DeepseekVLHybridConfig
|
||||
|
||||
[[autodoc]] DeepseekVLHybridConfig
|
||||
|
@ -13,86 +13,55 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-10-08 and added to Hugging Face Transformers on 2022-09-14.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2020-10-08 and added to Hugging Face Transformers on 2022-09-14 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# Deformable DETR
|
||||
|
||||
[Deformable DETR](https://huggingface.co/papers/2010.04159) improves on the original [DETR](./detr) by using a deformable attention module. This mechanism selectively attends to a small set of key sampling points around a reference, which speeds up training and improves accuracy.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/deformable_detr_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Deformable DETR architecture. Taken from the <a href="https://huggingface.co/papers/2010.04159">original paper</a>.</small>
|
||||
|
||||
You can find all the available Deformable DETR checkpoints under the [SenseTime](https://huggingface.co/SenseTime) organization.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
>
|
||||
> Click on the Deformable DETR models in the right sidebar for more examples of how to apply Deformable DETR to different object detection and segmentation tasks.
|
||||
|
||||
The example below demonstrates how to perform object detection with the [`Pipeline`] and the [`AutoModel`] class.
|
||||
[Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://huggingface.co/papers/2010.04159) addresses the slow convergence and limited feature spatial resolution issues of DETR by introducing a deformable attention module. This module focuses on a small set of key sampling points around a reference, enhancing performance, particularly for small objects, and reducing training time by a factor of ten. Experiments on the COCO benchmark confirm the effectiveness of this approach.
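To make the sampling idea concrete, here is a toy sketch of deformable attention with random tensors. It is not the model's actual implementation; the real module predicts offsets and attention weights from the query features and operates over multiple feature levels:

```py
import torch
import torch.nn.functional as F

batch, num_queries, hidden_dim, num_points = 2, 100, 256, 4
features = torch.randn(batch, hidden_dim, 32, 32)            # a single feature map level
reference = torch.rand(batch, num_queries, 2) * 2 - 1        # reference points in [-1, 1]
offsets = torch.randn(batch, num_queries, num_points, 2) * 0.1
weights = torch.softmax(torch.randn(batch, num_queries, num_points), dim=-1)

# sample features at reference + offset locations, then take the attention-weighted sum
locations = (reference.unsqueeze(2) + offsets).clamp(-1, 1)          # (batch, queries, points, 2)
sampled = F.grid_sample(features, locations, align_corners=False)    # (batch, hidden, queries, points)
output = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)    # (batch, queries, hidden)
```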
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
from transformers import pipeline
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
"object-detection",
|
||||
model="SenseTime/deformable-detr",
|
||||
dtype=torch.float16,
|
||||
device_map=0
|
||||
)
|
||||
|
||||
pipeline("http://images.cocodataset.org/val2017/000000039769.jpg")
|
||||
pipeline = pipeline(task="object-detection", model="SenseTime/deformable-detr", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
from transformers import AutoImageProcessor, AutoModelForObjectDetection
|
||||
from PIL import Image
|
||||
import requests
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForObjectDetection
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
image_processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
|
||||
model = AutoModelForObjectDetection.from_pretrained("SenseTime/deformable-detr")
|
||||
model = AutoModelForObjectDetection.from_pretrained("SenseTime/deformable-detr", dtype="auto")
|
||||
|
||||
# prepare image for the model
|
||||
inputs = image_processor(images=image, return_tensors="pt")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)
|
||||
|
||||
for result in results:
|
||||
for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
|
||||
score, label = score.item(), label_id.item()
|
||||
box = [round(i, 2) for i in box.tolist()]
|
||||
print(f"{model.config.id2label[label]}: {score:.2f} {box}")
|
||||
outputs = model(**inputs)
|
||||
target_sizes = torch.tensor([image.size[::-1]])
|
||||
results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
|
||||
0
|
||||
]
|
||||
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
|
||||
box = [round(i, 2) for i in box.tolist()]
|
||||
print(
|
||||
f"Detected {model.config.id2label[label.item()]} with confidence "
|
||||
f"{round(score.item(), 3)} at location {box}"
|
||||
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Resources
|
||||
|
||||
- Refer to this set of [notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Deformable-DETR) for inference and fine-tuning [`DeformableDetrForObjectDetection`] on a custom dataset.
|
||||
|
||||
## DeformableDetrImageProcessor
|
||||
|
||||
[[autodoc]] DeformableDetrImageProcessor
|
||||
@ -118,3 +87,4 @@ for result in results:
|
||||
|
||||
[[autodoc]] DeformableDetrForObjectDetection
|
||||
- forward
|
||||
|
||||
|
@ -13,110 +13,56 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-12-23 and added to Hugging Face Transformers on 2021-04-13.*
|
||||
*This model was released on 2020-12-23 and added to Hugging Face Transformers on 2021-04-13 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# DeiT
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[DeiT](https://huggingface.co/papers/2012.12877) addresses the inefficiency of training visual transformers by developing a more data-efficient model. This model achieves competitive results on ImageNet without external data and with minimal computational resources, training on a single computer in less than 3 days. A key innovation is the introduction of a token-based distillation strategy, which enhances the student model's learning from a teacher model, particularly when the teacher is a convolutional neural network. This approach results in top-1 accuracy of up to 85.2% on ImageNet and strong performance on other tasks.
|
||||
|
||||
## Overview
|
||||
|
||||
The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://huggingface.co/papers/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
|
||||
Sablayrolles, Hervé Jégou. The [Vision Transformer (ViT)](vit) introduced in [Dosovitskiy et al., 2020](https://huggingface.co/papers/2010.11929) has shown that one can match or even outperform existing convolutional neural
|
||||
networks using a Transformer encoder (BERT-like). However, the ViT models introduced in that paper required training on
|
||||
expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) are more
|
||||
efficiently trained transformers for image classification, requiring far less data and far less computing resources
|
||||
compared to the original ViT models.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Recently, neural networks purely based on attention were shown to address image understanding tasks such as image
|
||||
classification. However, these visual transformers are pre-trained with hundreds of millions of images using an
|
||||
expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free
|
||||
transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision
|
||||
transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external
|
||||
data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation
|
||||
token ensuring that the student learns from the teacher through attention. We show the interest of this token-based
|
||||
distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets
|
||||
for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and
|
||||
models.*
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Compared to ViT, DeiT models use a so-called distillation token to effectively learn from a teacher (which, in the
|
||||
DeiT paper, is a ResNet like-model). The distillation token is learned through backpropagation, by interacting with
|
||||
the class ([CLS]) and patch tokens through the self-attention layers.
|
||||
- There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top
|
||||
of the final hidden state of the class token and not using the distillation signal, or (2) by placing both a
|
||||
prediction head on top of the class token and on top of the distillation token. In that case, the [CLS] prediction
|
||||
head is trained using regular cross-entropy between the prediction of the head and the ground-truth label, while the
|
||||
distillation prediction head is trained using hard distillation (cross-entropy between the prediction of the
|
||||
distillation head and the label predicted by the teacher). At inference time, one takes the average prediction
|
||||
between both heads as final prediction. (2) is also called "fine-tuning with distillation", because one relies on a
|
||||
teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds to
|
||||
[`DeiTForImageClassification`] and (2) corresponds to [`DeiTForImageClassificationWithTeacher`], as shown in the sketch after this list.
|
||||
- Note that the authors also did try soft distillation for (2) (in which case the distillation prediction head is
|
||||
trained using KL divergence to match the softmax output of the teacher), but hard distillation gave the best results.
|
||||
- All released checkpoints were pre-trained and fine-tuned on ImageNet-1k only. No external data was used. This is in
|
||||
contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
|
||||
pre-training.
|
||||
- The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into
|
||||
[`ViTModel`] or [`ViTForImageClassification`]. Techniques like data
|
||||
augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
|
||||
(while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes):
|
||||
*facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and
|
||||
*facebook/deit-base-patch16-384*. Note that one should use [`DeiTImageProcessor`] in order to
|
||||
prepare images for the model.
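Both fine-tuning setups map directly to two model classes. A minimal sketch, assuming the distilled ImageNet-1k checkpoint:

```py
from transformers import AutoImageProcessor, DeiTForImageClassification, DeiTForImageClassificationWithTeacher

checkpoint = "facebook/deit-base-distilled-patch16-224"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)

# (1) classic fine-tuning: a prediction head on the [CLS] token only
model = DeiTForImageClassification.from_pretrained(checkpoint)

# (2) fine-tuning with distillation: [CLS] head plus distillation head, averaged at inference
model_with_teacher = DeiTForImageClassificationWithTeacher.from_pretrained(checkpoint)
```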
### Using Scaled Dot Product Attention (SDPA)
|
||||
|
||||
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
|
||||
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
|
||||
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
|
||||
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
|
||||
page for more information.
|
||||
|
||||
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
|
||||
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
from transformers import DeiTForImageClassification
|
||||
model = DeiTForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
|
||||
...
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="image-classification", model="facebook/deit-base-distilled-patch16-224", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and `facebook/deit-base-distilled-patch16-224` model, we saw the following speedups during inference.
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
|
||||
| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
|
||||
|--------------|-------------------------------------------|-------------------------------------------|------------------------------|
|
||||
| 1 | 8 | 6 | 1.33 |
|
||||
| 2 | 9 | 6 | 1.5 |
|
||||
| 4 | 9 | 6 | 1.5 |
|
||||
| 8 | 8 | 6 | 1.33 |
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
## Resources
|
||||
image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
|
||||
model = AutoModelForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224", dtype="auto")
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DeiT.
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
<PipelineTag pipeline="image-classification"/>
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
- [`DeiTForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
- See also: [Image classification task guide](../tasks/image_classification)
|
||||
predicted_label = logits.argmax(-1).item()
|
||||
print(model.config.id2label[predicted_label])
|
||||
```
|
||||
|
||||
Besides that:
|
||||
|
||||
- [`DeiTForMaskedImageModeling`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## DeiTConfig
|
||||
|
||||
@ -151,3 +97,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] DeiTForImageClassificationWithTeacher
|
||||
- forward
|
||||
|
||||
|
@ -17,34 +17,20 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# DePlot
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[DePlot](https://huggingface.co/papers/2212.10505) presents a one-shot solution for visual language reasoning by decomposing the task into plot-to-text translation and reasoning over the translated text. The model, DePlot, translates images of plots or charts into linearized tables using a modality conversion module. This output is then used to prompt a pretrained large language model (LLM), leveraging the LLM's few-shot reasoning capabilities. DePlot is trained end-to-end on a standardized plot-to-table task and can be used with LLMs in a plug-and-play fashion. Compared to a state-of-the-art model fine-tuned on over 28,000 data points, DePlot combined with LLM achieves a 24.0% improvement on human-written queries in chart QA tasks.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pix2StructForConditionalGeneration">
|
||||
|
||||
DePlot was proposed in the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://huggingface.co/papers/2212.10505) from Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun.
|
||||
|
||||
The abstract of the paper states the following:
|
||||
|
||||
*Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still much limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named as DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than >28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.*
|
||||
|
||||
DePlot is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct).
|
||||
DePlot is a Visual Question Answering variant of the `Pix2Struct` architecture. It renders the input question on the image and predicts the answer.
|
||||
|
||||
## Usage example
|
||||
|
||||
Currently one checkpoint is available for DePlot:
|
||||
|
||||
- `google/deplot`: DePlot fine-tuned on ChartQA dataset
|
||||
|
||||
```python
|
||||
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
|
||||
|
||||
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")
|
||||
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("google/deplot")
|
||||
|
||||
url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
@ -53,19 +39,5 @@ predictions = model.generate(**inputs, max_new_tokens=512)
|
||||
print(processor.decode(predictions[0], skip_special_tokens=True))
|
||||
```
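The decoded output is a linearized table. DePlot's plug-and-play second step feeds that table to an LLM for reasoning; a minimal sketch, with `Qwen/Qwen2.5-0.5B-Instruct` as a placeholder reasoning model:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

table = processor.decode(predictions[0], skip_special_tokens=True)

# placeholder LLM; any instruction-tuned model works for the reasoning step
llm_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", dtype="auto")

prompt = f"Here is a table extracted from a chart:\n{table}\n\nQuestion: Which category has the highest value?\nAnswer:"
llm_inputs = llm_tokenizer(prompt, return_tensors="pt")
answer_ids = llm.generate(**llm_inputs, max_new_tokens=50)
print(llm_tokenizer.decode(answer_ids[0][llm_inputs.input_ids.shape[1]:], skip_special_tokens=True))
```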
## Fine-tuning
|
||||
|
||||
To fine-tune DePlot, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, fine-tuning with Adafactor and a cosine learning rate scheduler leads to faster convergence:
|
||||
|
||||
```python
|
||||
from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup
|
||||
|
||||
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)  # replace `model` with your Pix2Struct instance
|
||||
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct).
|
||||
|
||||
</Tip>
|
||||
</hfoption>
|
||||
</hfoptions>
|
@ -13,24 +13,10 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-01-19 and added to Hugging Face Transformers on 2024-01-25.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
*This model was released on 2024-01-19 and added to Hugging Face Transformers on 2024-01-25 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
# Depth Anything
|
||||
|
||||
[Depth Anything](https://huggingface.co/papers/2401.10891) is designed to be a foundation model for monocular depth estimation (MDE). It is jointly trained on labeled and ~62M unlabeled images to enhance the dataset. It uses a pretrained [DINOv2](./dinov2) model as an image encoder to inherit its existing rich semantic priors, and [DPT](./dpt) as the decoder. A teacher model is trained on unlabeled images to create pseudo-labels. The student model is trained on a combination of the pseudo-labels and labeled images. To improve the student model's performance, strong perturbations are added to the unlabeled images to challenge the student model to learn more visual knowledge from the image.
|
||||
|
||||
You can find all the original Depth Anything checkpoints under the [Depth Anything](https://huggingface.co/collections/LiheYoung/depth-anything-release-65b317de04eec72abf6b55aa) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Depth Anything models in the right sidebar for more examples of how to apply Depth Anything to different vision tasks.
|
||||
|
||||
The example below demonstrates how to obtain a depth map with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Depth Anything](https://huggingface.co/papers/2401.10891) is a robust monocular depth estimation model based on the DPT architecture. Trained on approximately 62 million images, it achieves state-of-the-art results in both relative and absolute depth estimation. The model leverages large-scale unlabeled data, enhanced by data augmentation and auxiliary supervision from pre-trained encoders, to improve generalization and robustness. Extensive zero-shot evaluations on six public datasets and random photos demonstrate its impressive capabilities, and fine-tuning with metric depth information from NYUv2 and KITTI sets new benchmarks. Additionally, the improved depth model enhances depth-conditioned ControlNet performance.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -39,14 +25,14 @@ The example below demonstrates how to obtain a depth map with [`Pipeline`] or th
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-base-hf", dtype=torch.bfloat16, device=0)
|
||||
pipe("http://images.cocodataset.org/val2017/000000039769.jpg")["depth"]
|
||||
pipeline = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-base-hf", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
import numpy as np
|
||||
@ -54,8 +40,8 @@ from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForDepthEstimation
|
||||
|
||||
image_processor = AutoImageProcessor.from_pretrained("LiheYoung/depth-anything-base-hf")
|
||||
model = AutoModelForDepthEstimation.from_pretrained("LiheYoung/depth-anything-base-hf", dtype=torch.bfloat16)
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
model = AutoModelForDepthEstimation.from_pretrained("LiheYoung/depth-anything-base-hf", dtype="auto")
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
inputs = image_processor(images=image, return_tensors="pt")
|
||||
|
||||
@ -75,10 +61,6 @@ Image.fromarray(depth.astype("uint8"))
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
|
||||
- [DepthAnythingV2](./depth_anything_v2), released in June 2024, uses the same architecture as Depth Anything and is compatible with all code examples and existing workflows. It uses synthetic data and a larger capacity teacher model to achieve much finer and robust depth predictions.
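As the note says, V2 checkpoints drop into the same code path; only the checkpoint name changes. A minimal sketch with the small V2 checkpoint:

```py
from transformers import pipeline

# identical usage to Depth Anything V1; only the checkpoint differs
pipeline = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```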
## DepthAnythingConfig
|
||||
|
||||
[[autodoc]] DepthAnythingConfig
|
||||
@ -87,3 +69,4 @@ Image.fromarray(depth.astype("uint8"))
|
||||
|
||||
[[autodoc]] DepthAnythingForDepthEstimation
|
||||
- forward
|
||||
|
||||
|
@ -17,91 +17,50 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Depth Anything V2
|
||||
|
||||
## Overview
|
||||
[Depth Anything V2](https://huggingface.co/papers/2406.09414) enhances monocular depth estimation by replacing real images with synthetic data, increasing the teacher model's capacity, and using large-scale pseudo-labeled real images to train student models. This results in finer and more robust depth predictions, offering efficiency and accuracy improvements over models based on Stable Diffusion. Available in various sizes (25M to 1.3B parameters), these models can be fine-tuned for metric depth tasks. The paper also introduces a new evaluation benchmark with precise annotations and diverse scenes to support future research.
|
||||
|
||||
Depth Anything V2 was introduced in [the paper of the same name](https://huggingface.co/papers/2406.09414) by Lihe Yang et al. It uses the same architecture as the original [Depth Anything model](depth_anything), but uses synthetic data and a larger capacity teacher model to achieve much finer and robust depth predictions.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The abstract from the paper is the following:
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
*This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_anything_overview.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Depth Anything overview. Taken from the <a href="https://huggingface.co/papers/2401.10891">original paper</a>.</small>
|
||||
|
||||
The Depth Anything models were contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/DepthAnything/Depth-Anything-V2).
|
||||
|
||||
## Usage example
|
||||
|
||||
There are 2 main ways to use Depth Anything V2: either using the pipeline API, which abstracts away all the complexity for you, or by using the `DepthAnythingForDepthEstimation` class yourself.
|
||||
|
||||
### Pipeline API
|
||||
|
||||
The pipeline allows to use the model in a few lines of code:
|
||||
|
||||
```python
|
||||
>>> from transformers import pipeline
|
||||
>>> from PIL import Image
|
||||
>>> import requests
|
||||
|
||||
>>> # load pipe
|
||||
>>> pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
|
||||
|
||||
>>> # load image
|
||||
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
>>> # inference
|
||||
>>> depth = pipe(image)["depth"]
|
||||
pipeline = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```

</hfoption>
<hfoption id="AutoModel">

### Using the model yourself

If you want to do the pre- and post-processing yourself, here's how to do that:

```python
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/Depth-Anything-V2-Small-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/Depth-Anything-V2-Small-hf")

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size and visualize the prediction
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
>>> depth = depth.detach().cpu().numpy() * 255
>>> depth = Image.fromarray(depth.astype("uint8"))
```

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Depth Anything.

- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation)
- [Depth Anything V2 demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-V2).
- A notebook showcasing inference with [`DepthAnythingForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Depth%20Anything/Predicting_depth_in_an_image_with_Depth_Anything.ipynb). 🌎
- [Core ML conversion of the `small` variant for use on Apple Silicon](https://huggingface.co/apple/coreml-depth-anything-v2-small).

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

</hfoption>
</hfoptions>

## DepthAnythingConfig

@@ -111,3 +70,4 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] DepthAnythingForDepthEstimation
    - forward

@@ -13,158 +13,56 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2024-10-02 and added to Hugging Face Transformers on 2025-02-10 and contributed by [geetu040](https://github.com/geetu040).*

# DepthPro

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[DepthPro](https://huggingface.co/papers/2410.02073) is a foundation model for zero-shot metric monocular depth estimation, generating high-resolution depth maps with sharpness and fine details. It uses a multi-scale Vision Transformer (ViT)-based architecture with a shared Dinov2 encoder and a DPT-like fusion stage for precise depth estimation. The model achieves metric accuracy without camera intrinsics and produces a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. Technical contributions include an efficient multi-scale vision transformer, a combined real and synthetic dataset training protocol, and state-of-the-art focal length estimation from a single image.

## Overview

The DepthPro model was proposed in [Depth Pro: Sharp Monocular Metric Depth in Less Than a Second](https://huggingface.co/papers/2410.02073) by Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun.

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

The abstract from the paper is the following:

*We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png"
alt="drawing" width="600"/>

<small> DepthPro Outputs. Taken from the <a href="https://github.com/apple/ml-depth-pro" target="_blank">official code</a>. </small>

This model was contributed by [geetu040](https://github.com/geetu040). The original code can be found [here](https://github.com/apple/ml-depth-pro).

## Usage Tips

The DepthPro model processes an input image by first downsampling it at multiple scales and splitting each scaled version into patches. These patches are then encoded using a shared Vision Transformer (ViT)-based Dinov2 patch encoder, while the full image is processed by a separate image encoder. The extracted patch features are merged into feature maps, upsampled, and fused using a DPT-like decoder to generate the final depth estimation. If enabled, an additional Field of View (FOV) encoder processes the image for estimating the camera's field of view, aiding in depth accuracy.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="depth-estimation", model="apple/DepthPro-hf", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```python
import torch
import requests
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

image_processor = AutoImageProcessor.from_pretrained("apple/DepthPro-hf")
model = AutoModelForDepthEstimation.from_pretrained("apple/DepthPro-hf", dtype="auto")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

post_processed_output = image_processor.post_process_depth_estimation(
    outputs,
    target_sizes=[(image.height, image.width)],
)
field_of_view = post_processed_output[0]["field_of_view"]
focal_length = post_processed_output[0]["focal_length"]
predicted_depth = post_processed_output[0]["predicted_depth"]
depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
depth = depth.detach().cpu().numpy() * 255
Image.fromarray(depth.astype("uint8"))
```

</hfoption>
</hfoptions>

### Architecture and Configuration

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_architecture.png"
alt="drawing" width="600"/>

<small> DepthPro architecture. Taken from the <a href="https://huggingface.co/papers/2410.02073" target="_blank">original paper</a>. </small>

The `DepthProForDepthEstimation` model uses a `DepthProEncoder` for encoding the input image and a `FeatureFusionStage` for fusing the output features from the encoder.

The `DepthProEncoder` further uses two encoders:

- `patch_encoder`
    - The input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
    - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
    - These patches are processed by the **`patch_encoder`**.
- `image_encoder`
    - The input image is also rescaled to `patch_size` and processed by the **`image_encoder`**.

Both these encoders can be configured via `patch_model_config` and `image_model_config` respectively, both of which are separate `Dinov2Model` by default.

Outputs from both encoders (`last_hidden_state`) and selected intermediate states (`hidden_states`) from **`patch_encoder`** are fused by a `DPT`-based `FeatureFusionStage` for depth estimation.
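
A rough sketch of how these pieces fit together is shown below. The values are illustrative (they are not guaranteed to match the pretrained checkpoint's configuration), but the parameter names are the ones described above.

```python
from transformers import DepthProConfig, DepthProForDepthEstimation

# Illustrative values: three scaled copies of the image, each split into
# overlapping patches of `patch_size` that the shared patch encoder processes,
# while the image encoder sees the full image rescaled to `patch_size`.
config = DepthProConfig(
    patch_size=384,
    scaled_images_ratios=[0.25, 0.5, 1.0],
    scaled_images_overlap_ratios=[0.0, 0.5, 0.25],
)
model = DepthProForDepthEstimation(config)  # randomly initialized weights
```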

### Field-of-View (FOV) Prediction

The network is supplemented with a focal length estimation head. A small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field-of-view.

The `use_fov_model` parameter in `DepthProConfig` controls whether **FOV prediction** is enabled. By default, it is set to `False` to conserve memory and computation. When enabled, the **FOV encoder** is instantiated based on the `fov_model_config` parameter, which defaults to a `Dinov2Model`. The `use_fov_model` parameter can also be passed when initializing the `DepthProForDepthEstimation` model.

The pretrained model at checkpoint `apple/DepthPro-hf` uses the FOV encoder. To use the pretrained model without the FOV encoder, set `use_fov_model=False` when loading the model, which saves computation.

```py
>>> from transformers import DepthProForDepthEstimation
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)
```

To instantiate a new model with FOV encoder, set `use_fov_model=True` in the config.

```py
>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig(use_fov_model=True)
>>> model = DepthProForDepthEstimation(config)
```

Or set `use_fov_model=True` when initializing the model, which overrides the value in config.

```py
>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig()
>>> model = DepthProForDepthEstimation(config, use_fov_model=True)
```

### Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the [official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information.

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

```py
import torch
from transformers import DepthProForDepthEstimation

model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", attn_implementation="sdpa", dtype=torch.float16)
```

For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).

On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and the `google/vit-base-patch16-224` model, we saw the following speedups during inference.

| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa mode | Speed up, Sdpa / Eager (x) |
|------------|-----------------------------------------|----------------------------------------|----------------------------|
| 1 | 7 | 6 | 1.17 |
| 2 | 8 | 6 | 1.33 |
| 4 | 8 | 6 | 1.33 |
| 8 | 8 | 6 | 1.33 |

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DepthPro:

- Research Paper: [Depth Pro: Sharp Monocular Metric Depth in Less Than a Second](https://huggingface.co/papers/2410.02073)
- Official Implementation: [apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)
- DepthPro Inference Notebook: [DepthPro Inference](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DepthPro_inference.ipynb)
- DepthPro for Super Resolution and Image Segmentation
    - Read blog on Medium: [Depth Pro: Beyond Depth](https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba)
    - Code on Github: [geetu040/depthpro-beyond-depth](https://github.com/geetu040/depthpro-beyond-depth)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## DepthProConfig

@@ -191,3 +89,4 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] DepthProForDepthEstimation
    - forward

@@ -13,49 +13,57 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2022-12-12 and added to Hugging Face Transformers on 2023-06-20 and contributed by [nielsr](https://huggingface.co/nielsr).*

> [!WARNING]
> This model is in maintenance mode only, we don't accept any new PRs changing its code. If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2. You can do so by running the following command: `pip install -U transformers==4.40.2`.

# DETA

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[DETA](https://huggingface.co/papers/2212.06137) enhances Deformable DETR by substituting the one-to-one bipartite Hungarian matching loss with one-to-many label assignments, a technique commonly used in traditional detectors with non-maximum suppression (NMS). This change results in a significant improvement of up to 2.5 mAP. The model achieves 50.2 COCO mAP within 12 epochs using a ResNet50 backbone, outperforming both traditional and transformer-based detectors in this setting. The study demonstrates that bipartite matching is not essential for effective detection transformers, attributing their success to the expressive transformer architecture.

## Overview

The DETA model was proposed in [NMS Strikes Back](https://huggingface.co/papers/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, and Philipp Krähenbühl. DETA (short for Detection Transformers with Assignment) improves [Deformable DETR](deformable_detr) by replacing the one-to-one bipartite Hungarian matching loss with one-to-many label assignments used in traditional detectors with non-maximum suppression (NMS). This leads to significant gains of up to 2.5 mAP.

The abstract from the paper is the following:

*Detection Transformer (DETR) directly transforms queries to unique objects by using one-to-one bipartite matching during training and enables end-to-end object detection. Recently, these models have surpassed traditional detectors on COCO with undeniable elegance. However, they differ from traditional detectors in multiple designs, including model architecture and training schedules, and thus the effectiveness of one-to-one matching is not fully understood. In this work, we conduct a strict comparison between the one-to-one Hungarian matching in DETRs and the one-to-many label assignments in traditional detectors with non-maximum supervision (NMS). Surprisingly, we observe one-to-many assignments with NMS consistently outperform standard one-to-one matching under the same setting, with a significant gain of up to 2.5 mAP. Our detector that trains Deformable-DETR with traditional IoU-based label assignment achieved 50.2 COCO mAP within 12 epochs (1x schedule) with ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting. On multiple datasets, schedules, and architectures, we consistently show bipartite matching is unnecessary for performant detection transformers. Furthermore, we attribute the success of detection transformers to their expressive transformer architecture.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/deta_architecture.jpg"
alt="drawing" width="600"/>

<small> DETA overview. Taken from the <a href="https://huggingface.co/papers/2212.06137">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/jozhang97/DETA).

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="object-detection", model="jozhang97/deta-swin-large", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("jozhang97/deta-swin-large")
model = AutoModelForObjectDetection.from_pretrained("jozhang97/deta-swin-large", dtype="auto")

inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```

</hfoption>
</hfoptions>

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DETA.

- Demo notebooks for DETA can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETA).
- Scripts for finetuning [`DetaForObjectDetection`] with [`Trainer`] or [Accelerate](https://huggingface.co/docs/accelerate/index) can be found [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/object-detection).
- See also: [Object detection task guide](../tasks/object_detection).

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## DetaConfig

@@ -76,3 +84,4 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] DetaForObjectDetection
    - forward

@@ -13,147 +13,55 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2020-05-26 and added to Hugging Face Transformers on 2021-06-09 and contributed by [nielsr](https://huggingface.co/nielsr).*

# DETR

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

[DETR](https://huggingface.co/papers/2005.12872) presents a novel method for object detection by framing it as a direct set prediction problem. This approach eliminates the need for hand-designed components such as non-maximum suppression and anchor generation. DETR uses a set-based global loss and a transformer encoder-decoder architecture to output predictions in parallel. It achieves accuracy and runtime performance comparable to Faster R-CNN on the COCO dataset and can be extended to panoptic segmentation with superior results.

The model consists of a convolutional backbone followed by an encoder-decoder Transformer which can be trained end-to-end for object detection. It greatly simplifies much of the complexity of models like Faster R-CNN and Mask R-CNN, which rely on region proposals, non-maximum suppression, and anchor generation. Moreover, DETR can be naturally extended to perform panoptic segmentation by simply adding a mask head on top of the decoder outputs.

You can find all the original DETR checkpoints under the [AI at Meta](https://huggingface.co/facebook/models?search=detr) organization.

> [!TIP]
> This model was contributed by [nielsr](https://huggingface.co/nielsr).
>
> Click on the DETR models in the right sidebar for more examples of how to apply DETR to different object detection and segmentation tasks.

The example below demonstrates how to perform object detection with the [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="object-detection", model="facebook/detr-resnet-50", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50", dtype="auto")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```

</hfoption>
</hfoptions>

<details>
<summary>How DETR works</summary>

Here's a TLDR explaining how [`~transformers.DetrForObjectDetection`] works:

First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use ResNet-50/ResNet-101). Let's assume we also add a batch dimension. This means that the input to the backbone is a tensor of shape `(batch_size, 3, height, width)`, assuming the image has 3 color channels (RGB). The CNN backbone outputs a new lower-resolution feature map, typically of shape `(batch_size, 2048, height/32, width/32)`. This is then projected to match the hidden dimension of the Transformer of DETR, which is `256` by default, using a `nn.Conv2D` layer. So now, we have a tensor of shape `(batch_size, 256, height/32, width/32)`. Next, the feature map is flattened and transposed to obtain a tensor of shape `(batch_size, seq_len, d_model)` = `(batch_size, width/32*height/32, 256)`. So a difference with NLP models is that the sequence length is actually longer than usual, but with a smaller `d_model` (which in NLP is typically 768 or higher).

Next, this is sent through the encoder, outputting `encoder_hidden_states` of the same shape (you can consider these as image features). Next, so-called **object queries** are sent through the decoder. This is a tensor of shape `(batch_size, num_queries, d_model)`, with `num_queries` typically set to 100 and initialized with zeros. These input embeddings are learnt positional encodings that the authors refer to as object queries, and similarly to the encoder, they are added to the input of each attention layer. Each object query will look for a particular object in the image. The decoder updates these embeddings through multiple self-attention and encoder-decoder attention layers to output `decoder_hidden_states` of the same shape: `(batch_size, num_queries, d_model)`. Next, two heads are added on top for object detection: a linear layer for classifying each object query into one of the objects or "no object", and a MLP to predict bounding boxes for each query.
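
A quick way to see these shapes in practice is to run a dummy image through the detection model. This is a hedged sketch: the 100 queries and the 92 class scores (91 COCO labels plus "no object") reflect the `facebook/detr-resnet-50` checkpoint, and the input tensor stands in for a preprocessed image.

```python
import torch
from transformers import DetrForObjectDetection

model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
pixel_values = torch.randn(1, 3, 800, 1066)  # stand-in for a preprocessed image

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

print(outputs.logits.shape)      # torch.Size([1, 100, 92]) -> class scores per object query
print(outputs.pred_boxes.shape)  # torch.Size([1, 100, 4])  -> normalized (cx, cy, w, h) per query
```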

The model is trained using a **bipartite matching loss**: so what we actually do is compare the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The [Hungarian matching algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) is used to find an optimal one-to-one mapping of each of the N queries to each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and [generalized IoU loss](https://giou.stanford.edu/) (for the bounding boxes) are used to optimize the parameters of the model.

DETR can be naturally extended to perform panoptic segmentation (which unifies semantic segmentation and instance segmentation). [`~transformers.DetrForSegmentation`] adds a segmentation mask head on top of [`~transformers.DetrForObjectDetection`]. The mask head can be trained either jointly, or in a two steps process, where one first trains a [`~transformers.DetrForObjectDetection`] model to detect bounding boxes around both "things" (instances) and "stuff" (background things like trees, roads, sky), then freeze all the weights and train only the mask head for 25 epochs. Experimentally, these two approaches give similar results. Note that predicting boxes is required for the training to be possible, since the Hungarian matching is computed using distances between boxes.

</details>

## Notes

- DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum number of objects that can be detected in a single image, and is set to 100 by default (see parameter `num_queries` of [`~transformers.DetrConfig`]). Note that it's good to have some slack (in COCO, the authors used 100, while the maximum number of objects in a COCO image is ~70).
- The decoder of DETR updates the query embeddings in parallel. This is different from language models like GPT-2, which use autoregressive decoding instead of parallel. Hence, no causal attention mask is used.
- DETR adds position embeddings to the hidden states at each self-attention and cross-attention layer before projecting to queries and keys. For the position embeddings of the image, one can choose between fixed sinusoidal or learned absolute position embeddings. By default, the parameter `position_embedding_type` of [`~transformers.DetrConfig`] is set to `"sine"`.
- During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `auxiliary_loss` of [`~transformers.DetrConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
- If you want to train the model in a distributed environment across multiple nodes, then one should update the *num_boxes* variable in the *DetrLoss* class of *modeling_detr.py*. When training on multiple nodes, this should be set to the average number of target boxes across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232).
- [`~transformers.DetrForObjectDetection`] and [`~transformers.DetrForSegmentation`] can be initialized with any convolutional backbone available in the [timm library](https://github.com/rwightman/pytorch-image-models). Initializing with a MobileNet backbone for example can be done by setting the `backbone` attribute of [`~transformers.DetrConfig`] to `"tf_mobilenetv3_small_075"`, and then initializing the model with that config (see the sketch after this list).
- DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use [`~transformers.DetrImageProcessor`] to prepare images (and optional annotations in COCO format) for the model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding. Alternatively, one can also define a custom `collate_fn` in order to batch images together, using [`~transformers.DetrImageProcessor.pad_and_create_pixel_mask`].
- The size of the images will determine the amount of memory being used, and will thus determine the `batch_size`. It is advised to use a batch size of 2 per GPU. See [this Github thread](https://github.com/facebookresearch/detr/issues/150) for more info.
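
The backbone swap mentioned in the notes above can be sketched as follows. `tf_mobilenetv3_small_075` is the timm checkpoint named in the note; the DETR head on top is randomly initialized, so the resulting model still needs training.

```python
from transformers import DetrConfig, DetrForObjectDetection

# Build DETR on a timm MobileNetV3 backbone instead of the default ResNet-50.
config = DetrConfig(backbone="tf_mobilenetv3_small_075", use_timm_backbone=True)
model = DetrForObjectDetection(config)
```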

There are three other ways to instantiate a DETR model (depending on what you prefer):

- Option 1: Instantiate DETR with pre-trained weights for the entire model

```python
from transformers import DetrForObjectDetection

model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
```

- Option 2: Instantiate DETR with randomly initialized weights for the Transformer, but pre-trained weights for the backbone

```python
from transformers import DetrConfig, DetrForObjectDetection

config = DetrConfig()
model = DetrForObjectDetection(config)
```

- Option 3: Instantiate DETR with randomly initialized weights for backbone + Transformer

```python
from transformers import DetrConfig, DetrForObjectDetection

config = DetrConfig(use_pretrained_backbone=False)
model = DetrForObjectDetection(config)
```

As a summary, consider the following table:

| Task | Object detection | Instance segmentation | Panoptic segmentation |
|------|------------------|-----------------------|-----------------------|
| **Description** | Predicting bounding boxes and class labels around objects in an image | Predicting masks around objects (i.e. instances) in an image | Predicting masks around both objects (i.e. instances) as well as "stuff" (i.e. background things like trees and roads) in an image |
| **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] |
| **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic |
| **Format of annotations to provide to** [`~transformers.DetrImageProcessor`] | {'image_id': `int`, 'annotations': `list[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `list[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `list[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `list[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
| **Postprocessing** (i.e. converting the output of the model to Pascal VOC format) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] |
| **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"`, `PanopticEvaluator` |

- In short, one should prepare the data either in COCO detection or COCO panoptic format, then use [`~transformers.DetrImageProcessor`] to create `pixel_values`, `pixel_mask` and optional `labels`, which can then be used to train (or fine-tune) a model (see the sketch after this list).
- For evaluation, one should first convert the outputs of the model using one of the postprocessing methods of [`~transformers.DetrImageProcessor`]. These can be provided to either `CocoEvaluator` or `PanopticEvaluator`, which allow you to calculate metrics like mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the [original repository](https://github.com/facebookresearch/detr). See the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for more info regarding evaluation.
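
A minimal sketch of that preparation step, using the COCO detection annotation format from the table above. The blank image and the single annotation are placeholders, not real data.

```python
from PIL import Image
from transformers import DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
image = Image.new("RGB", (640, 480))  # placeholder image

# One COCO-style object annotation: bbox in (x, y, width, height) format.
annotations = {
    "image_id": 0,
    "annotations": [
        {"bbox": [100.0, 120.0, 200.0, 150.0], "category_id": 17, "area": 30000.0, "iscrowd": 0},
    ],
}

encoding = processor(images=image, annotations=annotations, return_tensors="pt")
print(encoding.keys())  # typically pixel_values, pixel_mask, and labels
```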

## Resources

- Refer to these [notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for examples of fine-tuning [`DetrForObjectDetection`] and [`DetrForSegmentation`] on a custom dataset.

## DetrConfig

[[autodoc]] DetrConfig

@@ -198,3 +106,4 @@ As a summary, consider the following table:

[[autodoc]] DetrForSegmentation
    - forward

@@ -13,115 +13,55 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2025-04-21 and added to Hugging Face Transformers on 2025-06-26 and contributed by [buttercrab](https://huggingface.co/buttercrab) and [ArthurZ](https://huggingface.co/ArthurZ).*

# Dia

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

## Overview

[Dia](https://github.com/nari-labs/dia) is a 1.6B-parameter text-to-speech model from Nari Labs designed to generate natural, emotionally expressive dialogue, including non-verbal sounds like laughter and coughing. It uses an encoder-decoder transformer architecture enhanced with modern features such as rotational positional embeddings (RoPE). Text input is processed with a byte tokenizer, while audio is handled through a pretrained DAC codec that converts speech to and from discrete codebook tokens. This setup enables realistic voice synthesis with controllable tone and emotion via audio conditioning (voice cloning).

**Model Architecture:**
Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as rotational positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized while for the audio portion (decoder), a pretrained codec model [DAC](./dac) is used - DAC encodes speech into discrete codebook tokens and decodes them back into audio.

## Usage Tips

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-to-audio", model="nari-labs/Dia-1.6B-0626", dtype="auto")
output = pipeline("Plants create energy through a process known as photosynthesis.")
audio = output["audio"]
```

</hfoption>
<hfoption id="DiaForConditionalGeneration">

### Generation with Text

```python
from transformers import AutoProcessor, DiaForConditionalGeneration
from accelerate import Accelerator

torch_device = Accelerator().device
model_checkpoint = "nari-labs/Dia-1.6B-0626"

text = ["[S1] Dia is an open weights text to dialogue model."]
processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to around ~2s

# save audio to a file
outputs = processor.batch_decode(outputs)
processor.save_audio(outputs, "example.wav")
```

### Generation with Text and Audio (Voice Cloning)

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from accelerate import Accelerator

torch_device = Accelerator().device
model_checkpoint = "nari-labs/Dia-1.6B-0626"

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=44100))
audio = ds[-1]["audio"]["array"]
# text is a transcript of the audio + additional text you want as new audio
text = ["[S1] I know. It's going to save me a lot of money, I hope. [S2] I sure hope so for you."]

processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, audio=audio, padding=True, return_tensors="pt").to(torch_device)
prompt_len = processor.get_audio_prompt_len(inputs["decoder_attention_mask"])

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to around ~2s

# retrieve actually generated audio and save to a file
outputs = processor.batch_decode(outputs, audio_prompt_len=prompt_len)
processor.save_audio(outputs, "example_with_audio.wav")
```

### Training

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration
from accelerate import Accelerator

torch_device = Accelerator().device
model_checkpoint = "nari-labs/Dia-1.6B-0626"

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=44100))
audio = ds[-1]["audio"]["array"]
# text is a transcript of the audio
text = ["[S1] I know. It's going to save me a lot of money, I hope."]

processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(
    text=text,
    audio=audio,
    generation=False,
    output_labels=True,
    padding=True,
    return_tensors="pt"
).to(torch_device)

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
out = model(**inputs)
out.loss.backward()
```

This model was contributed by [Jaeyong Sung](https://huggingface.co/buttercrab), [Arthur Zucker](https://huggingface.co/ArthurZ), and [Anton Vlasjuk](https://huggingface.co/AntonV). The original code can be found [here](https://github.com/nari-labs/dia/).

</hfoption>
</hfoptions>

## DiaConfig

@@ -17,45 +17,37 @@ rendered properly in your Markdown viewer.

# DialoGPT

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[DialoGPT](https://huggingface.co/papers/1911.00536) is trained on 147M conversation-like exchanges from Reddit. It achieves human-like performance in single-turn dialogue settings, generating relevant, contentful, and context-consistent responses. The pre-trained model and training pipeline are publicly available for research and development in neural response generation and intelligent open-domain dialogue systems.

## Overview

DialoGPT was proposed in [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://huggingface.co/papers/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from Reddit.

The abstract from the paper is the following:

*We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.*

The original code can be found [here](https://github.com/microsoft/DialoGPT).

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="microsoft/DialoGPT-medium", dtype="auto")
pipeline("Plants create energy through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")

inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```

</hfoption>
</hfoptions>

## Usage tips

- DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on [DialoGPT's model card](https://huggingface.co/microsoft/DialoGPT-medium).

Training:

In order to train or fine-tune DialoGPT, one can use causal language modeling training. To cite the official paper: *We follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and frame the generation task as language modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the sequence length), ended by the end-of-text token.* For more information please refer to the original paper. A minimal sketch of this formatting is shown below.
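
The dialogue turns in this sketch are made up; the end-of-text token comes from the tokenizer itself.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")

# Concatenate the turns of one dialogue session, ending each turn with the end-of-text token.
turns = ["Does money buy happiness?", "Depends how much money you spend on it."]
training_text = "".join(turn + tokenizer.eos_token for turn in turns)
print(training_text)
# Does money buy happiness?<|endoftext|>Depends how much money you spend on it.<|endoftext|>
```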

<Tip>

DialoGPT's architecture is based on the GPT2 model, refer to [GPT2's documentation page](gpt2) for API reference and examples.

</Tip>

@@ -15,26 +15,45 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2024-10-07 and added to Hugging Face Transformers on 2025-01-07.*

# DiffLlama

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

[DiffLlama](https://huggingface.co/papers/2410.05258) integrates the Llama model with Differential Transformer's Attention mechanism. This differential attention calculates scores as the difference between two softmax attention maps, reducing noise and promoting sparse attention. Experiments demonstrate that DiffLlama outperforms traditional Transformer models in scaling, long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and activation outlier reduction. It enhances accuracy and robustness in in-context learning and reduces distractions from irrelevant context, improving performance in question answering and text summarization.

## Overview

The DiffLlama model was proposed in [Differential Transformer](https://huggingface.co/papers/2410.05258) by Kazuma Matsumoto. It combines the Llama model with the Differential Transformer's attention mechanism.

The abstract from the paper is the following:

*Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.*
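
The subtraction of two softmax attention maps can be sketched in a few lines of PyTorch. This is a toy illustration of the idea, not DiffLlama's actual attention module; the shapes and the `lam` weight are made up for the example.

```python
import torch
import torch.nn.functional as F

# Two attention maps from separate query/key projections, subtracted with a
# scalar weight lambda (a learnable parameter in the real model).
batch, heads, seq, dim = 1, 2, 4, 8
q1, q2 = torch.randn(2, batch, heads, seq, dim)
k1, k2 = torch.randn(2, batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, 2 * dim)
lam = 0.5

a1 = F.softmax(q1 @ k1.transpose(-1, -2) / dim**0.5, dim=-1)
a2 = F.softmax(q2 @ k2.transpose(-1, -2) / dim**0.5, dim=-1)
output = (a1 - lam * a2) @ v  # noise common to both maps cancels out
print(output.shape)  # torch.Size([1, 2, 4, 16])
```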

### Usage tips

The hyperparameters of this model are the same as those of the Llama model.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="kajuma/DiffLlama-0.3B-handcut", dtype="auto")
pipeline("植物は光合成と呼ばれる過程を通じてエネルギーを作り出します。")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kajuma/DiffLlama-0.3B-handcut", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("kajuma/DiffLlama-0.3B-handcut")

inputs = tokenizer("植物は光合成と呼ばれる過程を通じてエネルギーを作り出します。", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```

</hfoption>
</hfoptions>

## DiffLlamaConfig

@@ -64,3 +83,4 @@ The hyperparameters of this model is the same as Llama model.

[[autodoc]] DiffLlamaForTokenClassification
    - forward

@@ -13,74 +13,49 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2022-09-29 and added to Hugging Face Transformers on 2022-11-18 and contributed by [alihassanijr](https://huggingface.co/alihassanijr).*

# Dilated Neighborhood Attention Transformer

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[Dilated Neighborhood Attention Transformer](https://huggingface.co/papers/2209.15001) extends Neighborhood Attention (NA) by incorporating a Dilated Neighborhood Attention (DiNA) pattern, enhancing global context capture without additional computational cost. DiNAT combines local attention from NA with DiNA's sparse global attention, leading to significant performance improvements over models like NAT, Swin, and ConvNeXt. The large DiNAT variant achieves state-of-the-art results in various vision tasks, including COCO object detection, COCO instance segmentation, ADE20K semantic segmentation, and panoptic segmentation on both COCO and ADE20K datasets.

## Overview

DiNAT was proposed in [Dilated Neighborhood Attention Transformer](https://huggingface.co/papers/2209.15001) by Ali Hassani and Humphrey Shi. It extends [NAT](nat) by adding a Dilated Neighborhood Attention pattern to capture global context, and shows significant performance improvements over it.

The abstract from the paper is the following:

*Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ) and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).*
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dilated-neighborhood-attention-pattern.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
<small> Neighborhood Attention with different dilation values.
|
||||
Taken from the <a href="https://huggingface.co/papers/2209.15001">original paper</a>.</small>
|
||||
image_processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-mini-in1k-224")
|
||||
model = AutoModelForImageClassification.from_pretrained("shi-labs/dinat-mini-in1k-224", dtype="auto")
|
||||
|
||||
This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr).
|
||||
The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
## Usage tips
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
|
||||
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
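Here is a short sketch of inspecting both layouts. It assumes the `shi-labs/dinat-mini-in1k-224` checkpoint and a dummy 224x224 input; exact channel and spatial sizes depend on the stage and checkpoint, and NATTEN must be installed for the forward pass.

```py
import torch
from transformers import DinatModel

model = DinatModel.from_pretrained("shi-labs/dinat-mini-in1k-224")
pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch

with torch.no_grad():
    outputs = model(pixel_values, output_hidden_states=True)

# Same features, two layouts.
print(outputs.hidden_states[0].shape)           # (batch, height, width, num_channels)
print(outputs.reshaped_hidden_states[0].shape)  # (batch, num_channels, height, width)
```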

Notes:

- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention. Install it with pre-built wheels for Linux from [shi-labs.com/natten](https://shi-labs.com/natten), or build it on your system by running `pip install natten`. The latter takes time to compile. NATTEN does not support Windows yet.
- Only a patch size of 4 is supported at the moment.

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiNAT.

<PipelineTag pipeline="image-classification"/>

- [`DinatForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## DinatConfig

@@ -95,3 +70,4 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] DinatForImageClassification
- forward