Mirror of https://github.com/huggingface/transformers.git (synced 2025-10-23 10:54:36 +08:00)

Compare commits: main...model-docs (4 commits)

Commits: 0ecb993601, d1d5d4d758, dc570c7505, daf6069c48
.cursor/commands/style-guide.md (new file, 53 lines)

@@ -0,0 +1,53 @@
## Sentence structure

- Write short, declarative sentences most of the time.
- Vary sentence length to avoid sounding robotic. Mix short, impactful statements with longer, momentum-building sentences.
- Every time you use a comma, ask whether you can use a period instead.
- Avoid repeating the same words in a paragraph. Use synonyms or rephrase.

## Voice and tone

- Write like humans speak. Avoid corporate jargon and marketing fluff.
- Be confident and direct. Avoid softening phrases like "I think", "maybe", or "could".
- Use active voice instead of passive voice.
- Use positive phrasing - say what something *is* rather than what it *isn't*.
- Say "you" more than "we" when addressing external audiences.
- Use contractions like "I'll", "won't", and "can't" for a warmer tone.

## Specificity and evidence

- Be specific with facts and data instead of vague superlatives.
- Back up claims with concrete examples or metrics.
- Highlight customers and community members over company achievements.
- Use realistic, product-based examples instead of `foo/bar/baz` in code.
- Make content concrete, visual, and falsifiable.

## Title creation

- Make a promise in the title so readers know exactly what they'll get if they click.
- Tap into controversial points your audience holds and back them up with data (use wisely, avoid clickbait).
- Share something uniquely helpful that makes readers better at meaningful aspects of their lives.
- Avoid vague titles like "My Thoughts on XYZ". Titles should be opinions or shareable facts.
- Write placeholder titles first, complete the content, then spend time iterating on titles at the end.

## Banned phrases

- Avoid using "You can"

## Avoid LLM patterns

- Replace em dashes (—) with semicolons, commas, or sentence breaks.
- Avoid starting responses with "Great question!", "You're right!", or "Let me help you."
- Don't use phrases like "Let's dive into..."
- Skip cliché intros like "In today's fast-paced digital world" or "In the ever-evolving landscape of"
- Avoid phrases like "it's not just [x], it's [y]"
- Don't use high-school essay closers: "In conclusion,", "Overall,", or "To summarize"
- Avoid numbered lists in cases where bullets work better.
- Replace "In conclusion" with direct statements.
- Avoid hedge words: "might", "perhaps", "potentially" unless uncertainty is real.
- Don't stack hedging phrases: "may potentially", "it's important to note that".
- Don't create perfectly symmetrical paragraphs or lists that start with "Firstly... Secondly..."
- Avoid title-case headings: prefer sentence casing.
- Remove Unicode artifacts when copy-pasting: smart quotes (""), em-dashes, non-breaking spaces.
- Use straight quotes (') instead of smart quotes.
- Delete empty citation placeholders like "[1]" with no actual source

## Punctuation and formatting

- Use Oxford commas consistently
- Use exclamation points sparingly
- Sentences can start with "But" and "And", but don't overuse them
- Use periods instead of commas when possible for clarity
.github/workflows/benchmark.yml (31 lines changed)

@@ -1,10 +1,7 @@
name: Self-hosted runner (benchmark)
on:
push:
branches: [main]
pull_request:
types: [ opened, labeled, reopened, synchronize ]
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}

@@ -12,8 +9,6 @@ concurrency:
env:
HF_HOME: /mnt/cache
DATASET_ID: hf-benchmarks/transformers
MODEL_ID: meta-llama/Llama-3.1-8B-Instruct
jobs:
benchmark:

@@ -36,12 +31,26 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha || github.sha }}
- name: Install libpq-dev & psql
run: |
apt update
apt install -y libpq-dev postgresql-client
- name: Install benchmark script dependencies
run: python3 -m pip install -r benchmark_v2/requirements.txt kernels
run: python3 -m pip install -r benchmark/requirements.txt
- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e ".[torch]" && python3 -m pip uninstall -y torchvision # temp fix
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e ".[torch]"
- name: Run database init script
run: |
psql -f benchmark/utils/init_db.sql
env:
PGDATABASE: metrics
PGHOST: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGHOST }}
PGUSER: transformers_benchmarks
PGPASSWORD: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGPASSWORD }}
- name: Run benchmark
run: |

@@ -52,11 +61,13 @@ jobs:
commit_id=$GITHUB_SHA
fi
commit_msg=$(git show -s --format=%s | cut -c1-70)
python3 benchmark_v2/run_benchmarks.py -b 32 -s 128 -n 256 --branch-name "$BRANCH_NAME" --commit-id "$commit_id" --commit-message "$commit_msg" --model-id "$MODEL_ID" --log-level INFO --push-result-to-dataset "$DATASET_ID"
python3 benchmark/benchmarks_entrypoint.py "huggingface/transformers" "$BRANCH_NAME" "$commit_id" "$commit_msg"
env:
HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
PUSH_TO_HUB_TOKEN: ${{ secrets.PUSH_TO_HUB_TOKEN }}
# Enable this to see debug logs
# HF_HUB_VERBOSITY: debug
# TRANSFORMERS_VERBOSITY: debug
PGHOST: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGHOST }}
PGUSER: transformers_benchmarks
PGPASSWORD: ${{ secrets.TRANSFORMERS_BENCHMARKS_PGPASSWORD }}
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
.github/workflows/check_failed_tests.yml (70 lines changed)

@@ -41,14 +41,9 @@ env:
jobs:
check_new_failures:
name: "Find commits for new failing tests"
strategy:
matrix:
run_idx: [1]
name: " "
runs-on:
group: aws-g5-4xlarge-cache
outputs:
process: ${{ steps.check_file.outputs.process }}
container:
image: ${{ inputs.docker }}
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/

@@ -59,17 +54,14 @@ jobs:
path: /transformers/ci_results_${{ inputs.job }}
- name: Check file
id: check_file
working-directory: /transformers
run: |
if [ -f ci_results_${{ inputs.job }}/new_failures.json ]; then
echo "`ci_results_${{ inputs.job }}/new_failures.json` exists, continue ..."
echo "process=true" >> $GITHUB_ENV
echo "process=true" >> $GITHUB_OUTPUT
else
echo "`ci_results_${{ inputs.job }}/new_failures.json` doesn't exist, abort."
echo "process=false" >> $GITHUB_ENV
echo "process=false" >> $GITHUB_OUTPUT
fi
- uses: actions/download-artifact@v4

@@ -126,10 +118,6 @@ jobs:
run: |
python3 utils/print_env.py
- name: Install pytest-flakefinder
if: ${{ env.process == 'true' }}
run: python3 -m pip install pytest-flakefinder
- name: Show installed libraries and their versions
working-directory: /transformers
if: ${{ env.process == 'true' }}

@@ -138,63 +126,25 @@ jobs:
- name: Check failed tests
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit.json
- name: Show results
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: |
ls -l new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
cat new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
ls -l new_failures_with_bad_commit.json
cat new_failures_with_bad_commit.json
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}
path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
process_new_failures_with_commit_info:
name: "process bad commit reports"
needs: check_new_failures
if: needs.check_new_failures.outputs.process == 'true'
runs-on:
group: aws-g5-4xlarge-cache
container:
image: ${{ inputs.docker }}
options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- uses: actions/download-artifact@v4
with:
name: ci_results_${{ inputs.job }}
path: /transformers/ci_results_${{ inputs.job }}
- uses: actions/download-artifact@v4
with:
pattern: new_failures_with_bad_commit_${{ inputs.job }}*
path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}
merge-multiple: true
- name: Check files
- name: Checkout back
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: |
ls -la /transformers
ls -la /transformers/new_failures_with_bad_commit_${{ inputs.job }}
# Currently, we only run with a single runner by using `run_idx: [1]`. We might try to run with multiple runners
# to further reduce the false positive caused by flaky tests, which requires further processing to merge reports.
- name: Merge files
shell: bash
working-directory: /transformers
run: |
cp /transformers/new_failures_with_bad_commit_${{ inputs.job }}/new_failures_with_bad_commit_${{ inputs.job }}_1.json new_failures_with_bad_commit.json
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}
git checkout ${{ inputs.start_sha }}
- name: Process report
shell: bash
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}

@@ -206,6 +156,7 @@ jobs:
- name: Process report
shell: bash
working-directory: /transformers
if: ${{ env.process == 'true' }}
env:
ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}

@@ -220,12 +171,13 @@ jobs:
- name: Prepare Slack report title
working-directory: /transformers
if: ${{ env.process == 'true' }}
run: |
pip install slack_sdk
echo "title=$(python3 -c 'import sys; sys.path.append("utils"); from utils.notification_service import job_to_test_map; ci_event = "${{ inputs.ci_event }}"; job = "${{ inputs.job }}"; test_name = job_to_test_map[job]; title = f"New failed tests of {ci_event}" + ":" + f" {test_name}"; print(title)')" >> $GITHUB_ENV
- name: Send processed report
if: ${{ !endsWith(env.REPORT_TEXT, '{}') }}
if: ${{ env.process == 'true' && !endsWith(env.REPORT_TEXT, '{}') }}
uses: slackapi/slack-github-action@6c661ce58804a1a20f6dc5fbee7f0381b469e001
with:
# Slack channel id, channel name, or user id to post message.
@@ -98,7 +98,7 @@ jobs:
commit_sha: ${{ needs.get-pr-info.outputs.PR_HEAD_SHA }}
pr_number: ${{ needs.get-pr-number.outputs.PR_NUMBER }}
package: transformers
languages: ar de en es fr hi it ja ko pt zh
languages: ar de en es fr hi it ko pt tr zh ja te
update_run_status:
name: Update Check Run Status
.gitignore (4 lines changed)

@@ -98,7 +98,6 @@ celerybeat-schedule
# Environments
.env
.venv
.venv*
env/
venv/
ENV/

@@ -172,6 +171,3 @@ tags
# modular conversion
*.modular_backup
# Cursor IDE files
.cursor/
CONTRIBUTING.md (120 lines changed)

@@ -112,125 +112,7 @@ New models are constantly released and if you want to implement a new model, ple

If you are willing to contribute the model yourself, let us know so we can help you add it to 🤗 Transformers!

We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/modular_transformers).

### Vision-Language Model Contribution Checklist

If you're contributing a **vision-language model** (or any multimodal model that processes images/videos), please follow this checklist. Maintainers will use this to review your PR, and completing these steps will significantly increase the likelihood of your PR being merged quickly.

**Required checklist for all vision-language model contributions:**

☐ **1. Implement a modular file**

All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:

- Use the CLI command [`transformers add-new-model-like`](https://github.com/huggingface/transformers/blob/main/src/transformers/cli/add_new_model_like.py) to generate a modular skeleton and get started.
- Keep all code in the modular file if possible. The modeling code must be in it; ideally the configuration is as well.
- Reuse existing patterns from similar models as much as possible.

To verify your modular file is correct, run:

```bash
python utils/modular_model_converter.py <model_name>
```

This will generate the separate files (`modeling_*.py`, `configuration_*.py`, etc.) from your modular file. The CI will enforce that these generated files match your modular file.
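For orientation, a minimal modular file can be as small as the sketch below. The `MyVlm` name is a placeholder and the choice of Llama as the parent is only an example; in practice you inherit from whichever existing model is closest to yours.

```python
# modular_myvlm.py: illustrative sketch only; "MyVlm" is a placeholder model name
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaModel


class MyVlmModel(LlamaModel):
    # Inherit the parent implementation and only override what actually differs.
    # The converter expands this file into a full modeling_myvlm.py.
    pass


class MyVlmForCausalLM(LlamaForCausalLM):
    pass
```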
☐ **2. Add a fast image processor (for image models)**

If your model processes images, implement a fast image processor that uses `torch` and `torchvision` instead of PIL/numpy for better inference performance:

- See the detailed guide in [#36978](https://github.com/huggingface/transformers/issues/36978)
- Fast processors inherit from `BaseImageProcessorFast`
- Examples: `LlavaOnevisionImageProcessorFast`, `Idefics2ImageProcessorFast`
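To give a feel for the pattern, a fast image processor is mostly declarative: subclass `BaseImageProcessorFast` and set class attributes that drive the torch/torchvision transforms. The `MyVlm` name and the concrete values below are placeholders, not requirements.

```python
from transformers.image_processing_utils_fast import BaseImageProcessorFast
from transformers.image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD, PILImageResampling


class MyVlmImageProcessorFast(BaseImageProcessorFast):
    # Illustrative defaults; a real model uses whatever its checkpoints expect.
    resample = PILImageResampling.BICUBIC
    image_mean = IMAGENET_STANDARD_MEAN
    image_std = IMAGENET_STANDARD_STD
    size = {"height": 384, "width": 384}
    do_resize = True
    do_rescale = True
    do_normalize = True
```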
☐ **3. Create a weight conversion script**

Add a `convert_<model_name>_to_hf.py` script that converts the original model weights to the HuggingFace format:

- Script should handle checkpoint loading, key mapping, and saving in HF format
- Include usage examples and documentation in the script
- Examples: [`convert_llava_onevision_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py), [`convert_idefics2_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics2/convert_idefics2_weights_to_hf.py)
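The usual shape of such a script is: load the original checkpoint, rename keys, load the renamed weights into the new HF class, and call `save_pretrained`. The sketch below is a template only; the `MyVlm*` classes and the key mapping are hypothetical stand-ins for the ones your PR adds.

```python
# convert_myvlm_to_hf.py (illustrative template; MyVlm* classes and the key mapping are hypothetical)
import argparse

import torch

# Hypothetical classes standing in for the ones your PR adds.
from transformers import MyVlmConfig, MyVlmForConditionalGeneration

# Hypothetical prefix renames from the original checkpoint layout to the HF layout.
KEY_MAPPING = {
    "visual_encoder.": "model.vision_tower.",
    "llm.": "model.language_model.",
}


def convert_checkpoint(original_path: str, output_dir: str) -> None:
    state_dict = torch.load(original_path, map_location="cpu")
    converted = {}
    for key, value in state_dict.items():
        for old_prefix, new_prefix in KEY_MAPPING.items():
            if key.startswith(old_prefix):
                key = new_prefix + key[len(old_prefix):]
                break
        converted[key] = value

    # Instantiate the new HF model, load the remapped weights, and save in HF format.
    model = MyVlmForConditionalGeneration(MyVlmConfig())
    model.load_state_dict(converted, strict=True)
    model.save_pretrained(output_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--original_path", required=True)
    parser.add_argument("--output_dir", required=True)
    args = parser.parse_args()
    convert_checkpoint(args.original_path, args.output_dir)
```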
☐ **4. Add integration tests with exact output matching**

At minimum, add an `IntegrationTest` class that tests end-to-end generation (processing and modelling) with **exact** output matching:

- For generative models: test that generated text matches expected output exactly
- For non-generative models: test that output logits match expected values
- Tests should use real checkpoints (load in 4-bit or half precision if the checkpoint is too big to fit in our CI runners) and real inputs
- Example pattern:

```python
class MyModelIntegrationTest(unittest.TestCase):
    @slow
    def test_model_integration(self):
        model = MyModelForConditionalGeneration.from_pretrained("org/model-name")
        processor = AutoProcessor.from_pretrained("org/model-name")

        inputs = processor(images=image, text=prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=20)

        EXPECTED_TEXT = "exact expected output"
        self.assertEqual(processor.decode(output[0]), EXPECTED_TEXT)
```

See `tests/models/llava_onevision/test_modeling_llava_onevision.py` for complete examples.

☐ **5. Update documentation**

Add or update model documentation:

- Create `docs/source/en/model_doc/<model_name>.md` with usage examples (if the CLI hasn't already created it)
- Include model description, paper link, and basic usage with `Pipeline` and `AutoModel`
- Add the model to the appropriate TOC files

☐ **6. Look for reusable patterns**

The library has 400+ models with many established patterns:

- Search for similar models (e.g., other vision-language models)
- Reuse attention mechanisms, layer implementations, and processing patterns
- Check models like LLaVA, Idefics2, Fuyu for vision-language patterns
- Use the provided decorators (`auto_docstring`, `can_return_tuple`, `check_model_inputs`, and `_can_record_outputs`) where relevant
- Don't reinvent the wheel

☐ **7. Run quality checks and read the output**

Before submitting your PR, install quality dependencies and run the full check suite:

```bash
pip install -e ".[quality]"
make fixup
```

**Important**: Take time to read the output of `make fixup`. It will:

- Lint and format your code automatically
- Run consistency checks (imports, docstrings, etc.)
- Show any remaining issues that need manual fixes

All checks must pass before your PR can be merged.

**If this checklist is complete, your PR has a very high likelihood of being merged!** Following these steps makes the maintainers' work much easier and will reduce the number of review iterations, getting your important work out there faster.

#### Copy-pastable checklist for maintainers

Here's a condensed version maintainers can copy into PRs:

```markdown
## Multimodal Model Addition Checklist

Please ensure your PR completes all of the following items. See the [full checklist](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#vision-language-model-contribution-checklist) for details.

- [ ] **Modular file**: `modular_<model_name>.py` implemented and verified with `python utils/modular_model_converter.py <model_name>`
- [ ] **Fast image processor**: Implemented using `BaseImageProcessorFast` (see [#36978](https://github.com/huggingface/transformers/issues/36978))
- [ ] **Conversion script**: `convert_<model_name>_to_hf.py` added with usage examples
- [ ] **Integration tests**: End-to-end tests with exact output matching (text or logits)
- [ ] **Documentation**: Model docs added/updated in `docs/source/en/model_doc/`
- [ ] **Pattern reuse**: Verified against similar models (LLaVA, Idefics2, etc.)
- [ ] **Quality checks**: `make fixup` passes with no errors
```

We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/add_new_model).

## Do you want to add documentation?
@@ -16,6 +16,7 @@ import sys
from logging import Logger
from threading import Event, Thread
from time import perf_counter, sleep
from typing import Optional
# Add the parent directory to Python path to import benchmarks_entrypoint

@@ -41,7 +42,7 @@ except ImportError:
GenerationConfig = None
StaticCache = None
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "1"
# Only set torch precision if torch is available

@@ -144,7 +145,7 @@ def run_benchmark(
q = torch.empty_like(probs_sort).exponential_(1)
return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
def logits_to_probs(logits, temperature: float = 1.0, top_k: int | None = None):
def logits_to_probs(logits, temperature: float = 1.0, top_k: Optional[int] = None):
logits = logits / max(temperature, 1e-5)
if top_k is not None:

@@ -154,7 +155,7 @@ def run_benchmark(
probs = torch.nn.functional.softmax(logits, dim=-1)
return probs
def sample(logits, temperature: float = 1.0, top_k: int | None = None):
def sample(logits, temperature: float = 1.0, top_k: Optional[int] = None):
probs = logits_to_probs(logits[0, -1], temperature, top_k)
idx_next = multinomial_sample_one_no_sync(probs)
return idx_next, probs
@@ -2,5 +2,5 @@ gpustat==1.1.1
psutil==6.0.0
psycopg2==2.9.9
torch>=2.4.0
hf_xet
hf_transfer
pandas>=1.5.0
@@ -1,7 +1,7 @@
import hashlib
import json
import logging
from typing import Any
from typing import Any, Optional
KERNELIZATION_AVAILABLE = False

@@ -22,16 +22,16 @@ class BenchmarkConfig:
self,
warmup_iterations: int = 5,
measurement_iterations: int = 20,
gpu_monitoring: bool = True, # NOTE: you may want to disable this at times as we have obsvered it could heavily slow down benchmarks on AMD
gpu_monitoring: bool = False, # False by default because it slows down the benchmark by a lot
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
attn_implementation: str = "eager",
sdpa_backend: str | None = None,
compile_mode: str | None = None,
compile_options: dict[str, Any] | None = None,
sdpa_backend: Optional[str] = None,
compile_mode: Optional[str] = None,
compile_options: Optional[dict[str, Any]] = None,
kernelize: bool = False,
name: str | None = None,
name: Optional[str] = None,
skip_validity_check: bool = False,
) -> None:
# Benchmark parameters

@@ -104,7 +104,7 @@ class BenchmarkConfig:
"attn_implementation": self.attn_implementation,
"sdpa_backend": self.sdpa_backend,
"compile_mode": self.compile_mode,
"compile_options": self.compile_options | {}, # to avoid inplace modification of the original dict
"compile_options": self.compile_options,
"kernelize": self.kernelize,
}

@@ -128,15 +128,15 @@ class BenchmarkConfig:
def cross_generate_configs(
attn_impl_and_sdpa_backend: list[tuple[str, str | None]],
compiled_mode: list[str | None],
attn_impl_and_sdpa_backend: list[tuple[str, Optional[str]]],
compiled_mode: list[Optional[str]],
kernelized: list[bool],
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = True,
gpu_monitoring: bool = False, # this slows down the benchmark by a lot so we disable it by default
) -> list[BenchmarkConfig]:
# Create kwargs common to all configs
kwargs = {

@@ -169,7 +169,7 @@ def generate_all_configs(
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = True,
gpu_monitoring: bool = False,
) -> list[BenchmarkConfig]:
all_attn_implementations = [
("flash_attention_2", None),

@@ -191,24 +191,28 @@ def generate_all_configs(
)
def generate_main_configs(
def generate_default_configs(
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = False,
) -> list[BenchmarkConfig]:
# Create kwargs common to all configs
kwargs = {
"warmup_iterations": warmup_iterations,
"measurement_iterations": measurement_iterations,
"batch_size": batch_size,
"sequence_length": sequence_length,
"num_tokens_to_generate": num_tokens_to_generate,
}
return [ # TODO: test max-autotune instead of default
BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", gpu_monitoring=False, **kwargs),
BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", gpu_monitoring=True, **kwargs),
BenchmarkConfig(attn_implementation="eager", compile_mode="default", gpu_monitoring=True, **kwargs),
BenchmarkConfig(attn_implementation="flash_attention_2", gpu_monitoring=True, **kwargs),
all_attn_implementations = [
("flash_attention_2", None),
("eager", None),
("sdpa", "math"),
("sdpa", "flash_attention"), # note: this one can fail with compile because of attn mask
]
return cross_generate_configs(
attn_impl_and_sdpa_backend=all_attn_implementations,
compiled_mode=[None, "max-autotune"],
kernelized=[False, KERNELIZATION_AVAILABLE],
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
batch_size=batch_size,
sequence_length=sequence_length,
num_tokens_to_generate=num_tokens_to_generate,
gpu_monitoring=gpu_monitoring,
)
@@ -4,16 +4,13 @@ import logging
import os
import pathlib
import re
import tempfile
import time
from contextlib import nullcontext
from datetime import datetime
from queue import Queue
from typing import Any
from typing import Any, Optional
import torch
from datasets import Dataset
from huggingface_hub import HfApi
from tqdm import trange
from transformers import (

@@ -53,8 +50,6 @@ DEFAULT_PROMPT = "\n".join([
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
]) # fmt: skip
PUSH_TO_HUB_TOKEN = os.getenv("PUSH_TO_HUB_TOKEN", None)
def compact_json_numeric_arrays(data: dict):
# Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines

@@ -79,7 +74,7 @@ def get_git_revision() -> str:
return git_hash.readline().strip()
def get_sdpa_backend(backend_name: str | None) -> torch.nn.attention.SDPBackend | None:
def get_sdpa_backend(backend_name: Optional[str]) -> Optional[torch.nn.attention.SDPBackend]:
"""Get the SDPA backend enum from string name."""
if backend_name is None:
return None

@@ -125,19 +120,15 @@ def flush_memory():
class BenchmarkStreamer(BaseStreamer):
def __init__(self, **kwargs) -> None:
self.timeout = kwargs.pop("timeout", 10)
self.timestamps = []
self.text_queue = Queue()
self.stop_signal = None
def put(self, value):
"""Receives tokens and logs the timestamp of the generation."""
self.timestamps.append(time.perf_counter())
self.text_queue.put(value)
def end(self):
self.timestamps.append(time.perf_counter())
self.text_queue.put(self.stop_signal)
def __iter__(self):
return self

@@ -154,34 +145,25 @@ class BenchmarkRunner:
"""Main benchmark runner that coordinates benchmark execution."""
def __init__(
self,
logger: logging.Logger,
output_dir: str | None = None,
branch_name: str | None = None,
commit_id: str | None = None,
commit_message: str | None = None,
self, logger: logging.Logger, output_dir: str = "benchmark_results", commit_id: Optional[str] = None
) -> None:
# Those stay constant for the whole run
self.logger = logger
if output_dir is None:
output_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), "benchmark_results")
self.output_dir = output_dir
self.branch_name = branch_name
self.commit_id = get_git_revision() if commit_id is None else commit_id
self.commit_message = commit_message
os.makedirs(self.output_dir, exist_ok=True)
self.profile_dir = None
# Attributes that are reset for each model
self._setup_for = ""
# Attributes that are reset for each run
self.model: GenerationMixin | None = None
self.model: Optional[GenerationMixin] = None
def cleanup(self) -> None:
del self.model
self.model = None
flush_memory()
def setup_benchmark(self, model_id: str, config: BenchmarkConfig) -> None:
def setup_one_run(self, model_id: str, config: BenchmarkConfig) -> None:
# Some attributes only need to be set once per model
if self._setup_for != model_id:
self.tokenizer = AutoTokenizer.from_pretrained(model_id)

@@ -218,13 +200,10 @@ class BenchmarkRunner:
self.model = self.model.eval().to(config.device)
# Kernelize the model if needed
if config.kernelize and kernelize is not None and Mode is not None:
if config.kernelize:
self.model = kernelize(self.model, mode=Mode.INFERENCE)
def run_benchmark(
self, model_id: str, config: BenchmarkConfig, num_tokens_to_profile: int = 0
) -> dict[str, Any] | None:
"""Run a single benchmark with the given model ID and config."""
def run_one_benchmark(self, model_id: str, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> None:
sdpa_ctx = nullcontext()
if config.attn_implementation == "sdpa":
sdpa_backend = get_sdpa_backend(config.sdpa_backend)

@@ -235,7 +214,7 @@ class BenchmarkRunner:
# Quick validation: try one measurement first to see if this scenario works
flush_memory()
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=1, gpu_monitor=None
)
if e2e_latency < 0:

@@ -252,11 +231,11 @@ class BenchmarkRunner:
result = BenchmarkResult()
self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
for _ in trange(config.measurement_iterations):
e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics = self.time_generate(
e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=config.num_tokens_to_generate,
gpu_monitor=(GPUMonitor(logger=self.logger) if config.gpu_monitoring else None),
)
result.accumulate(e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics)
result.accumulate(e2e_latency, token_generation_times, decoded_output, gpu_metrics)
self.logger.info("Benchmarking done. Cleaning up.")
# Profile if needed

@@ -264,12 +243,7 @@ class BenchmarkRunner:
self.profile_generate(num_tokens_to_profile, config.name)
return {
"metadata": BenchmarkMetadata(
model_id=model_id,
branch_name=self.branch_name,
commit_id=self.commit_id,
commit_message=self.commit_message,
),
"metadata": BenchmarkMetadata(model_id=model_id, commit_id=self.commit_id),
"measurements": result,
"config": config,
}

@@ -277,8 +251,8 @@ class BenchmarkRunner:
def time_generate(
self,
max_new_tokens: int,
gpu_monitor: GPUMonitor | None = None,
) -> tuple[float, list[float], str, GPURawMetrics | None]:
gpu_monitor: Optional[GPUMonitor] = None,
) -> tuple[float, list[float], str, Optional[GPURawMetrics]]:
"""Time the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
# Prepare gpu monitoring if needed
if gpu_monitor is not None:

@@ -303,11 +277,10 @@ class BenchmarkRunner:
raise RuntimeError(f"Generated {new_tokens} tokens, expected {max_new_tokens}")
# Decode outputs
decoded_output = self.tokenizer.decode(outputs[0, input_tokens:], skip_special_tokens=True)
shape_and_decoded_output = f"{tuple(outputs.shape)} | {decoded_output}"
# Compute intermediate quantities
e2e_latency = wall_time_1 - wall_time_0
token_generation_times = [t - wall_time_0 for t in streamer.timestamps[1:]]
return e2e_latency, token_generation_times, shape_and_decoded_output, gpu_metrics
return e2e_latency, token_generation_times, decoded_output, gpu_metrics
def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
"""Profile the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""

@@ -331,8 +304,7 @@ class BenchmarkRunner:
benchmark_configs: list[BenchmarkConfig],
num_tokens_to_profile: int = 0,
pretty_print_summary: bool = True,
) -> tuple[str, dict[str, Any]]:
"""Run multiple benchmarks for the given model ID and list of benchmark configs."""
) -> dict[str, Any]:
all_results = {}
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
start_time = time.perf_counter()

@@ -351,14 +323,14 @@ class BenchmarkRunner:
continue
# Otherwise, run the benchmark
self.setup_benchmark(model_id, config)
self.setup_one_run(model_id, config)
self.logger.info(
f"Running benchmark of model {model_id} with scenario: {config.name} ({i + 1}/{n_configs})"
)
# Launch benchmark in a try/except block to avoid stopping the whole run if one benchmark fails
try:
results = self.run_benchmark(model_id, config, num_tokens_to_profile)
results = self.run_one_benchmark(model_id, config, num_tokens_to_profile)
if results is not None:
all_results[config.hash] = results

@@ -379,13 +351,13 @@ class BenchmarkRunner:
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
for result in all_results.values():
for value in all_results.values():
print("=" * 100)
print(f"Config: {result['config'].infer_name(compact=False)}\n")
result["measurements"].pprint(batch_size=result["config"].batch_size, tabs=1)
print(f"Config: {value['config'].infer_name(compact=False)}\n")
value["measurements"].pprint(tabs=1)
print("=" * 100)
return (timestamp, all_results)
return all_results
def save_results(self, model_name: str, results: dict, timestamp: str = "") -> str:
"""Save benchmark results to JSON file."""

@@ -414,43 +386,3 @@ class BenchmarkRunner:
self.logger.info(f"Results saved to {filepath}")
return filepath
def push_results_to_hub(self, dataset_id: str, results: dict[Any, Any], timestamp: str) -> None:
if PUSH_TO_HUB_TOKEN is None:
raise ValueError(
"PUSH_TO_HUB_TOKEN is not set, cannot push results to the Hub. When setting dataset_id, please also set the PUSH_TO_HUB_TOKEN environment variable."
)
n_results = len(results)
self.logger.info(f"Pushing {n_results} results to: {dataset_id}")
rows = []
for cfg_hash, entry in results.items():
row = {
"benchmark_config_hash": cfg_hash,
"config": entry["config"].to_dict(),
"measurements": entry["measurements"].to_dict(),
"metadata": entry["metadata"].to_dict(),
}
rows.append(row)
ds = Dataset.from_list(rows)
with tempfile.TemporaryDirectory() as tmp:
jsonl_path = os.path.join(tmp, "data.jsonl")
with open(jsonl_path, "w") as f:
json_lines = []
for ex in ds:
json_lines.append(json.dumps(ex, ensure_ascii=False))
f.write("\n".join(json_lines))
api = HfApi()
# NOTE: we expect the repository to already exist
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") if not timestamp else timestamp
file_name = f"benchmark_run_{timestamp}.jsonl"
api.upload_file(
path_or_fileobj=jsonl_path,
path_in_repo=file_name,
repo_id=dataset_id,
repo_type="dataset",
token=PUSH_TO_HUB_TOKEN,
)
self.logger.info(f"Succesfully uploaded results to: {dataset_id}")
@@ -1,6 +1,6 @@
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any
from datetime import datetime
from typing import Any, Optional, Union
import numpy as np

@@ -59,26 +59,19 @@ class BenchmarkMetadata:
model_id: str
timestamp: str
branch_name: str
commit_id: str
commit_message: str
hardware_info: HardwareInfo
def __init__(self, model_id: str, commit_id: str, branch_name: str = "main", commit_message: str = "") -> None:
def __init__(self, model_id: str, commit_id: str):
self.model_id = model_id
self.timestamp = datetime.now(timezone.utc).isoformat()
self.branch_name = branch_name
self.timestamp = datetime.utcnow().isoformat()
self.commit_id = commit_id
self.commit_message = commit_message
self.hardware_info = HardwareInfo()
def to_dict(self) -> dict[str, Any]:
return {
"model_id": self.model_id,
"timestamp": self.timestamp,
"branch_name": self.branch_name,
"commit_id": self.commit_id,
"commit_message": self.commit_message,
"hardware_info": self.hardware_info.to_dict(),
}

@@ -89,22 +82,22 @@ class BenchmarkResult:
def __init__(self) -> None:
self.e2e_latency = []
self.token_generation_times = [] # time at which each token was generated (relative to start of the generation)
self.shape_and_decoded_outputs = []
self.decoded_outputs = []
self.gpu_metrics = []
def accumulate(
self,
e2e_latency: float,
token_generation_times: list[float],
shape_and_decoded_output: str,
gpu_metrics: GPURawMetrics | None,
decoded_output: str,
gpu_metrics: Optional[GPURawMetrics],
) -> None:
self.e2e_latency.append(e2e_latency)
self.token_generation_times.append(token_generation_times)
self.shape_and_decoded_outputs.append(shape_and_decoded_output)
self.decoded_outputs.append(decoded_output)
self.gpu_metrics.append(gpu_metrics)
def to_dict(self) -> dict[str, None | int | float]:
def to_dict(self) -> dict[str, Union[None, int, float]]:
# Save GPU metrics as None if it contains only None values
if all(gm is None for gm in self.gpu_metrics):
gpu_metrics = None

@@ -113,12 +106,12 @@ class BenchmarkResult:
return {
"e2e_latency": self.e2e_latency,
"token_generation_times": self.token_generation_times,
"shape_and_decoded_outputs": self.shape_and_decoded_outputs,
"decoded_outputs": self.decoded_outputs,
"gpu_metrics": gpu_metrics,
}
@classmethod
def from_dict(cls, data: dict[str, None | int | float]) -> "BenchmarkResult":
def from_dict(cls, data: dict[str, Union[None, int, float]]) -> "BenchmarkResult":
# Handle GPU metrics, which is saved as None if it contains only None values
if data["gpu_metrics"] is None:
gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]

@@ -130,7 +123,7 @@ class BenchmarkResult:
new_instance.accumulate(
e2e_latency=data["e2e_latency"][i],
token_generation_times=data["token_generation_times"][i],
shape_and_decoded_output=data["shape_and_decoded_outputs"][i],
decoded_output=data["decoded_output"][i],
gpu_metrics=gpu_metrics[i],
)
return new_instance

@@ -141,27 +134,19 @@ class BenchmarkResult:
def get_measured_itl(self) -> list[float]:
return [(dt[-1] - dt[0]) / (len(dt) - 1) for dt in self.token_generation_times if len(dt) > 1]
def get_throughput(self, batch_size: int) -> float:
return [
batch_size * len(dt) / e2e_latency
for e2e_latency, dt in zip(self.e2e_latency, self.token_generation_times)
]
def pprint(self, batch_size: int = 0, tabs: int = 0) -> None:
stats_to_collate = [
add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
]
if batch_size > 0:
throughput_stats = compute_basic_statistics(self.get_throughput(batch_size))
stats_to_collate.append({key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()})
collated_stats = equalize_lengths_and_collate(stats_to_collate)
dict_to_pprint = {
"E2E Latency": collated_stats[0],
"Time to First Token": collated_stats[1],
"Inter-Token Latency": collated_stats[2],
}
if batch_size > 0:
dict_to_pprint["Throughput"] = collated_stats[3]
pretty_print_dict(dict_to_pprint, tabs=tabs)
def pprint(self, tabs: int = 0) -> None:
collated_stats = equalize_lengths_and_collate(
[
add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
]
)
pretty_print_dict(
{
"E2E Latency": collated_stats[0],
"Time to First Token": collated_stats[1],
"Inter-Token Latency": collated_stats[2],
},
tabs=tabs,
)
@@ -7,6 +7,7 @@ import time
from dataclasses import dataclass
from enum import Enum
from logging import Logger
from typing import Optional, Union
import gpustat
import psutil

@@ -41,7 +42,7 @@ class HardwareInfo:
self.cpu_count = psutil.cpu_count()
self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))
def to_dict(self) -> dict[str, None | int | float | str]:
def to_dict(self) -> dict[str, Union[None, int, float, str]]:
return {
"gpu_name": self.gpu_name,
"gpu_memory_total_gb": self.gpu_memory_total_gb,

@@ -108,7 +109,7 @@ class GPURawMetrics:
timestamp_0: float # in seconds
monitoring_status: GPUMonitoringStatus
def to_dict(self) -> dict[str, None | int | float | str]:
def to_dict(self) -> dict[str, Union[None, int, float, str]]:
return {
"utilization": self.utilization,
"memory_used": self.memory_used,

@@ -122,7 +123,7 @@ class GPURawMetrics:
class GPUMonitor:
"""Monitor GPU utilization during benchmark execution."""
def __init__(self, sample_interval_sec: float = 0.1, logger: Logger | None = None):
def __init__(self, sample_interval_sec: float = 0.1, logger: Optional[Logger] = None):
self.sample_interval_sec = sample_interval_sec
self.logger = logger if logger is not None else logging.getLogger(__name__)
@@ -4,4 +4,4 @@ gpustat>=1.0.0
torch>=2.0.0
transformers>=4.30.0
datasets>=2.10.0
huggingface_hub>=0.16.0
huggingface_hub>=0.16.0
@@ -20,43 +20,31 @@ in the ./benches directory, organizing outputs into model-specific subfolders.
import argparse
import logging
import random
import sys
import uuid
from framework.benchmark_config import BenchmarkConfig, generate_all_configs, generate_main_configs
from framework.benchmark_config import BenchmarkConfig, generate_all_configs
from framework.benchmark_runner import BenchmarkRunner
if __name__ == "__main__":
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", type=str, default=None, help="Output dir for benchmark results")
parser.add_argument("--output-dir", type=str, default="benchmark_results", help="Output dir for benchmark results")
parser.add_argument("--log-level", type=str, choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO")
parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
parser.add_argument("--warmup", "-w", type=int, default=3, help="Number of warmup iterations")
parser.add_argument("--iterations", "-i", type=int, default=10, help="Number of measurement iterations")
parser.add_argument("--warmup", type=int, default=5, help="Number of warmup iterations")
parser.add_argument("--iterations", type=int, default=20, help="Number of measurement iterations")
parser.add_argument("--batch-size", "-b", type=int, nargs="+", help="Batch size")
parser.add_argument("--sequence-length", "-s", type=int, nargs="+", help="Sequence length")
parser.add_argument("--num-tokens-to-generate", "-n", type=int, nargs="+", help="Number of tokens to generate")
parser.add_argument("--cross-generate", action="store_true", help="Cross-generate all combinations of configs")
parser.add_argument("--num-tokens-to-profile", "-p", type=int, default=0, help="Number of tokens to profile")
parser.add_argument("--branch-name", type=str, help="Git branch name")
parser.add_argument("--commit-id", type=str, help="Git commit ID (if not provided, will auto-detect from git)")
parser.add_argument("--commit-message", type=str, help="Git commit message")
parser.add_argument(
"--no-gpu-monitoring", action="store_true", help="Disables GPU monitoring during benchmark runs"
)
parser.add_argument(
"--push-result-to-dataset",
type=str,
default=None,
help="Name of the dataset to push results to. If not provided, results are not pushed to the Hub.",
)
args = parser.parse_args()
# Setup logging

@@ -81,62 +69,43 @@ if __name__ == "__main__":
# If there is only one (batch_size, sequence_length, num_tokens_to_generate), we benchmark across configs
elif len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 1:
if args.cross_generate:
benchmark_configs = generate_all_configs(
warmup_iterations=args.warmup,
measurement_iterations=args.iterations,
batch_size=args.batch_size[0],
sequence_length=args.sequence_length[0],
num_tokens_to_generate=args.num_tokens_to_generate[0],
gpu_monitoring=not args.no_gpu_monitoring,
)
else:
benchmark_configs = generate_main_configs(
warmup_iterations=args.warmup,
measurement_iterations=args.iterations,
batch_size=args.batch_size[0],
sequence_length=args.sequence_length[0],
num_tokens_to_generate=args.num_tokens_to_generate[0],
)
# Otherwise, we benchmark across all combinations of dimensions
else:
main_config = generate_main_configs(
benchmark_configs = generate_all_configs(
warmup_iterations=args.warmup,
measurement_iterations=args.iterations,
batch_size=args.batch_size[0],
sequence_length=args.sequence_length[0],
num_tokens_to_generate=args.num_tokens_to_generate[0],
)[0]
)
random.shuffle(benchmark_configs)
# Otherwise, we benchmark across all combinations of dimensions
else:
kwargs = {
"warmup_iterations": args.warmup,
"measurement_iterations": args.iterations,
"gpu_monitoring": False,
"batch_size": args.batch_size[0],
"sequence_length": args.sequence_length[0],
"num_tokens_to_generate": args.num_tokens_to_generate[0],
"attn_implementation": "flex_attention",
"sdpa_backend": None,
"compile_mode": "default",
"kernelize": False,
}
benchmark_configs = []
for num_tokens_to_generate in args.num_tokens_to_generate:
for sequence_length in args.sequence_length:
for batch_size in args.batch_size:
cfg_dict = main_config.to_dict()
cfg_dict["batch_size"] = batch_size
cfg_dict["sequence_length"] = sequence_length
cfg_dict["num_tokens_to_generate"] = num_tokens_to_generate
cfg_dict.pop("name")
benchmark_configs.append(BenchmarkConfig.from_dict(cfg_dict))
kwargs["batch_size"] = batch_size
kwargs["sequence_length"] = sequence_length
kwargs["num_tokens_to_generate"] = num_tokens_to_generate
benchmark_configs.append(BenchmarkConfig(**kwargs))
runner = BenchmarkRunner(
logger,
args.output_dir,
args.branch_name,
args.commit_id,
args.commit_message,
)
timestamp, results = runner.run_benchmarks(
runner = BenchmarkRunner(logger, args.output_dir, args.commit_id)
results = runner.run_benchmarks(
args.model_id,
benchmark_configs,
benchmark_configs[:3],
args.num_tokens_to_profile,
pretty_print_summary=True,
)
dataset_id = args.push_result_to_dataset
if dataset_id is not None and len(results) > 0:
runner.push_results_to_hub(
dataset_id,
results,
timestamp,
)
# runner.save_results(args.model_id, results)
@@ -58,6 +58,7 @@ NOT_DEVICE_TESTS = {
"test_model_get_set_embeddings",
"test_model_main_input_name",
"test_correct_missing_keys",
"test_tie_model_weights",
"test_can_use_safetensors",
"test_load_save_without_tied_weights",
"test_tied_weights_keys",
@@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git-lfs ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec<0.8' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]" seqeval albumentations jiwer

@@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git pkg-config openssh-client git ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec<0.8' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]"

@@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git-lfs ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec<0.8' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing,tiktoken,num2words,video]"
@@ -24,8 +24,7 @@ RUN git clone https://github.com/huggingface/transformers && cd transformers &&
# 1. Put several commands in a single `RUN` to avoid image/layer exporting issue. Could be revised in the future.
# 2. Regarding `torch` part, We might need to specify proper versions for `torchvision` and `torchaudio`.
# Currently, let's not bother to specify their versions explicitly (so installed with their latest release versions).
# 3. For `torchcodec<0.8`: this is quickly added as torch 2.9.0 + torchcodec 0.8.0 fails on our CI env. Need to remove later once they work.
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio "torchcodec<0.8" --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
RUN python3 -m pip install --no-cache-dir -U timm
@ -3,10 +3,11 @@ LABEL maintainer="Hugging Face"
|
||||
|
||||
SHELL ["/bin/bash", "-c"]
|
||||
|
||||
ARG PYTHON_VER=3.12
|
||||
ARG PYTHON_VER=3.11
|
||||
ENV TORCH_DEVICE_BACKEND_AUTOLOAD=0
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
|
||||
RUN apt-get remove -y python3.10 && apt-get autoremove -y
|
||||
RUN apt-get update && \
|
||||
apt-get install -y software-properties-common && \
|
||||
add-apt-repository -y ppa:deadsnakes/ppa && \
|
||||
@ -22,6 +23,7 @@ RUN apt-get update && \
|
||||
apt-utils \
|
||||
build-essential \
|
||||
ca-certificates \
|
||||
clinfo \
|
||||
curl \
|
||||
git \
|
||||
git-lfs \
|
||||
@ -33,6 +35,7 @@ RUN apt-get update && \
|
||||
rsync \
|
||||
sudo \
|
||||
libnl-genl-3-200 \
|
||||
xpu-smi \
|
||||
unzip \
|
||||
ffmpeg \
|
||||
tesseract-ocr \
|
||||
@ -42,47 +45,34 @@ RUN apt-get update && \
|
||||
apt-get clean && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
|
||||
RUN apt-get update && \
|
||||
apt-get install -y \
|
||||
linux-headers-$(uname -r) linux-modules-extra-$(uname -r) \
|
||||
linux-headers-$(uname -r) \
|
||||
linux-modules-extra-$(uname -r) \
|
||||
flex bison \
|
||||
intel-fw-gpu intel-i915-dkms xpu-smi intel-ocloc clinfo\
|
||||
intel-fw-gpu intel-i915-dkms xpu-smi \
|
||||
intel-opencl-icd libze-intel-gpu1 libze1 \
|
||||
intel-media-va-driver-non-free libmfx-gen1 libvpl2 \
|
||||
libegl-mesa0 libegl1 libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
|
||||
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
|
||||
libglapi-mesa libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
|
||||
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo \
|
||||
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo intel-ocloc \
|
||||
libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev && \
|
||||
apt-get clean && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Use a virtual env because Ubuntu 24 does not allow pip installs into the system Python
|
||||
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
|
||||
ENV PATH="/root/.local/bin:$PATH"
|
||||
ENV VIRTUAL_ENV="/opt/venv"
|
||||
ENV UV_PYTHON_INSTALL_DIR=/opt/uv/python
|
||||
RUN uv venv --python ${PYTHON_VER} --seed ${VIRTUAL_ENV}
|
||||
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
|
||||
RUN pip install --upgrade pip
|
||||
RUN pip install triton==3.3.0
|
||||
|
||||
RUN pip install --upgrade pip wheel
|
||||
RUN pip install triton==3.4.0
|
||||
RUN pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/xpu --no-cache-dir
|
||||
|
||||
RUN pip install torch==2.8.0+xpu torchvision==0.23.0+xpu torchaudio==2.8.0+xpu --index-url https://download.pytorch.org/whl/xpu --no-cache-dir
|
||||
RUN pip install evaluate torchdata pyctcdecode pytesseract decord galore-torch fire scipy scikit-learn sentencepiece sacremoses nltk rouge_score librosa soundfile g2p_en mpi4py requests_mock
|
||||
RUN pip install pretty_midi essentia resampy Levenshtein av sacrebleu phonemizer invisible_watermark schedulefree
|
||||
RUN pip install gguf hqq compressed_tensors gptqmodel mergekit autoawq deepspeed torchao onnx
|
||||
RUN pip install hf_transfer huggingface-hub hf-doc-builder datasets optimum-quanto timm transformers accelerate optimum peft
|
||||
|
||||
RUN pip install torchcodec torchdata --no-cache-dir
|
||||
|
||||
RUN pip install evaluate pyctcdecode pytesseract decord galore-torch fire scipy scikit-learn sentencepiece sacremoses nltk rouge_score librosa soundfile g2p_en mpi4py requests_mock
|
||||
RUN pip install pretty_midi essentia resampy Levenshtein av sacrebleu phonemizer invisible_watermark schedulefree setuptools
|
||||
RUN pip install gptqmodel --no-build-isolation
|
||||
RUN pip install gguf hqq compressed_tensors autoawq deepspeed torchao onnx auto_round
|
||||
RUN pip install hf_transfer huggingface-hub hf-doc-builder datasets optimum-quanto timm transformers accelerate optimum peft diffusers trl kernels
|
||||
|
||||
# install liger-kernel
|
||||
RUN pip install git+https://github.com/linkedin/Liger-Kernel.git --extra-index-url https://download.pytorch.org/whl/test/xpu
|
||||
|
||||
# install mergekit
|
||||
RUN pip install --break-system-packages git+https://github.com/arcee-ai/mergekit.git@v0.1.3
|
||||
|
||||
# install bitsandbytes
|
||||
RUN pip install git+https://github.com/bitsandbytes-foundation/bitsandbytes.git
|
||||
|
||||
|
@ -123,6 +123,8 @@
|
||||
title: تشغيل التدريب على Amazon SageMaker
|
||||
- local: serialization
|
||||
title: التصدير إلى ONNX
|
||||
- local: torchscript
|
||||
title: التصدير إلى TorchScript
|
||||
- local: notebooks
|
||||
title: دفاتر الملاحظات مع الأمثلة
|
||||
- local: community
|
||||
|
@ -32,7 +32,7 @@
|
||||
لتصدير نموذج 🤗 Transformers إلى ONNX، قم أولاً بتثبيت اعتماد إضافي:
|
||||
|
||||
```bash
|
||||
pip install optimum-onnx
|
||||
pip install optimum[exporters]
|
||||
```
|
||||
|
||||
للاطلاع على جميع المعامﻻت المتاحة، يرجى الرجوع إلى [وثائق 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli)، أو عرض المساعدة في سطر الأوامر:
|
||||
@ -111,3 +111,60 @@ optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_s
|
||||
### تصدير نموذج لهندسة غير مدعومة
|
||||
|
||||
إذا كنت ترغب في المساهمة من خلال إضافة دعم لنموذج لا يُمكن تصديره حاليًا، فيجب عليك أولاً التحقق مما إذا كان مدعومًا في [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview)، وإذا لم يكن مدعومًا، [فيمكنك المساهمة في 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) مُباشرةً.
|
||||
|
||||
### تصدير نموذج باستخدام `transformers.onnx`
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
لم يعد يتم دعم `transformers.onnx` يُرجى تصدير النماذج باستخدام 🤗 Optimum كما هو موضح أعلاه. سيتم إزالة هذا القسم في الإصدارات القادمة.
|
||||
|
||||
</Tip>
|
||||
|
||||
لتصدير نموذج 🤗 Transformers إلى ONNX باستخدام `transformers.onnx`، ثبّت التبعيات الإضافية:
|
||||
|
||||
```bash
|
||||
pip install transformers[onnx]
|
||||
```
|
||||
|
||||
استخدم حزمة `transformers.onnx` كنموذج Python لتصدير نقطة حفظ باستخدام تكوين جاهز:
|
||||
|
||||
```bash
|
||||
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
|
||||
```
|
||||
|
||||
يُصدّر هذا رسمًا بيانيًا ONNX لنقطة الحفظ المُحددة بواسطة وسيطة `--model`. مرر أي نقطة حفظ على 🤗 Hub أو نقطة حفظ مُخزنة محليًا.
|
||||
يُمكن بعد ذلك تشغيل ملف `model.onnx` الناتج على أحد المُسرعات العديدة التي تدعم معيار ONNX. على سبيل المثال، قم بتحميل وتشغيل النموذج باستخدام ONNX Runtime كما يلي:
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoTokenizer
|
||||
>>> from onnxruntime import InferenceSession
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
|
||||
>>> session = InferenceSession("onnx/model.onnx")
|
||||
>>> # يتوقع ONNX Runtime مصفوفات NumPy كمدخلات
|
||||
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
|
||||
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
|
||||
```
|
||||
|
||||
يُمكن الحصول على أسماء المخرجات المطلوبة (مثل `["last_hidden_state"]`) من خلال إلقاء نظرة على تكوين ONNX لكل نموذج. على سبيل المثال، بالنسبة لـ DistilBERT، لدينا:
|
||||
|
||||
```python
|
||||
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
|
||||
|
||||
>>> config = DistilBertConfig()
|
||||
>>> onnx_config = DistilBertOnnxConfig(config)
|
||||
>>> print(list(onnx_config.outputs.keys()))
|
||||
["last_hidden_state"]
|
||||
```
|
||||
|
||||
العمليات مُتطابقة لنقاط الحفظ TensorFlow على Hub. على سبيل المثال، صدّر نقطة حفظ TensorFlow خالصة كما يلي:
|
||||
|
||||
```bash
|
||||
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
|
||||
```
|
||||
|
||||
لتصدير نموذج مُخزن محليًا، احفظ أوزان النموذج ومجزىء اللغوى في نفس الدليل (على سبيل المثال `local-pt-checkpoint`)، ثم قم بتصديره إلى ONNX عن طريق توجيه وسيط `--model` لحزمة `transformers.onnx` إلى الدليل المطلوب:
|
||||
|
||||
```bash
|
||||
python -m transformers.onnx --model=local-pt-checkpoint onnx/
|
||||
```
|
154
docs/source/ar/torchscript.md
Normal file
@ -0,0 +1,154 @@
|
||||
# التصدير إلى TorchScript
|
||||
|
||||
<Tip>
|
||||
|
||||
هذه هي بداية تجاربنا مع TorchScript ولا زلنا نستكشف قدراته مع نماذج المدخلات المتغيرة الحجم. إنه مجال اهتمامنا وسنعمق تحليلنا في الإصدارات القادمة، مع المزيد من الأمثلة البرمجية، وتنفيذ أكثر مرونة، ومقاييس مقارنة بين الأكواد القائمة على Python مع أكواد TorchScript المُجمّعة.
|
||||
|
||||
</Tip>
|
||||
|
||||
وفقًا لـ [وثائق TorchScript](https://pytorch.org/docs/stable/jit.html):
|
||||
|
||||
> TorchScript هي طريقة لإنشاء نماذج قابلة للتسلسل والتحسين من تعليمات PyTorch البرمجية.
|
||||
|
||||
هناك وحدتان من PyTorch، [JIT and TRACE](https://pytorch.org/docs/stable/jit.html)، تتيحان للمطورين تصدير نماذجهم لإعادة استخدامها في برامج أخرى مثل برامج C++ المُحسّنة للأداء.
|
||||
|
||||
نقدم واجهة تتيح لك تصدير نماذج 🤗 Transformers إلى TorchScript بحيث يمكن إعادة استخدامها في بيئة مختلفة عن برامج Python القائمة إلى PyTorch. هنا نشرح كيفية تصدير نماذجنا واستخدامها باستخدام TorchScript.
|
||||
|
||||
يتطلب تصدير نموذج أمرين:
|
||||
|
||||
- تهيئة مثيل للنموذج باستخدام علامة `torchscript`
|
||||
- تمرير مُدخلات وهمية (dummy inputs) خلال النموذج
|
||||
|
||||
تنطوي هذه الضرورات على عدة أمور يجب على المطورين توخي الحذر بشأنها كما هو مفصل أدناه.
|
||||
|
||||
## علامة TorchScript والأوزان المرتبطة
|
||||
|
||||
علامة `torchscript` ضرورية لأن معظم نماذج اللغة 🤗 Transformers لها أوزان مرتبطة بين طبقة `Embedding` وطبقة `Decoding`. لا يسمح لك TorchScript بتصدير النماذج ذات الأوزان المرتبطة، لذلك من الضروري فصل الأوزان ونسخها مسبقًا.
|
||||
|
||||
النماذج المُهيأة باستخدام علامة `torchscript` لها طبقة `Embedding` وطبقة `Decoding` منفصلتين، مما يعني أنه لا ينبغي تدريبها لاحقًا. سيؤدي التدريب إلى عدم تزامن الطبقتين، مما يؤدي إلى نتائج غير متوقعة.
|
||||
|
||||
هذا لا ينطبق على النماذج التي لا تحتوي على رأس نموذج اللغة، حيث لا تملك أوزانًا مرتبطة. يمكن تصدير هذه النماذج بأمان دون علامة `torchscript`.
|
||||
|
||||
## المدخلات الوهمية والأطوال القياسية
|
||||
|
||||
تُستخدم المُدخلات الوهمية لتمرير أمامي خلال النموذج. أثناء انتشار قيم المُدخلات عبر الطبقات، يتتبع PyTorch العمليات المختلفة التي يتم تنفيذها على كل مصفوفة(tensor). ثم يتم استخدام هذه العمليات المُسجلة بعد ذلك لإنشاء *أثر* النموذج.
|
||||
|
||||
يتم إنشاء التتبع بالنسبة لأبعاد المُدخلات. وبالتالي، فهو مُقيّد بأبعاد المُدخلات الوهمية، ولن يعمل لأي طول تسلسل أو حجم دفعة مختلف. عند المحاولة بحجم مختلف، يتم رفع الخطأ التالي:
|
||||
|
||||
```
|
||||
`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`
|
||||
```
|
||||
|
||||
نوصي بتتبع النموذج باستخدام حجم مُدخلات وهمية لا يقل عن أكبر مُدخل سيتم تقديمه للنموذج أثناء الاستدلال. يمكن أن تساعد الحشوة (padding) في ملء القيم المفقودة. ومع ذلك، نظرًا لتتبع النموذج بحجم مُدخل أكبر، ستكون أبعاد المصفوفة كبيرة أيضًا، مما يؤدي إلى المزيد من الحسابات.
|
||||
|
||||
انتبه إلى إجمالي عدد العمليات المُنفذة على كل مُدخل وتابع الأداء عن كثب عند تصدير نماذج متغيرة طول التسلسل.
|
||||
|
||||
## استخدام TorchScript في Python
|
||||
|
||||
يوضح هذا القسم كيفية حفظ النماذج وتحميلها، بالإضافة إلى كيفية استخدام التتبع للاستدلال.
|
||||
|
||||
### حفظ نموذج
|
||||
|
||||
لتصدير `BertModel` باستخدام TorchScript، قم بتهيئة ـ `BertModel` من فئة `BertConfig` ثم احفظه على القرص تحت اسم الملف `traced_bert.pt`:
|
||||
|
||||
```python
|
||||
from transformers import BertModel, BertTokenizer, BertConfig
|
||||
import torch
|
||||
|
||||
enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
|
||||
|
||||
# Tokenizing input text
|
||||
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
|
||||
tokenized_text = enc.tokenize(text)
|
||||
|
||||
# Masking one of the input tokens
|
||||
masked_index = 8
|
||||
tokenized_text[masked_index] = "[MASK]"
|
||||
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
|
||||
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
|
||||
|
||||
# Creating a dummy input
|
||||
tokens_tensor = torch.tensor([indexed_tokens])
|
||||
segments_tensors = torch.tensor([segments_ids])
|
||||
dummy_input = [tokens_tensor, segments_tensors]
|
||||
|
||||
# Initializing the model with the torchscript flag
|
||||
# Flag set to True even though it is not necessary as this model does not have an LM Head.
|
||||
config = BertConfig(
|
||||
vocab_size_or_config_json_file=32000,
|
||||
hidden_size=768,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=12,
|
||||
intermediate_size=3072,
|
||||
torchscript=True,
|
||||
)
|
||||
|
||||
# Instantiating the model
|
||||
model = BertModel(config)
|
||||
|
||||
# The model needs to be in evaluation mode
|
||||
model.eval()
|
||||
|
||||
# If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
|
||||
model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True)
|
||||
|
||||
# Creating the trace
|
||||
traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
|
||||
torch.jit.save(traced_model, "traced_bert.pt")
|
||||
```
|
||||
|
||||
### تحميل نموذج
|
||||
|
||||
يمكنك الآن تحميل `BertModel` المُحفظ سابقًا، `traced_bert.pt`، من القرص واستخدامه على `dummy_input` المُهيأ سابقًا:
|
||||
|
||||
```python
|
||||
loaded_model = torch.jit.load("traced_bert.pt")
|
||||
loaded_model.eval()
|
||||
|
||||
all_encoder_layers, pooled_output = loaded_model(*dummy_input)
|
||||
```
|
||||
|
||||
### استخدام نموذج مُتتبع للاستدلال
|
||||
|
||||
استخدم النموذج المُتتبع للاستدلال باستخدام أسلوب `__call__` الخاص به:
|
||||
|
||||
```python
|
||||
traced_model(tokens_tensor, segments_tensors)
|
||||
```
|
||||
|
||||
## نشر نماذج Hugging Face TorchScript على AWS باستخدام Neuron SDK
|
||||
|
||||
قدمت AWS عائلة [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) من اﻷجهزة لخفض التكلفة وأداء التعلم الآلي عالي الأداء في البيئة السحابية. تعمل أجهزة Inf1 بواسطة شريحة Inferentia من AWS، وهي مُسرّع أجهزة مُخصص، متخصص في أعباء عمل الاستدلال للتعلم العميق. [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#) هي SDK لـ Inferentia التي تدعم تتبع نماذج المحولات وتحسينها للنشر على Inf1. توفر Neuron SDK ما يلي:
|
||||
|
||||
1. واجهة برمجة تطبيقات سهلة الاستخدام مع تغيير سطر واحد من التعليمات البرمجية لتتبع نموذج TorchScript وتحسينه للاستدلال في البيئة السحابية.
|
||||
2. تحسينات الأداء الجاهزة للاستخدام [تحسين التكلفة والأداء](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/).
|
||||
3. دعم نماذج Hugging Face المحولات المبنية باستخدام إما [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html) أو [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html).
|
||||
|
||||
### الآثار المترتبة
|
||||
|
||||
تعمل نماذج المحولات المستندة إلى بنية [BERT (تمثيلات الترميز ثنائية الاتجاه من المحولات)](https://huggingface.co/docs/transformers/main/model_doc/bert) أو متغيراتها مثل [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert) و [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta) بشكل أفضل على Inf1 للمهام غير التوليدية مثل الإجابة على الأسئلة الاستخراجية، وتصنيف التسلسلات، وتصنيف الرموز (tokens). ومع ذلك، يمكن تكييف مهام توليد النصوص للعمل على Inf1 وفقًا لهذا [برنامج تعليمي AWS Neuron MarianMT](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html). يمكن العثور على مزيد من المعلومات حول النماذج التي يمكن تحويلها جاهزة على Inferentia في قسم [ملاءمة بنية النموذج](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia) من وثائق Neuron.
|
||||
|
||||
### التبعيات (Dependencies)
|
||||
|
||||
يتطلب استخدام AWS Neuron لتحويل النماذج [بيئة SDK Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide) والتي تأتي مسبقًا على [AMI للتعلم العميق من AWS](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html).
|
||||
|
||||
### تحويل نموذج لـ AWS Neuron
|
||||
|
||||
قم بتحويل نموذج لـ AWS NEURON باستخدام نفس التعليمات البرمجية من [استخدام TorchScript في Python](torchscript#using-torchscript-in-python) لتتبع `BertModel`. قم باستيراد امتداد إطار عمل `torch.neuron` للوصول إلى مكونات Neuron SDK من خلال واجهة برمجة تطبيقات Python:
|
||||
|
||||
```python
|
||||
from transformers import BertModel, BertTokenizer, BertConfig
|
||||
import torch
|
||||
import torch.neuron
|
||||
```
|
||||
|
||||
كل ما عليك فعله هو تعديل السطر التالي:
|
||||
|
||||
```diff
|
||||
- torch.jit.trace(model, [tokens_tensor, segments_tensors])
|
||||
+ torch.neuron.trace(model, [tokens_tensor, segments_tensors])
|
||||
```
|
||||
|
||||
يتيح ذلك لـ Neuron SDK تتبع النموذج وتحسينه لمثيلات Inf1.
|
||||
|
||||
لمعرفة المزيد حول ميزات AWS Neuron SDK والأدوات ودروس البرامج التعليمية والتحديثات الأخيرة، يرجى الاطلاع على [وثائق AWS NeuronSDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html).
|
@ -88,8 +88,6 @@
|
||||
title: Tool use
|
||||
- local: chat_templating_writing
|
||||
title: Writing a chat template
|
||||
- local: chat_response_parsing
|
||||
title: Response parsing
|
||||
title: Chat with models
|
||||
- sections:
|
||||
- local: serving
|
||||
@ -229,6 +227,8 @@
|
||||
title: ONNX
|
||||
- local: executorch
|
||||
title: ExecuTorch
|
||||
- local: torchscript
|
||||
title: TorchScript
|
||||
title: Export to production
|
||||
- isExpanded: false
|
||||
sections:
|
||||
@ -1255,8 +1255,6 @@
|
||||
title: Importing Utilities
|
||||
- local: internal/time_series_utils
|
||||
title: Utilities for Time Series
|
||||
- local: internal/rope_utils
|
||||
title: Rotary Embeddings Utilities
|
||||
title: Internal helpers
|
||||
- sections:
|
||||
- local: reference/environment_variables
|
||||
|
@ -55,7 +55,6 @@ deepspeed --num_gpus 2 trainer-program.py ...
|
||||
</hfoptions>
|
||||
|
||||
## Order of accelerators
|
||||
|
||||
To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
|
||||
|
||||
For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
|
||||
|
@ -95,12 +95,9 @@ print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))
|
||||
|
||||
The chat model called the `get_current_temperature` tool with the correct parameters from the docstring. It inferred France as the location based on Paris, and that it should use Celsius for the units of temperature.
|
||||
|
||||
A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history. For
|
||||
models that support [response parsing](./chat_response_parsing), the response parsing will be handled automatically, and you can just use
|
||||
[`~PreTrainedTokenizer.parse_response`] to extract the tool call. For other models, you'll need to manually translate the output
|
||||
string into a tool call dict.
|
||||
A model **cannot actually call the tool itself**. It requests a tool call, and it's your job to handle the call and append it and the result to the chat history.
|
||||
|
||||
Regardless of the approach you use, the tool call should go in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
|
||||
Hold the call in the `tool_calls` key of an `assistant` message. This is the recommended API, and should be supported by the chat template of most tool-using models.
|
||||
|
||||
> [!WARNING]
|
||||
> Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
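
For example, here's a minimal sketch of that bookkeeping. The tool name, arguments, and result value are illustrative assumptions, not real model output:

```python
# A hedged sketch: append the requested tool call and its result to the chat history.
# The tool name, arguments, and result are illustrative assumptions.
messages = []  # ...the chat history built up so far

tool_call = {
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "arguments": {"location": "Paris, France", "unit": "celsius"},
    },
}
messages.append({"role": "assistant", "tool_calls": [tool_call]})

# Run the tool yourself, then report the result back as a `tool` message.
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```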
|
||||
|
@ -1,233 +0,0 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Response Parsing
|
||||
|
||||
It is increasingly common for chat models to generate structured outputs, rather than just a single reply string.
|
||||
The most common uses for structured outputs are [tool calling](./chat_extras) and [reasoning models](https://huggingface.co/reasoning-course).
|
||||
Tool calling models can output tool calls, containing the name of the tool to call and any arguments to be passed to it,
|
||||
while reasoning models often output reasoning steps as a "chain of thought". Some recent models even use both of these,
|
||||
and may output reasoning and/or one or more tool calls before their final answer.
|
||||
|
||||
Models with structured outputs pose a challenge for chat templating, because the output needs to be parsed before it
|
||||
can be appended to the chat. For a concrete example, let's say we ask [GPT-OSS](https://huggingface.co/openai/gpt-oss-120b)
|
||||
what the weather is like, and it thinks and decides to call a tool. Here's what the raw model output might look like:
|
||||
|
||||
```txt
|
||||
<|start|>analysis<|message|>The user asks: "What is the weather like in SF?" We need to get the location of the user? The user explicitly asks about SF (San Francisco).
|
||||
So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data.
|
||||
So we should call get_current_weather with location "San Francisco, CA". Let's do that.
|
||||
We will call function get_current_weather.<|end|><|start|>commentary to=functions.get_current_weather<|channel|>commentary <|constrain|>json<|message|>{"location":"San Francisco, CA"}<|call|>
|
||||
}
|
||||
```
|
||||
|
||||
But if you want to append this to a chat, you'll need to format it as a chat message dict, like this:
|
||||
|
||||
```json
|
||||
{
|
||||
"role": "assistant",
|
||||
"thinking": "The user asks: \"What is the weather like in SF?\" We need to get the location of the user? The user explicitly asks about SF (San Francisco). So we need to get the current weather in San Francisco, CA. We need to call get_current_weather function. But we need to call function to get weather data. So we should call get_current_weather with location \"San Francisco, CA\". Let's do that.",
|
||||
"tool_calls": [
|
||||
{
|
||||
"name": "get_current_weather",
|
||||
"arguments": {
|
||||
"location": "San Francisco, CA"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Chat **templates** give us a way to turn messages into formatted input for a model, but we need something else to
|
||||
parse model output back into a standard message dict. This is what chat **parsing** is for.
|
||||
|
||||
## The [`~PreTrainedTokenizerBase.parse_response`] method
|
||||
|
||||
Parsing a chat response on a model that supports it is straightforward. Simply take the raw, decoded output from
|
||||
[`~generation.GenerationMixin.generate`], and pass it to the tokenizer's [`~PreTrainedTokenizerBase.parse_response`] method:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
checkpoint = "HuggingFaceTB/SmolLM3-3B"
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
|
||||
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": "Hey! Can you summarize the end of the Cold War as briefly as possible? Like, comically briefly. It should really leave out almost most of the relevant information."
|
||||
}
|
||||
]
|
||||
|
||||
input_ids = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_tensors="pt"
|
||||
).to(model.device)
|
||||
|
||||
outputs = model.generate(input_ids, max_new_tokens=1024)[0, input_ids.shape[1]:]
|
||||
out_text = tokenizer.decode(outputs)
|
||||
parsed = tokenizer.parse_response(out_text)
|
||||
print(parsed.keys())
|
||||
```
|
||||
|
||||
And you should get:
|
||||
|
||||
```text
|
||||
dict_keys(['thinking', 'content'])
|
||||
```
|
||||
|
||||
And that's all you need to start using response parsing! `parse_response` should return a complete message dict that is ready to be appended to the chat history.
|
||||
When the tokenizer does not support response parsing, `parse_response` will throw an error. We hope to add support
|
||||
to more tokenizers over time.
|
||||
|
||||
## Developers: Understanding a simple response schema
|
||||
|
||||
Under the hood, `parse_response` uses a **JSON schema** to parse the model output. A JSON schema represents
|
||||
the structure of the output message dict. The schema is augmented with additional fields that indicate how the
|
||||
output message string should be parsed into the expected format. Let's take a look at the schema for a SmolLM response,
|
||||
excluding tool calls for now:
|
||||
|
||||
```python
|
||||
{
|
||||
"x-regex": "(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"role": {"const": "assistant"},
|
||||
"content": {"type": "string"},
|
||||
"thinking": {"type": "string"}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
We can see that the schema describes a JSON "object" (a `dict`, in other words) with three keys: `role`, `content`, and `thinking`.
|
||||
Because all assistant responses have the role "assistant", the `role` key is a `const`(ant). The other two keys are strings, extracted
|
||||
from the named groups in the regex in the `x-regex` field.
|
||||
|
||||
Like chat templates, response schemas are set as a property of the tokenizer. To enable response parsing, all you need
|
||||
to do is set `tokenizer.response_schema` to a valid schema dict, and `tokenizer.parse_response()` will work! Again, like
|
||||
chat templates, this schema will be saved with the processor, so once you set it, you can use `save_pretrained()` or `push_to_hub()` to
|
||||
save and share the schema.
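
For instance, here's a rough sketch of wiring this up by hand. The checkpoint name is an assumption, and the schema is the SmolLM-style example shown above:

```python
from transformers import AutoTokenizer

# Illustrative sketch: attach a response schema to a tokenizer and save it.
smollm_schema = {
    "x-regex": r"(?:<think>\n?(?P<thinking>.+?)\n?</think>)?\s*(?P<content>.+?)?\s*(?:<\|im_end\|>|$)",
    "type": "object",
    "properties": {
        "role": {"const": "assistant"},
        "content": {"type": "string"},
        "thinking": {"type": "string"},
    },
}

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
tokenizer.response_schema = smollm_schema
print(tokenizer.parse_response("<think>Keep it short.</think>The Cold War ended in 1991.<|im_end|>"))

# The schema is saved alongside the rest of the tokenizer files.
tokenizer.save_pretrained("smollm3-with-response-schema")
```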
|
||||
|
||||
## Developers: Complex schemas
|
||||
|
||||
Now, let's look at a more complex schema, which includes tool calls, to gain more of an understanding of the parser
|
||||
internals. For this, we'll use the `GPT-OSS` schema. GPT-OSS emits both tool calls and thinking blocks, and it uses
|
||||
an unusual format where model responses are tagged with one of three "channels": `commentary` for things like
|
||||
tool calls, `analysis` for chain of thought blocks, and `final` for messages intended to be sent to the user.
|
||||
A full message where the model calls a tool named `get_current_weather` might look like this, with some extra linebreaks added for clarity:
|
||||
|
||||
```text
|
||||
<|channel|>analysis<|message|>
|
||||
The user asks: "What is the weather like in SF?" So we need to get the current weather in San Francisco, CA.
|
||||
We need to call get_current_weather function. So we should call get_current_weather with location "San Francisco, CA".
|
||||
<|end|>
|
||||
<|start|>assistant<|channel|>commentary
|
||||
to=functions.get_current_weather <|constrain|>json<|message|>
|
||||
{
|
||||
"location": "San Francisco, CA"
|
||||
}
|
||||
<|call|>
|
||||
```
|
||||
|
||||
Parsing proceeds recursively; the output of a regex (or other parser) at one level becomes the input to the nodes below it.
|
||||
In other words, don't feel like you have to parse the entire output in one enormous regex! Instead, start with the schema,
|
||||
and then add regexes to extract the relevant chunks as you go. Here's a schema that will parse it, with some
|
||||
explanatory comments:
|
||||
|
||||
```python
|
||||
{
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"role": {"const": "assistant"},
|
||||
# "content" and "thinking" are both similar to the previous example, and just extract a single string
|
||||
# However, rather than using a single regex with named groups to extract both, we use a regex in each subkey.
|
||||
# When an object node has no parser/regex, the entire input string is passed to all of its children, so
|
||||
# parsing can either be done with named groups at the object level, or with separate regexes at the property level.
|
||||
"content": {"type": "string", "x-regex": r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|$)"},
|
||||
"thinking": {"type": "string", "x-regex": r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>"},
|
||||
"tool_calls": {
|
||||
# "x-regex-iterator" uses re.findall to find multiple possible manages, and returns them as an
|
||||
# array/list. You don't need to worry about array handling, though - each item in the array will be
|
||||
# parsed by the `items` schema, so just write the schema for a single item.
|
||||
"x-regex-iterator": r"<\|channel\|>commentary (to=functions\..*?<\|message\|>.*?)(?:<\|call\|>|$)",
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
# A const property is a fixed value, and the input has no effect on it.
|
||||
"type": {"const": "function"},
|
||||
# Here, we wrap the entire tool call dict in a `{"function": ...}` block. The input string is passed through to it unchanged.
|
||||
"function": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"name": {"type": "string", "x-regex": r"^to=functions\.(\w+)"},
|
||||
"arguments": {
|
||||
"type": "object",
|
||||
"x-regex": "<\|message\|>(.*)",
|
||||
# The "x-parser" field indicates that the extracted string should be parsed as JSON.
|
||||
# The output is then passed to the schema nodes below and recursive parsing continues.
|
||||
"x-parser": "json",
|
||||
"additionalProperties": {"type": "any"},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
```
|
||||
|
||||
## Developers: Understanding the parser logic
|
||||
|
||||
The parser follows a few simple rules:
|
||||
|
||||
1. Each level of the schema receives input from the level above, applies any regex or parser it has, and then passes the output to its children.
|
||||
2. The root level receives the entire decoded model output string as input.
|
||||
3. If a node has structured content after parsing (for example, if the regex has named groups and returns a dict, or if the parser returns a dict or list),
|
||||
then that structured content is mapped to the node's children, and each child node receives its corresponding value as input.
|
||||
4. If an `object` (dict) node has unstructured (string) output, then the entire string is passed to all of its children. This allows child nodes
|
||||
to handle parsing individually rather than requiring a single parent regex to extract all keys at once.
|
||||
5. If an `array` (list) node has unstructured (string) output, then this throws an error.
|
||||
|
||||
There is a small set of allowable `x-` keys that indicate how parsing should be done at each node:
|
||||
- `x-regex`: A regex string to apply to the input. If the regex has named groups, the output is a dict of group names to values. Named groups should only be used in `object` nodes.
|
||||
Otherwise, the regex must have exactly one unnamed capturing group, and the output is the value of that group as a string.
|
||||
- `x-regex-iterator`: A regex string to apply to the input using `re.findall()`. The output is a list of all matches.
|
||||
This should only be used in `array` nodes, and the regex must have exactly one unnamed capturing group. The output is distributed to
|
||||
the node's `items` schema.
|
||||
- `x-parser`: Calls a built-in parser to apply to the input. Currently, the only supported parser is `json`, which parses the input string as JSON.
|
||||
The output is passed to the child nodes for further parsing. Note that the `json` parser can return deeply nested output - in this case, the output
|
||||
will be progressively unwrapped as it is passed through child nodes. The child nodes do not need additional `x-parser` or `x-regex` fields in this case,
|
||||
but their structure must match the structure of the parsed JSON.
|
||||
- `x-parser-args`: Only allowed in conjunction with `x-parser`. This is a dict of additional arguments that control parsing. Right now, the only supported
|
||||
argument is `transform`, which specifies a `jmespath` transformation to apply to the output. This is useful when the JSON parser returns a structure
|
||||
that needs to be modified to match the schema.
|
||||
- `x-regex-key-value`: This is rarely necessary, but it can be useful when parsing key-value pairs in non-JSON format where the names of the keys are not known
|
||||
in advance, such as when a model emits XML tool calls with arbitrary argument names. The regex must have exactly two named capturing groups,
|
||||
`key` and `value`, and the output is a dict mapping keys to values. This should only be used in `object` nodes.
|
||||
|
||||
In general, multiple regexes/parsers cannot be combined at the same level. The exception is that `x-regex`, returning a single string, can be combined with the other parsers. In this case,
|
||||
`x-regex` is applied first, and then the output is passed to the other parser, either `x-regex-iterator`, `x-parser`, or `x-regex-key-value`.
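
For instance, a hypothetical node (not taken from any real model's schema) that first extracts a span with `x-regex` and then hands it to the JSON parser could look like this:

```python
{
    "type": "object",
    # Hypothetical sketch: extract the <args>...</args> span first, then parse it as JSON.
    "x-regex": r"<args>(.*?)</args>",
    "x-parser": "json",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string"},
    },
}
```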
|
||||
|
||||
Putting these ideas together, you can see that the input flows through the schema, being parsed at each level and then distributed to child nodes. Each level
|
||||
only needs to extract the input content that is relevant for that part of the schema, and can then let its child nodes handle the rest. Internally, this is handled
|
||||
with a parser function that receives input, applies any regexes/parsers at the current level, then maps the result to its child nodes before recursively calling itself on each of them.
|
||||
Recursion terminates when it reaches leaf nodes, usually primitive types like `string` or `number`, which simply return the input they receive.
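
To make these rules concrete, here is a hedged, stripped-down sketch of the recursion in plain Python. It only covers `x-regex`, `x-regex-iterator`, and the `json` parser, and it illustrates the logic above rather than reproducing the actual Transformers implementation:

```python
import json
import re


def parse_node(schema, value):
    """Hedged sketch of the recursive parsing rules above, not the real implementation."""
    # Rule 1: apply any regex/parser declared at this level before recursing.
    if "x-regex" in schema and isinstance(value, str):
        match = re.search(schema["x-regex"], value, re.DOTALL)
        if match is None:
            return None
        value = match.groupdict() if match.groupdict() else match.group(1)
    if "x-regex-iterator" in schema and isinstance(value, str):
        value = re.findall(schema["x-regex-iterator"], value, re.DOTALL)
    if schema.get("x-parser") == "json" and isinstance(value, str):
        value = json.loads(value)

    # A const node ignores its input entirely.
    if "const" in schema:
        return schema["const"]

    node_type = schema.get("type")
    if node_type == "object":
        properties = schema.get("properties")
        if not properties:
            # e.g. "additionalProperties": arbitrary keys pass through unchanged.
            return value
        parsed = {}
        for key, child_schema in properties.items():
            # Rule 3: structured output is mapped to children by key.
            # Rule 4: unstructured string output is passed whole to every child.
            child_value = value.get(key) if isinstance(value, dict) else value
            child_parsed = parse_node(child_schema, child_value)
            if child_parsed is not None:
                parsed[key] = child_parsed
        return parsed
    if node_type == "array":
        # Rule 5: an array node needs structured (list) input, e.g. from x-regex-iterator.
        if not isinstance(value, list):
            raise ValueError("array nodes require list input")
        return [parse_node(schema["items"], item) for item in value]

    # Leaf nodes (string, number, ...) simply return what they received.
    return value
```

Calling `parse_node(schema, decoded_output)` with the GPT-OSS schema above should yield a message dict similar to the one shown at the top of this page.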
|
@ -6,13 +6,13 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
This page regroups resources around 🤗 Transformers developed by the community.
|
||||
|
||||
## Community resources
|
||||
## Community resources:
|
||||
|
||||
| Resource | Description | Author |
|
||||
|:----------|:-------------|------:|
|
||||
| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learned/revised using [Anki](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |
|
||||
|
||||
## Community notebooks
|
||||
## Community notebooks:
|
||||
|
||||
| Notebook | Description | Author | |
|
||||
|:----------|:-------------|:-------------|------:|
|
||||
|
@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) runs PyTorch models on mobile and edge devices. Export your Transformers models to the ExecuTorch format with [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) with the command below.
|
||||
|
||||
```bash
|
||||
```
|
||||
optimum-cli export executorch \
|
||||
--model "HuggingFaceTB/SmolLM2-135M-Instruct" \
|
||||
--task "text-generation" \
|
||||
@ -29,5 +29,4 @@ optimum-cli export executorch \
|
||||
--qembedding 8w \
|
||||
--output_dir="hf_smollm2"
|
||||
```
|
||||
|
||||
Run `optimum-cli export executorch --help` to see all export options. For detailed export instructions, check the [README](optimum/exporters/executorch/README.md).
|
||||
|
@ -37,6 +37,7 @@ def model_init(trial):
|
||||
config=config,
|
||||
cache_dir=model_args.cache_dir,
|
||||
revision=model_args.model_revision,
|
||||
token=True if model_args.use_auth_token else None,
|
||||
)
|
||||
```
|
||||
|
||||
|
@ -320,7 +320,7 @@ df.sort_values(by=['skipped_proportion'], ascending=False)
|
||||
You can focus on a specific test method using `--test_method_name`:
|
||||
|
||||
```bash
|
||||
python utils/scan_skipped_tests.py --test_method_name test_inputs_embeds --output_dir path/to/output
|
||||
$ python utils/scan_skipped_tests.py --test_method_name test_inputs_embeds --output_dir path/to/output
|
||||
```
|
||||
|
||||
- `--test_method_name`: Name of the test method to scan (e.g., `test_inputs_embeds`).
|
||||
|
@ -1,83 +0,0 @@
|
||||
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Utilities for Rotary Embedding
|
||||
|
||||
This page explains how the Rotary Embedding is computed and applied in Transformers and what types of RoPE are supported.
|
||||
|
||||
## Overview
|
||||
|
||||
Rotary Position Embeddings are a technique used to inject positional information into attention mechanisms without relying on explicit position encodings.
|
||||
Instead of adding position vectors to token embeddings, RoPE rotates query and key vectors in the complex plane according to their positions, enabling relative positional awareness and better extrapolation to unseen sequence lengths.
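
As background, the standard RoPE formulation (a sketch of the math, not code from this library) rotates each pair of query/key dimensions at position $m$ by a position-dependent angle:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} =
\begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad \theta_i = \texttt{rope\_theta}^{-2i/d}
$$

where $d$ is the head dimension. Because the angle depends only on the position $m$, the dot product between a rotated query and key depends only on their relative offset, which is what gives RoPE its relative-position behavior.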
|
||||
|
||||
The Transformers library provides a flexible and extensible implementation of various RoPE types defined in [`~modeling_rope_utils.ROPE_VALIDATION_FUNCTIONS`], including both the default and scaled variants:
|
||||
|
||||
| Rope Type | Description |
|
||||
|------------|-------------|
|
||||
| `"default"` | Standard rotary embedding as in LLaMA. |
|
||||
| `"linear"` | Linear-scaled RoPE which allows longer context windows. |
|
||||
| `"dynamic"` | NTK-aware scaling computed by rescaling frequency base (`θ`) for longer context. |
|
||||
| `"yarn"` | YaRN scaling variant providing smoother extrapolation and stability. |
|
||||
| `"longrope"` | [LongRoPE](https://github.com/microsoft/LongRoPE) scaling as in Phi-2 model series. |
|
||||
| `"llama3"` | RoPE scaling as in Llama3.1. |
|
||||
|
||||
## Configuration in Model Configs
|
||||
|
||||
To enable and customize rotary embeddings, add a `rope_parameters` field to your model’s configuration file (`config.json`). This field controls the RoPE behavior across model layers. Note that each RoPE variant defines its own set of expected keys and missing keys will raise an error. See the example below which creates a llama config with default RoPE parameters:
|
||||
|
||||
```python
|
||||
from transformers import LlamaConfig
|
||||
|
||||
config = LlamaConfig()
|
||||
config.rope_parameters = {
|
||||
"rope_type": "default", # type of RoPE to use
|
||||
"rope_theta": 10000.0 # base frequency parameter
|
||||
}
|
||||
|
||||
# If we want to apply a scaled RoPE type, we need to pass extra parameters
|
||||
config.rope_parameters = {
|
||||
"rope_type": "linear",
|
||||
"rope_theta": 10000.0,
|
||||
"factor": 8.0 # scale factor for context extension
|
||||
}
|
||||
```
|
||||
|
||||
## Per-Layer-Type RoPE Configuration
|
||||
|
||||
Some models such as Gemma-3 use different layer types with different attention mechanisms, i.e. "full attention" in some blocks and "sliding-window attention" in others. Transformers supports specifying distinct RoPE parameters per layer type for these models. In this case, `rope_parameters` should be a nested dictionary, where top-level keys correspond to `config.layer_types` and values are per-type RoPE parameters. During model initialization, each decoder layer will automatically look up the matching RoPE configuration based on its declared layer type.
|
||||
|
||||
```python
|
||||
from transformers import Gemma3Config
|
||||
|
||||
config = Gemma3Config()
|
||||
config.rope_parameters = {
|
||||
"full_attention": {
|
||||
"rope_type": "dynamic",
|
||||
"rope_theta": 1000000.0,
|
||||
"factor": 8.0,
|
||||
"original_max_position_embeddings": 8096,
|
||||
},
|
||||
"sliding_attention": {
|
||||
"rope_type": "default",
|
||||
"rope_theta": 10000.0,
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Utilities
|
||||
|
||||
[[autodoc]] RopeParameters
|
||||
- __call__
|
@ -1,3 +1,3 @@
|
||||
# Overview
|
||||
|
||||
Kernels in transformers are used to optimize the performance of models with custom layers from the hub and very low effort.
|
||||
Kernels in Transformers optimize model performance using custom layers from the Hub, with very little effort.
|
@ -208,7 +208,7 @@ Some models have a unique way of storing past kv pairs or states that is not com
|
||||
|
||||
Mamba models, such as [Mamba](./model_doc/mamba), require a specific cache because the model doesn't have an attention mechanism or kv states. Thus, they are not compatible with the above [`Cache`] classes.
|
||||
|
||||
## Iterative generation
|
||||
# Iterative generation
|
||||
|
||||
A cache can also work in iterative generation settings where there is back-and-forth interaction with a model (chatbots). Like regular generation, iterative generation with a cache allows a model to efficiently handle ongoing conversations without recomputing the entire context at each step.
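
Here's a hedged sketch of that pattern. The checkpoint is illustrative and the exact `DynamicCache` constructor may differ between versions, so treat it as a sketch rather than the canonical recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

# Illustrative sketch of reusing one cache across chat turns.
model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

past_key_values = DynamicCache()
messages = []

for user_turn in ["Hello, what should I read this weekend?", "Something shorter, please."]:
    messages.append({"role": "user", "content": user_turn})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(model.device)
    # The cache keeps growing across turns, so earlier context isn't recomputed.
    outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=128)
    reply = tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})
```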
|
||||
|
||||
|
@ -67,6 +67,6 @@ Examples of use can be found in the [example scripts](../examples) or [example n
|
||||
|
||||
[[autodoc]] data.data_collator.DataCollatorWithFlattening
|
||||
|
||||
## DataCollatorForMultipleChoice
|
||||
# DataCollatorForMultipleChoice
|
||||
|
||||
[[autodoc]] data.data_collator.DataCollatorForMultipleChoice
|
||||
|
about how many forward passes your inputs are actually going to trigger, you can
|
||||
independently of the inputs. The caveats from the previous section still apply.
|
||||
|
||||
## Pipeline FP16 inference
|
||||
|
||||
Models can be run in FP16 which can be significantly faster on GPU while saving memory. Most models will not suffer noticeable performance loss from this. The larger the model, the less likely that it will.
|
||||
|
||||
To enable FP16 inference, you can simply pass `dtype=torch.float16` or `dtype='float16'` to the pipeline constructor. Note that this only works for models with a PyTorch backend. Your inputs will be converted to FP16 internally.
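
For instance (the checkpoint below is only an illustrative choice):

```python
import torch
from transformers import pipeline

# Half-precision inference; the checkpoint is illustrative.
generator = pipeline(
    task="text-generation",
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    dtype=torch.float16,
    device_map="auto",
)
print(generator("FP16 inference keeps memory usage low and", max_new_tokens=20)[0]["generated_text"])
```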
|
||||
@ -335,7 +334,6 @@ Pipelines available for audio tasks include the following.
|
||||
Pipelines available for computer vision tasks include the following.
|
||||
|
||||
### DepthEstimationPipeline
|
||||
|
||||
[[autodoc]] DepthEstimationPipeline
|
||||
- __call__
|
||||
- all
|
||||
|
@ -43,7 +43,6 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
|
||||
[[autodoc]] AwqConfig
|
||||
|
||||
## EetqConfig
|
||||
|
||||
[[autodoc]] EetqConfig
|
||||
|
||||
## GPTQConfig
|
||||
|
@ -50,14 +50,14 @@ several advanced alignment methods which can be used to map between the original
|
||||
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
|
||||
to a given token).
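
For instance, here's a quick sketch with a fast tokenizer (the checkpoint is illustrative; any fast tokenizer exposes the same helpers on its `BatchEncoding`):

```python
from transformers import AutoTokenizer

# Sketch of the alignment helpers on a fast tokenizer's BatchEncoding.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
encoding = tokenizer("Transformers tokenizers align characters and tokens.")

# Which token covers the character at index 3?
token_index = encoding.char_to_token(3)
print(token_index, encoding.tokens()[token_index])

# Which character span does that token come from?
print(encoding.token_to_chars(token_index))
```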
|
||||
|
||||
## Multimodal Tokenizer
|
||||
# Multimodal Tokenizer
|
||||
|
||||
Apart from that each tokenizer can be a "multimodal" tokenizer which means that the tokenizer will hold all relevant special tokens
|
||||
as part of tokenizer attributes for easier access. For example, if the tokenizer is loaded from a vision-language model like LLaVA, you will
|
||||
be able to access `tokenizer.image_token_id` to obtain the special image token used as a placeholder.
|
||||
|
||||
To enable extra special tokens for any type of tokenizer, you have to add the following lines and save the tokenizer. Extra special tokens do not
|
||||
have to be modality related and can be anything that the model often needs access to. In the below code, tokenizer at `output_dir` will have direct access
|
||||
have to be modality related and can be anything that the model often needs access to. In the below code, the tokenizer at `output_dir` will have direct access
|
||||
to three more special tokens.
|
||||
|
||||
```python
|
||||
|
@ -23,7 +23,6 @@ The video processor extends the functionality of image processors by allowing Vi
|
||||
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM, the processor will try to load video related configurations from a file named `preprocessing_config.json`.
|
||||
|
||||
### Usage Example
|
||||
|
||||
Here's an example of how to load a video processor with [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:
|
||||
|
||||
```python
|
||||
|
@ -13,66 +13,51 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-11-21 and added to Hugging Face Transformers on 2025-07-08.*
|
||||
|
||||
*This model was released on 2024-11-21 and added to Hugging Face Transformers on 2025-07-08 and contributed by [yaswanthgali](https://huggingface.co/yaswanthgali).*
|
||||
|
||||
# AIMv2
|
||||
|
||||
## Overview
|
||||
[AIMv2](https://huggingface.co/papers/2411.14402) presents a novel method for pre-training large-scale vision encoders in a multimodal setting, combining images and text. The model, characterized by a straightforward pre-training process and scalability, pairs a vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. AIMV2 excels in both multimodal evaluations and vision benchmarks such as localization, grounding, and classification. Notably, the AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk and outperforms state-of-the-art contrastive models like CLIP and SigLIP in multimodal image understanding across various settings.
|
||||
|
||||
The AIMv2 model was proposed in [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://huggingface.co/papers/2411.14402) by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The abstract from the paper is the following:
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
*We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.*
|
||||
|
||||
This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
|
||||
The original code can be found [here](https://github.com/apple/ml-aim).
|
||||
|
||||
## Usage Example
|
||||
|
||||
Here is an example of Image Feature Extraction using specific checkpoints on resized images and native resolution images:
|
||||
|
||||
```python
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModel
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-native")
|
||||
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-native")
|
||||
|
||||
inputs = processor(images=image, return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
pipeline = pipeline(task="zero-shot-classification", model="apple/aimv2-large-patch14-native", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
Here is an example of a checkpoint performing zero-shot classification:
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, AutoModel
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
text = ["Picture of a dog.", "Picture of a cat.", "Picture of a horse."]
|
||||
|
||||
processor = AutoProcessor.from_pretrained("apple/aimv2-large-patch14-224-lit")
|
||||
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit")
|
||||
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224-lit", dtype="auto")
|
||||
|
||||
inputs = processor(
|
||||
images=image,
|
||||
text=text,
|
||||
add_special_tokens=True,
|
||||
truncation=True,
|
||||
padding=True,
|
||||
return_tensors="pt",
|
||||
)
|
||||
inputs = processor(images=image, text=text, add_special_tokens=True, truncation=True, padding=True, return_tensors="pt",)
|
||||
outputs = model(**inputs)
|
||||
probs = outputs.logits_per_image.softmax(dim=-1)
|
||||
pred_idx = torch.argmax(probs, dim=-1).item()
|
||||
predicted_label = text[pred_idx]
|
||||
print(f"Predicted label: {predicted_label}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Aimv2Config
|
||||
|
||||
[[autodoc]] Aimv2Config
|
||||
@ -99,3 +84,4 @@ probs = outputs.logits_per_image.softmax(dim=-1)
|
||||
|
||||
[[autodoc]] Aimv2TextModel
|
||||
- forward
|
||||
|
||||
|
@ -13,32 +13,17 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-09-26 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2019-09-26 and added to Hugging Face Transformers on 2020-11-16 and contributed by [lysandre](https://huggingface.co/lysandre).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white" >
|
||||
<img alt="SDPA" src= "https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white" >
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# ALBERT
|
||||
|
||||
[ALBERT](https://huggingface.co/papers/1909.11942) is designed to address memory limitations of scaling and training of [BERT](./bert). It adds two parameter reduction techniques. The first, factorized embedding parametrization, splits the larger vocabulary embedding matrix into two smaller matrices so you can grow the hidden size without adding a lot more parameters. The second, cross-layer parameter sharing, allows layers to share parameters, which keeps the number of learnable parameters lower.
|
||||
|
||||
ALBERT was created to address problems such as GPU/TPU memory limitations, longer training times, and unexpected model degradation in BERT. ALBERT uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:
|
||||
|
||||
- **Factorized embedding parameterization:** The large vocabulary embedding matrix is decomposed into two smaller matrices, reducing memory consumption.
|
||||
- **Cross-layer parameter sharing:** Instead of learning separate parameters for each transformer layer, ALBERT shares parameters across layers, further reducing the number of learnable weights.
|
||||
|
||||
ALBERT uses absolute position embeddings (like BERT), so padding is applied on the right. The embedding size is 128, while BERT uses 768. ALBERT processes a maximum of 512 tokens at a time.
|
||||
|
||||
You can find all the original ALBERT checkpoints under the [ALBERT community](https://huggingface.co/albert) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different language tasks.
|
||||
|
||||
The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[ALBERT](https://huggingface.co/papers/1909.11942) presents parameter-reduction techniques to enhance BERT by splitting the embedding matrix and using repeating layers. These methods reduce memory usage and training time, enabling better scalability. The model employs a self-supervised loss to improve inter-sentence coherence, achieving state-of-the-art results on GLUE, RACE, and SQuAD benchmarks with fewer parameters than BERT-large.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -47,13 +32,8 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="albert-base-v2",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.", top_k=5)
|
||||
pipeline = pipeline(task="fill-mask", model="albert/albert-base-v2", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -63,76 +43,25 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.", top_
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForMaskedLM.from_pretrained("albert/albert-base-v2", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"albert/albert-base-v2",
|
||||
dtype=torch.float16,
|
||||
attn_implementation="sdpa",
|
||||
device_map="auto"
|
||||
)
|
||||
|
||||
prompt = "Plants create energy through a process known as [MASK]."
|
||||
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
|
||||
predictions = outputs.logits[0, mask_token_index]
|
||||
|
||||
top_k = torch.topk(predictions, k=5).indices.tolist()
|
||||
for token_id in top_k[0]:
|
||||
print(f"Prediction: {tokenizer.decode([token_id])}")
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model albert-base-v2 --device 0
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- Inputs should be padded on the right because ALBERT, like BERT, uses absolute position embeddings.
|
||||
- The embedding size `E` is different from the hidden size `H` because the embeddings are context independent (one embedding vector represents one token) and the hidden states are context dependent (one hidden state represents a sequence of tokens). The embedding matrix is also large because its size is `V x E`, where `V` is the vocabulary size. As a result, it makes sense for `H >> E`. If `E < H`, the model has fewer parameters.
|
||||
- ALBERT uses absolute position embeddings. Pad inputs on the right, not the left.
|
||||
|
||||
## Resources
|
||||
|
||||
The resources provided in the following sections consist of a list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
<PipelineTag pipeline="text-classification"/>
|
||||
|
||||
- [`AlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification).
|
||||
|
||||
- Check the [Text classification task guide](../tasks/sequence_classification) on how to use the model.
|
||||
|
||||
<PipelineTag pipeline="token-classification"/>
|
||||
|
||||
- [`AlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification).
|
||||
|
||||
- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course.
|
||||
- Check the [Token classification task guide](../tasks/token_classification) on how to use the model.
|
||||
|
||||
<PipelineTag pipeline="fill-mask"/>
|
||||
|
||||
- [`AlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
|
||||
- [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course.
|
||||
- Check the [Masked language modeling task guide](../tasks/masked_language_modeling) on how to use the model.
|
||||
|
||||
<PipelineTag pipeline="question-answering"/>
|
||||
|
||||
- [`AlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
|
||||
- [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course.
|
||||
- Check the [Question answering task guide](../tasks/question_answering) on how to use the model.
|
||||
|
||||
**Multiple choice**
|
||||
|
||||
- [`AlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb).
|
||||
- Check the [Multiple choice task guide](../tasks/multiple_choice) on how to use the model.
|
||||
- The embedding size E differs from hidden size H for good reason. Embeddings represent individual tokens (context-independent). Hidden states represent token sequences (context-dependent). This makes H >> E logical. The embedding matrix spans V × E dimensions, where V is vocabulary size. Keeping E < H reduces parameter count.
|
||||
|
||||
## AlbertConfig
|
||||
|
||||
@ -140,7 +69,11 @@ The resources provided in the following sections consist of a list of official H
|
||||
|
||||
## AlbertTokenizer
|
||||
|
||||
[[autodoc]] AlbertTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary
|
||||
[[autodoc]] AlbertTokenizer
|
||||
- build_inputs_with_special_tokens
|
||||
- get_special_tokens_mask
|
||||
- create_token_type_ids_from_sequences
|
||||
- save_vocabulary
|
||||
|
||||
## AlbertTokenizerFast
|
||||
|
||||
@ -152,19 +85,23 @@ The resources provided in the following sections consist of a list of official H
|
||||
|
||||
## AlbertModel
|
||||
|
||||
[[autodoc]] AlbertModel - forward
|
||||
[[autodoc]] AlbertModel
|
||||
- forward
|
||||
|
||||
## AlbertForPreTraining
|
||||
|
||||
[[autodoc]] AlbertForPreTraining - forward
|
||||
[[autodoc]] AlbertForPreTraining
|
||||
- forward
|
||||
|
||||
## AlbertForMaskedLM
|
||||
|
||||
[[autodoc]] AlbertForMaskedLM - forward
|
||||
[[autodoc]] AlbertForMaskedLM
|
||||
- forward
|
||||
|
||||
## AlbertForSequenceClassification
|
||||
|
||||
[[autodoc]] AlbertForSequenceClassification - forward
|
||||
[[autodoc]] AlbertForSequenceClassification
|
||||
- forward
|
||||
|
||||
## AlbertForMultipleChoice
|
||||
|
||||
@ -172,8 +109,10 @@ The resources provided in the following sections consist of a list of official H
|
||||
|
||||
## AlbertForTokenClassification
|
||||
|
||||
[[autodoc]] AlbertForTokenClassification - forward
|
||||
[[autodoc]] AlbertForTokenClassification
|
||||
- forward
|
||||
|
||||
## AlbertForQuestionAnswering
|
||||
|
||||
[[autodoc]] AlbertForQuestionAnswering - forward
|
||||
[[autodoc]] AlbertForQuestionAnswering
|
||||
- forward
|
||||
|
@ -13,46 +13,21 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-02-11 and added to Hugging Face Transformers on 2023-03-01.*
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Transformers" src="https://img.shields.io/badge/Transformers-6B5B95?style=flat&logo=transformers&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2021-02-11 and added to Hugging Face Transformers on 2023-03-01 and contributed by [adirik](https://huggingface.co/adirik).*
|
||||
|
||||
# ALIGN
|
||||
|
||||
[ALIGN](https://huggingface.co/papers/2102.05918) is pretrained on a noisy 1.8 billion alt-text and image pair dataset to show that scale can make up for the noise. It uses a dual-encoder architecture, [EfficientNet](./efficientnet) for images and [BERT](./bert) for text, and a contrastive loss to align similar image-text embeddings together while pushing different embeddings apart. Once trained, ALIGN can encode any image and candidate captions into a shared vector space for zero-shot retrieval or classification without requiring extra labels. This scale-first approach reduces dataset curation costs and powers state-of-the-art image-text retrieval and zero-shot ImageNet classification.
|
||||
|
||||
You can find all the original ALIGN checkpoints under the [Kakao Brain](https://huggingface.co/kakaobrain?search_models=align) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the ALIGN models in the right sidebar for more examples of how to apply ALIGN to different vision and text related tasks.
|
||||
|
||||
The example below demonstrates zero-shot image classification with [`Pipeline`] or the [`AutoModel`] class.
|
||||
|
||||
<hfoptions id="usage">
|
||||
[ALIGN](https://huggingface.co/papers/2102.05918) is a multi-modal vision and language model utilizing a dual-encoder architecture with EfficientNet for vision and BERT for text. It employs contrastive learning to align visual and text representations using a noisy dataset of over one billion image-alt text pairs. Despite the noise, the scale of the dataset enables state-of-the-art performance in image classification and image-text retrieval tasks, surpassing more complex models.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="zero-shot-image-classification",
|
||||
model="kakaobrain/align-base",
|
||||
device=0,
|
||||
dtype=torch.bfloat16
|
||||
)
|
||||
|
||||
candidate_labels = [
|
||||
"a photo of a dog",
|
||||
"a photo of a cat",
|
||||
"a photo of a person"
|
||||
]
|
||||
|
||||
pipeline = pipeline(task="zero-shot-image-classification", model="kakaobrain/align-base", dtype="auto")
|
||||
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
|
||||
```
|
||||
|
||||
@ -66,7 +41,7 @@ from PIL import Image
|
||||
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
|
||||
|
||||
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
|
||||
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base", device_map="auto")
|
||||
model = AutoModelForZeroShotImageClassification.from_pretrained("kakaobrain/align-base", dtype="auto")
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
@ -92,65 +67,8 @@ for label, score in zip(candidate_labels, probs):
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
|
||||
- ALIGN projects the text and visual features into latent space and the dot product between the projected image and text features is used as the similarity score. The example below demonstrates how to calculate the image-text similarity score with [`AlignProcessor`] and [`AlignModel`].
|
||||
|
||||
```py
|
||||
# Example of using ALIGN for image-text similarity
|
||||
from transformers import AlignProcessor, AlignModel
|
||||
import torch
|
||||
from PIL import Image
|
||||
import requests
|
||||
from io import BytesIO
|
||||
|
||||
# Load processor and model
|
||||
processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
|
||||
model = AlignModel.from_pretrained("kakaobrain/align-base")
|
||||
|
||||
# Download image from URL
|
||||
url = "https://huggingface.co/roschmid/dog-races/resolve/main/images/Golden_Retriever.jpg"
|
||||
response = requests.get(url)
|
||||
image = Image.open(BytesIO(response.content)) # Convert the downloaded bytes to a PIL Image
|
||||
|
||||
texts = ["a photo of a cat", "a photo of a dog"]
|
||||
|
||||
# Process image and text inputs
|
||||
inputs = processor(images=image, text=texts, return_tensors="pt")
|
||||
|
||||
# Get the embeddings
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
image_embeds = outputs.image_embeds
|
||||
text_embeds = outputs.text_embeds
|
||||
|
||||
# Normalize embeddings for cosine similarity
|
||||
image_embeds = image_embeds / image_embeds.norm(dim=1, keepdim=True)
|
||||
text_embeds = text_embeds / text_embeds.norm(dim=1, keepdim=True)
|
||||
|
||||
# Calculate similarity scores
|
||||
similarity_scores = torch.matmul(text_embeds, image_embeds.T)
|
||||
|
||||
# Print raw scores
|
||||
print("Similarity scores:", similarity_scores)
|
||||
|
||||
# Convert to probabilities
|
||||
probs = torch.nn.functional.softmax(similarity_scores, dim=0)
|
||||
print("Probabilities:", probs)
|
||||
|
||||
# Get the most similar text
|
||||
most_similar_idx = similarity_scores.argmax().item()
|
||||
print(f"Most similar text: '{texts[most_similar_idx]}'")
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
- Refer to the [Kakao Brain’s Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.
|
||||
|
||||
## AlignConfig
|
||||
|
||||
[[autodoc]] AlignConfig
|
||||
@ -183,3 +101,4 @@ for label, score in zip(candidate_labels, probs):
|
||||
|
||||
[[autodoc]] AlignVisionModel
|
||||
- forward
|
||||
|
||||
|
@ -13,35 +13,37 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-01-04.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-01-04 and contributed by [jongjyh](https://huggingface.co/jongjyh).*
|
||||
|
||||
# AltCLIP
|
||||
|
||||
[AltCLIP](https://huggingface.co/papers/2211.06679) replaces the [CLIP](./clip) text encoder with a multilingual XLM-R encoder and aligns image and text representations with teacher learning and contrastive learning.
|
||||
[AltCLIP](https://huggingface.co/papers/2211.06679v2) alters the text encoder in CLIP by replacing it with a pretrained multilingual text encoder XLM-R. This modification enables the model to achieve state-of-the-art performance on tasks such as ImageNet-CN, Flicker30k-CN, and COCO-CN, while maintaining performance close to CLIP on other tasks. The approach involves a two-stage training schema with teacher learning and contrastive learning to align language and image representations, extending CLIP's capabilities to multilingual understanding.
|
||||
|
||||
You can find all the original AltCLIP checkpoints under the [AltClip](https://huggingface.co/collections/BAAI/alt-clip-diffusion-66987a97de8525205f1221bf) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the AltCLIP models in the right sidebar for more examples of how to apply AltCLIP to different tasks.
|
||||
|
||||
The examples below demonstrate how to calculate similarity scores between an image and one or more captions with the [`AutoModel`] class.
|
||||
This model was contributed by [jongjyh](https://huggingface.co/jongjyh).
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="zero-shot-image-classification", model="kakaobrain/align-base", dtype="auto")
|
||||
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AltCLIPModel, AltCLIPProcessor
|
||||
from transformers import AltCLIPModel, AutoProcessor
|
||||
|
||||
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype=torch.bfloat16)
|
||||
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
|
||||
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
@ -49,8 +51,8 @@ image = Image.open(requests.get(url, stream=True).raw)
|
||||
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
||||
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
||||
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
||||
logits_per_image = outputs.logits_per_image
|
||||
probs = logits_per_image.softmax(dim=1)
|
||||
|
||||
labels = ["a photo of a cat", "a photo of a dog"]
|
||||
for label, prob in zip(labels, probs[0]):
|
||||
@ -60,48 +62,10 @@ for label, prob in zip(labels, probs[0]):
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
|
||||
```python
|
||||
# !pip install torchao
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AltCLIPModel, AltCLIPProcessor, TorchAoConfig
|
||||
|
||||
model = AltCLIPModel.from_pretrained(
|
||||
"BAAI/AltCLIP",
|
||||
quantization_config=TorchAoConfig("int4_weight_only", group_size=128),
|
||||
dtype=torch.bfloat16,
|
||||
)
|
||||
|
||||
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
|
||||
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
||||
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
||||
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
||||
|
||||
labels = ["a photo of a cat", "a photo of a dog"]
|
||||
for label, prob in zip(labels, probs[0]):
|
||||
print(f"{label}: {prob.item():.4f}")
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- AltCLIP uses bidirectional attention instead of causal attention and it uses the `[CLS]` token in XLM-R to represent a text embedding.
|
||||
- Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model.
|
||||
- [`AltCLIPProcessor`] combines [`CLIPImageProcessor`] and [`XLMRobertaTokenizer`] into a single instance to encode text and prepare images, as in the sketch below.
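The minimal sketch below pulls the text and image embeddings separately with `get_text_features` and `get_image_features` (documented further down this page) and scores them with cosine similarity. The Chinese caption is only an illustrative multilingual input.

```py
import torch
import requests
from PIL import Image
from transformers import AltCLIPModel, AutoProcessor

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP", dtype="auto")
processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Multilingual captions; XLM-R's [CLS] token provides the text embedding
text_inputs = processor(text=["a photo of a cat", "一张猫的照片"], padding=True, return_tensors="pt")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    image_embeds = model.get_image_features(**image_inputs)

# Normalize and compare each caption against the image
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(text_embeds @ image_embeds.T)
```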
|
||||
|
||||
## AltCLIPConfig
|
||||
|
||||
[[autodoc]] AltCLIPConfig
|
||||
- from_text_vision_configs
|
||||
|
||||
## AltCLIPTextConfig
|
||||
|
||||
@ -111,18 +75,24 @@ for label, prob in zip(labels, probs[0]):
|
||||
|
||||
[[autodoc]] AltCLIPVisionConfig
|
||||
|
||||
## AltCLIPProcessor
|
||||
|
||||
[[autodoc]] AltCLIPProcessor
|
||||
|
||||
## AltCLIPModel
|
||||
|
||||
[[autodoc]] AltCLIPModel
|
||||
- forward
|
||||
- get_text_features
|
||||
- get_image_features
|
||||
|
||||
## AltCLIPTextModel
|
||||
|
||||
[[autodoc]] AltCLIPTextModel
|
||||
- forward
|
||||
|
||||
## AltCLIPVisionModel
|
||||
|
||||
[[autodoc]] AltCLIPVisionModel
|
||||
- forward
|
||||
|
||||
## AltCLIPProcessor
|
||||
|
||||
[[autodoc]] AltCLIPProcessor
|
||||
|
@ -13,28 +13,20 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2025-09-02 and added to Hugging Face Transformers on 2025-08-28.*
|
||||
|
||||
# Apertus
|
||||
*This model was released on 2025-09-02 and added to Hugging Face Transformers on 2025-10-07.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Overview
|
||||
# Apertus
|
||||
|
||||
[Apertus](https://www.swiss-ai.org) is a family of large language models from the Swiss AI Initiative.
|
||||
|
||||
> [!TIP]
|
||||
> Coming soon
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
@ -42,13 +34,8 @@ The example below demonstrates how to generate text with [`Pipeline`] or the [`A
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text-generation",
|
||||
model="swiss-ai/Apertus-8B",
|
||||
dtype=torch.bfloat16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Plants create energy through a process known as")
|
||||
pipeline = pipeline(task="text-generation", model="swiss-ai/Apertus-8B", dtype="auto")
|
||||
pipeline("Plants generate energy through a process known as ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -56,28 +43,15 @@ pipeline("Plants create energy through a process known as")
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"swiss-ai/Apertus-8B",
|
||||
)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"swiss-ai/Apertus-8B",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda")
|
||||
tokenizer = AutoTokenizer.from_pretrained("swiss-ai/Apertus-8B")
|
||||
model = AutoModelForCausalLM.from_pretrained("swiss-ai/Apertus-8B", dtype="auto")
|
||||
|
||||
output = model.generate(**input_ids)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create energy through a process known as" | transformers run --task text-generation --model swiss-ai/Apertus-8B --device 0
|
||||
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
|
||||
with torch.no_grad():
|
||||
outputs = model.generate(**inputs, max_new_tokens=50)
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
@ -17,7 +17,6 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -29,11 +28,6 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
The Arcee model is architecturally similar to Llama but uses `x * relu(x)` in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
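A minimal sketch of that activation in a toy MLP block (layer sizes are illustrative, not the actual `ArceeForCausalLM` configuration):

```py
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Toy MLP using the x * relu(x) activation described above (hypothetical sizes)."""
    def __init__(self, hidden_size=4096, intermediate_size=11008):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        h = self.up_proj(x)
        return self.down_proj(h * torch.relu(h))  # x * relu(x), equal to relu(x)**2
```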
|
||||
|
||||
> [!TIP]
|
||||
> The Arcee model supports extended context with RoPE scaling and all standard transformers features including Flash Attention 2, SDPA, gradient checkpointing, and quantization support.
|
||||
|
||||
The example below demonstrates how to generate text with Arcee using [`Pipeline`] or the [`AutoModel`].
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
@ -41,15 +35,8 @@ The example below demonstrates how to generate text with Arcee using [`Pipeline`
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text-generation",
|
||||
model="arcee-ai/AFM-4.5B",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
|
||||
output = pipeline("The key innovation in Arcee is")
|
||||
print(output[0]["generated_text"])
|
||||
pipeline = pipeline(task="text-generation", model="arcee-ai/AFM-4.5B", dtype="auto")
|
||||
pipeline("Plants generate energy through a process known as ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -57,16 +44,12 @@ print(output[0]["generated_text"])
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, ArceeForCausalLM
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/AFM-4.5B")
|
||||
model = ArceeForCausalLM.from_pretrained(
|
||||
"arcee-ai/AFM-4.5B",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
model = AutoModelForCausalLM.from_pretrained("arcee-ai/AFM-4.5B", dtype="auto")
|
||||
|
||||
inputs = tokenizer("The key innovation in Arcee is", return_tensors="pt")
|
||||
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
|
||||
with torch.no_grad():
|
||||
outputs = model.generate(**inputs, max_new_tokens=50)
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
@ -102,4 +85,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
## ArceeForTokenClassification
|
||||
|
||||
[[autodoc]] ArceeForTokenClassification
|
||||
- forward
|
||||
- forward
|
@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06.*
|
||||
*This model was released on 2024-10-08 and added to Hugging Face Transformers on 2024-12-06 and contributed by [m-ric](https://huggingface.co/m-ric).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -25,48 +24,27 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Aria
|
||||
|
||||
[Aria](https://huggingface.co/papers/2410.05993) is a multimodal mixture-of-experts (MoE) model. The goal of this model is to open-source a training recipe for creating a multimodal native model from scratch. Aria has 3.9B and 3.5B activated parameters per visual and text token, respectively. Text is handled by a MoE decoder and visual inputs are handled by a lightweight visual encoder. It is trained in 4 stages: language pretraining, multimodal pretraining, multimodal long-context pretraining, and multimodal post-training.
|
||||
|
||||
You can find all the original Aria checkpoints under the [Aria](https://huggingface.co/rhymes-ai?search_models=aria) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Aria models in the right sidebar for more examples of how to apply Aria to different multimodal tasks.
|
||||
|
||||
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Aria](https://huggingface.co/papers/2410.05993) is an open multimodal-native model designed to integrate diverse information sources and deliver comprehensive understanding. It employs a Mixture-of-Experts architecture with 3.9B and 3.5B activated parameters per visual and text token, respectively. Aria outperforms models like Pixtral-12B and Llama3.2-11B across various multimodal, language, and coding tasks. The model is pre-trained through a 4-stage pipeline that enhances language understanding, multimodal capabilities, long context handling, and instruction following. Aria's weights and codebase are open-sourced to facilitate adoption and adaptation in real-world applications.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
"image-to-text",
|
||||
model="rhymes-ai/Aria",
|
||||
device=0,
|
||||
dtype=torch.bfloat16
|
||||
)
|
||||
pipeline(
|
||||
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
|
||||
text="What is shown in this image?"
|
||||
)
|
||||
pipeline = pipeline(task="image-to-text", model="rhymes-ai/Aria", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", text="What is shown in this image?")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoProcessor
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"rhymes-ai/Aria",
|
||||
device_map="auto",
|
||||
dtype=torch.bfloat16,
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("rhymes-ai/Aria", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")
|
||||
|
||||
messages = [
|
||||
@ -81,8 +59,7 @@ messages = [
|
||||
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
|
||||
inputs = inputs.to(model.device, torch.bfloat16)
|
||||
|
||||
output = model.generate(
|
||||
**inputs,
|
||||
output = model.generate(**inputs,
|
||||
max_new_tokens=15,
|
||||
stop_strings=["<|im_end|>"],
|
||||
tokenizer=processor.tokenizer,
|
||||
@ -97,51 +74,6 @@ print(response)
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4 and the [rhymes-ai/Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) checkpoint. This checkpoint replaces grouped GEMM with `torch.nn.Linear` layers for easier quantization.
|
||||
|
||||
```py
|
||||
# pip install torchao
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoProcessor
|
||||
|
||||
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"rhymes-ai/Aria-sequential_mlp",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
processor = AutoProcessor.from_pretrained(
|
||||
"rhymes-ai/Aria-sequential_mlp",
|
||||
)
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user", "content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]
|
||||
},
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
|
||||
inputs = inputs.to(model.device, torch.bfloat16)
|
||||
|
||||
output = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=15,
|
||||
stop_strings=["<|im_end|>"],
|
||||
tokenizer=processor.tokenizer,
|
||||
do_sample=True,
|
||||
temperature=0.9,
|
||||
)
|
||||
output_ids = output[0][inputs["input_ids"].shape[1]:]
|
||||
response = processor.decode(output_ids, skip_special_tokens=True)
|
||||
print(response)
|
||||
```
|
||||
|
||||
## AriaImageProcessor
|
||||
|
||||
[[autodoc]] AriaImageProcessor
|
||||
@ -162,15 +94,17 @@ print(response)
|
||||
|
||||
[[autodoc]] AriaTextModel
|
||||
|
||||
## AriaModel
|
||||
|
||||
[[autodoc]] AriaModel
|
||||
|
||||
## AriaTextForCausalLM
|
||||
|
||||
[[autodoc]] AriaTextForCausalLM
|
||||
|
||||
## AriaModel
|
||||
|
||||
[[autodoc]] AriaModel
|
||||
- forward
|
||||
|
||||
## AriaForConditionalGeneration
|
||||
|
||||
[[autodoc]] AriaForConditionalGeneration
|
||||
- forward
|
||||
|
||||
|
@ -13,82 +13,55 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-04-05 and added to Hugging Face Transformers on 2022-11-21.*
|
||||
*This model was released on 2021-04-05 and added to Hugging Face Transformers on 2022-11-21 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Audio Spectrogram Transformer
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Audio Spectrogram Transformer](https://huggingface.co/papers/2104.01778) applies a Vision Transformer to audio by converting audio into spectrograms, achieving state-of-the-art results in audio classification without using convolutional layers. It outperforms existing models on benchmarks like AudioSet, ESC-50, and Speech Commands V2, demonstrating the effectiveness of purely attention-based models in this domain.
|
||||
|
||||
## Overview
|
||||
|
||||
The Audio Spectrogram Transformer model was proposed in [AST: Audio Spectrogram Transformer](https://huggingface.co/papers/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
The Audio Spectrogram Transformer applies a [Vision Transformer](vit) to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results
|
||||
for audio classification.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/audio_spectogram_transformer_architecture.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Audio Spectrogram Transformer architecture. Taken from the <a href="https://huggingface.co/papers/2104.01778">original paper</a>.</small>
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/YuanGongND/ast).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, take care of the input normalization (make sure the input has a mean of 0 and a std of 0.5). [`ASTFeatureExtractor`] takes care of this. Note that it uses the AudioSet mean and std by default. Check [`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py) to see how the authors compute the stats for a downstream dataset, and see the sketch after these tips for overriding the defaults.
|
||||
- Note that the AST needs a low learning rate (the authors use a 10 times smaller learning rate compared to their CNN model proposed in the
|
||||
[PSLA paper](https://huggingface.co/papers/2102.01243)) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task.
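A minimal sketch of that override, assuming `dataset_mean` and `dataset_std` were computed for your own data (the values below are placeholders):

```py
from transformers import ASTFeatureExtractor

# Placeholder stats; compute real ones for your dataset (see get_norm_stats.py above)
dataset_mean, dataset_std = -6.85, 5.42

feature_extractor = ASTFeatureExtractor.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    mean=dataset_mean,  # replaces the AudioSet mean
    std=dataset_std,    # replaces the AudioSet std
)
```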
|
||||
|
||||
### Using Scaled Dot Product Attention (SDPA)
|
||||
|
||||
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
|
||||
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
|
||||
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
|
||||
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
|
||||
page for more information.
|
||||
|
||||
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
|
||||
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
from transformers import ASTForAudioClassification
|
||||
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", dtype=torch.float16)
|
||||
...
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="audio-classification",model="MIT/ast-finetuned-audioset-10-10-0.4593", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
|
||||
```
|
||||
|
||||
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel"
|
||||
|
||||
On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and `MIT/ast-finetuned-audioset-10-10-0.4593` model, we saw the following speedups during inference.
|
||||
```py
|
||||
import torch
|
||||
from datasets import load_dataset
|
||||
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
|
||||
|
||||
| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
|
||||
|--------------|-------------------------------------------|-------------------------------------------|------------------------------|
|
||||
| 1 | 27 | 6 | 4.5 |
|
||||
| 2 | 12 | 6 | 2 |
|
||||
| 4 | 21 | 8 | 2.62 |
|
||||
| 8 | 40 | 14 | 2.86 |
|
||||
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation").sort("id")
|
||||
sampling_rate = dataset.features["audio"].sampling_rate
|
||||
|
||||
## Resources
|
||||
feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
|
||||
model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with the Audio Spectrogram Transformer.
|
||||
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
|
||||
|
||||
<PipelineTag pipeline="audio-classification"/>
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
- A notebook illustrating inference with AST for audio classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/AST).
|
||||
- [`ASTForAudioClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb).
|
||||
- See also: [Audio classification](../tasks/audio_classification).
|
||||
predicted_class_ids = torch.argmax(logits, dim=-1).item()
|
||||
print(f"Predicted label: {model.config.id2label[predicted_class_ids]}")
|
||||
```
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ASTConfig
|
||||
|
||||
@ -108,3 +81,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] ASTForAudioClassification
|
||||
- forward
|
||||
|
||||
|
@ -29,7 +29,7 @@ model = AutoModel.from_pretrained("google-bert/bert-base-cased")
|
||||
|
||||
will create a model that is an instance of [`BertModel`].
|
||||
|
||||
There is one class of `AutoModel` for each task.
|
||||
There is one class of `AutoModel` for each task, and for each backend (PyTorch, TensorFlow, or Flax).
|
||||
|
||||
## Extending the Auto Classes
|
||||
|
||||
@ -48,7 +48,7 @@ You will then be able to use the auto classes like you would usually do!
|
||||
|
||||
<Tip warning={true}>
|
||||
|
||||
If your `NewModelConfig` is a subclass of [`~transformers.PreTrainedConfig`], make sure its
|
||||
If your `NewModelConfig` is a subclass of [`~transformers.PretrainedConfig`], make sure its
|
||||
`model_type` attribute is set to the same key you use when registering the config (here `"new-model"`).
|
||||
|
||||
Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
|
||||
@ -73,14 +73,14 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
|
||||
|
||||
[[autodoc]] AutoImageProcessor
|
||||
|
||||
## AutoVideoProcessor
|
||||
|
||||
[[autodoc]] AutoVideoProcessor
|
||||
|
||||
## AutoProcessor
|
||||
|
||||
[[autodoc]] AutoProcessor
|
||||
|
||||
## AutoVideoProcessor
|
||||
|
||||
[[autodoc]] AutoVideoProcessor
|
||||
|
||||
## Generic model classes
|
||||
|
||||
The following auto classes are available for instantiating a base model class without a specific head.
|
||||
@ -161,10 +161,6 @@ The following auto classes are available for the following computer vision tasks
|
||||
|
||||
[[autodoc]] AutoModelForKeypointDetection
|
||||
|
||||
### AutoModelForKeypointMatching
|
||||
|
||||
[[autodoc]] AutoModelForKeypointMatching
|
||||
|
||||
### AutoModelForMaskedImageModeling
|
||||
|
||||
[[autodoc]] AutoModelForMaskedImageModeling
|
||||
@ -201,6 +197,10 @@ The following auto classes are available for the following computer vision tasks
|
||||
|
||||
[[autodoc]] AutoModelForZeroShotObjectDetection
|
||||
|
||||
### AutoModelForKeypointMatching
|
||||
|
||||
[[autodoc]] AutoModelForKeypointMatching
|
||||
|
||||
## Audio
|
||||
|
||||
The following auto classes are available for the following audio tasks.
|
||||
@ -261,8 +261,6 @@ The following auto classes are available for the following multimodal tasks.
|
||||
|
||||
[[autodoc]] AutoModelForImageTextToText
|
||||
|
||||
## Time Series
|
||||
|
||||
### AutoModelForTimeSeriesPrediction
|
||||
|
||||
[[autodoc]] AutoModelForTimeSeriesPrediction
|
||||
|
@ -13,32 +13,39 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30.*
|
||||
*This model was released on 2021-06-24 and added to Hugging Face Transformers on 2023-05-30 and contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif).*
|
||||
|
||||
# Autoformer
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://huggingface.co/papers/2106.13008) addresses the challenge of long-term time series forecasting by introducing a novel decomposition architecture. Autoformer integrates an Auto-Correlation mechanism that progressively decomposes trend and seasonal components, enhancing the model's ability to capture intricate temporal patterns. This approach surpasses traditional self-attention methods in both efficiency and accuracy, achieving state-of-the-art results with a 38% relative improvement across six benchmarks in diverse applications including energy, traffic, economics, weather, and disease forecasting.
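As a minimal sketch of the decomposition idea the paper builds on (a moving-average trend plus a seasonal remainder, not the exact Autoformer inner block):

```py
import torch

def decompose(x, kernel_size=25):
    """Toy series decomposition: moving-average trend plus seasonal remainder."""
    # x: (batch, length) time series; pad so the trend keeps the original length
    pad = (kernel_size - 1) // 2
    padded = torch.nn.functional.pad(x.unsqueeze(1), (pad, kernel_size - 1 - pad), mode="replicate")
    trend = torch.nn.functional.avg_pool1d(padded, kernel_size, stride=1).squeeze(1)
    seasonal = x - trend
    return seasonal, trend

series = torch.randn(2, 96).cumsum(dim=-1)  # random walk as a stand-in series
seasonal, trend = decompose(series)
```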
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="AutoformerForPrediction">
|
||||
|
||||
The Autoformer model was proposed in [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://huggingface.co/papers/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
```py
|
||||
import torch
|
||||
from huggingface_hub import hf_hub_download
|
||||
from transformers import AutoformerForPrediction
|
||||
|
||||
This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.
|
||||
file = hf_hub_download(
|
||||
repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
|
||||
)
|
||||
batch = torch.load(file)
|
||||
|
||||
The abstract from the paper is the following:
|
||||
model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly", dtype="auto")
|
||||
outputs = model.generate(
|
||||
past_values=batch["past_values"],
|
||||
past_time_features=batch["past_time_features"],
|
||||
past_observed_mask=batch["past_observed_mask"],
|
||||
static_categorical_features=batch["static_categorical_features"],
|
||||
future_time_features=batch["future_time_features"],
|
||||
)
|
||||
|
||||
*Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.*
|
||||
mean_prediction = outputs.sequences.mean(dim=1)
|
||||
```
|
||||
|
||||
This model was contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif).
|
||||
The original code can be found [here](https://github.com/thuml/Autoformer).
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
- Check out the Autoformer blog post on the Hugging Face blog: [Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)](https://huggingface.co/blog/autoformer)
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## AutoformerConfig
|
||||
|
||||
@ -53,3 +60,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] AutoformerForPrediction
|
||||
- forward
|
||||
|
||||
|
@ -13,250 +13,64 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2025-05-13 and added to Hugging Face Transformers on 2025-03-04.*
|
||||
*This model was released on 2025-05-13 and added to Hugging Face Transformers on 2025-03-04 and contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
# AyaVision
|
||||
|
||||
# Aya Vision
|
||||
|
||||
[Aya Vision](https://huggingface.co/papers/2505.08751) is a family of open-weight multimodal vision-language models from Cohere Labs. It is trained with a synthetic annotation framework that generates high-quality multilingual image captions, improving Aya Vision's generated responses. In addition, a cross-modal model merging technique is used to prevent the model from losing its text capabilities after adding vision capabilities. The model combines a CommandR-7B language model with a SigLIP vision encoder.
|
||||
|
||||
You can find all the original Aya Vision checkpoints under the [Aya Vision](https://huggingface.co/collections/CohereLabs/cohere-labs-aya-vision-67c4ccd395ca064308ee1484) collection.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan).
|
||||
>
|
||||
> Click on the Aya Vision models in the right sidebar for more examples of how to apply Aya Vision to different image-to-text tasks.
|
||||
|
||||
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[Aya Vision](https://huggingface.co/papers/2505.08751) introduces two key innovations for multilingual multimodal learning: a synthetic annotation framework that generates high-quality, diverse instruction data across languages, and a cross-modal model merging technique that prevents catastrophic forgetting while preserving strong text-only performance. These methods enable effective alignment between vision and language without degrading existing capabilities. Aya-Vision-8B surpasses comparable models like Qwen-2.5-VL-7B, Pixtral-12B, and even larger models such as Llama-3.2-90B-Vision, while the larger Aya-Vision-32B outperforms models more than twice its size, including Molmo-72B. The approach demonstrates efficient scaling and state-of-the-art multilingual multimodal performance with reduced computational demands.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(model="CohereLabs/aya-vision-8b", task="image-text-to-text", device_map="auto")
|
||||
|
||||
# Format message with the aya-vision chat template
|
||||
pipeline = pipeline(task="image-text-to-text", model="CohereLabs/aya-vision-8b", dtype="auto")
|
||||
messages = [
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
|
||||
{"type": "text", "text": "Bu resimde hangi anıt gösterilmektedir?"},
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "Que montre cette image?"},
|
||||
]},
|
||||
]
|
||||
outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
|
||||
print(outputs)
|
||||
]
|
||||
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
# pip install 'git+https://github.com/huggingface/transformers.git@v4.49.0-Aya Vision'
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
model_id = "CohereLabs/aya-vision-8b"
|
||||
processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-8b)
|
||||
model = AutoModelForImageTextToText.from_pretrained("CohereLabs/aya-vision-8b", dtype="auto")
|
||||
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
model_id, device_map="auto", dtype=torch.float16
|
||||
)
|
||||
|
||||
# Format message with the aya-vision chat template
|
||||
messages = [
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
|
||||
{"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "Que montre cette image?"},
|
||||
]},
|
||||
]
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
|
||||
).to(model.device)
|
||||
)
|
||||
|
||||
gen_tokens = model.generate(
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=300,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
|
||||
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
|
||||
print(processor.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory footprint of large models by representing the weights at a lower precision. Refer to the [Quantization](../quantization/overview) overview for supported backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to 4-bit.

```python
import torch
from transformers import (
    AutoProcessor,
    AutoModelForImageTextToText,
    BitsAndBytesConfig
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-32b", use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(
    "CohereLabs/aya-vision-32b",
    quantization_config=bnb_config,
    device_map="auto"
)

inputs = processor.apply_chat_template(
    [
        {"role": "user", "content": [
            {"type": "image", "url": "https://huggingface.co/roschmid/dog-races/resolve/main/images/Border_Collie.jpg"},
            {"type": "text", "text": "Describe what you see."}
        ]}
    ],
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=50)
print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))
```

## Notes

- Images are represented with the `<image>` tag in the chat template.
- Use the [`~ProcessorMixin.apply_chat_template`] method to correctly format inputs, as shown in the sketch below.

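The snippet below is a minimal sketch of that formatting step. It renders the chat template as plain text (no tokenization) so the `<image>` placeholder inserted for each image entry is visible; the exact rendered string depends on the processor version.

```py
# Sketch: render the chat template without tokenizing to see how image
# entries become <image> placeholders in the prompt.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-8b")
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
        {"type": "text", "text": "Que montre cette image?"},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
```
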
- The example below demonstrates inference with multiple images.

```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
model = AutoModelForImageTextToText.from_pretrained(
    "CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
            },
            {
                "type": "image",
                "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
            },
            {
                "type": "text",
                "text": "These images depict two different landmarks. Can you identify them?",
            },
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

gen_text = processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(gen_text)
```

- The example below demonstrates inference with batched inputs.

```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("CohereForAI/aya-vision-8b")
model = AutoModelForImageTextToText.from_pretrained(
    "CohereForAI/aya-vision-8b", device_map="auto", dtype=torch.float16
)

batch_messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                },
                {
                    "type": "image",
                    "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg",
                },
                {
                    "type": "text",
                    "text": "These images depict two different landmarks. Can you identify them?",
                },
            ],
        },
    ],
]

batch_inputs = processor.apply_chat_template(
    batch_messages,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

batch_outputs = model.generate(
    **batch_inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

for i, output in enumerate(batch_outputs):
    response = processor.tokenizer.decode(
        output[batch_inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )
    print(f"Response {i+1}:\n{response}\n")
```

## AyaVisionProcessor

[[autodoc]] AyaVisionProcessor

@ -268,6 +82,7 @@ print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))

## AyaVisionModel

[[autodoc]] AyaVisionModel
    - forward

## AyaVisionForConditionalGeneration

@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.

rendered properly in your Markdown viewer.

-->
*This model was released on 2024-12-18 and added to Hugging Face Transformers on 2024-12-19 and contributed by [ani300](https://github.com/ani300) and [fabianlim](https://github.com/fabianlim).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
@ -25,106 +24,52 @@ rendered properly in your Markdown viewer.

# Bamba

[Bamba-9B](https://github.com/state-spaces/mamba) is a hybrid language model that combines Mamba2 and Transformer layers to improve inference efficiency. By interleaving Mamba2 layers, it avoids the memory bottleneck of the Transformer's growing KV-cache, achieving up to 2.5x higher throughput and 2x lower latency in vLLM. The model has 9 billion parameters and was trained on 2.2 trillion tokens of open data, with full training recipes and checkpoints released for reproducibility. It integrates with Hugging Face tools like Transformers, TRL, vLLM, and llama.cpp, and comes with additional resources such as a stateless shuffle dataloader and quantization support. Developed in collaboration with IBM, Princeton, CMU, and UIUC, Bamba is an open, efficient foundation for experimenting with hybrid architectures.

Find all the original Bamba checkpoints under the [Bamba](https://huggingface.co/collections/ibm-ai-platform/bamba-674f1388b9bbc98b413c7bab) collection.

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text-generation",
|
||||
model="ibm-ai-platform/Bamba-9B-v2",
|
||||
dtype=torch.bfloat16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Plants create energy through a process known as")
|
||||
pipeline = pipeline(task="text-generation", model="ibm-fms/Bamba-9B", dtype="auto")
|
||||
pipeline("Plants generate energy through a process known as ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
|
||||
model = AutoModelForCausalLM.from_pretrained("ibm-ai-platform/Bamba-9B-v2", dtype=torch.bfloat16, device_map="auto", attn_implementation="sdpa")
|
||||
input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")
|
||||
|
||||
output = model.generate(**input_ids)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors='pt', return_token_type_ids=False)
|
||||
outputs = model.generate(**inputs, max_new_tokens=64)
|
||||
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="transformers CLI">
|
||||
```bash
|
||||
echo "Plants create energy through a process known as" | transformers run --task text-generation --model ibm-ai-platform/Bamba-9B-v2 --device 0
|
||||
```
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
- Bamba supports padding-free training. This concatenates distinct training examples while processing inputs as separate batches. Expect ~2x inference acceleration (varies by model and data distribution). Memory usage drops when examples have varying lengths since you avoid padding token overhead.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
|
||||
- Padding-free training requires the flash-attn, mamba-ssm, and causal-conv1d packages. Pass these arguments alongside `input_ids` and `labels`:
|
||||
|
||||
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
tokenizer = AutoTokenizer.from_pretrained("ibm-ai-platform/Bamba-9B-v2")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"ibm-ai-platform/Bamba-9B-v2",
|
||||
quantization_config=quantization_config,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
- `position_ids`: `torch.LongTensor` - position index of each token in each sequence
|
||||
- `seq_idx`: `torch.LongTensor` - index of each sequence in the batch
|
||||
- `FlashAttentionKwargs`:
|
||||
- `cu_seq_lens_q`: `torch.LongTensor` - cumulative sequence lengths of all queries
|
||||
- `cu_seq_lens_k`: `torch.LongTensor` - cumulative sequence lengths of all keys
|
||||
- `max_length_q`: `int` - longest query length in the batch
|
||||
- `max_length_k`: `int` - longest key length in the batch
|
||||
|
||||
inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
|
||||
output = model.generate(**inputs)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Bamba supports padding-free training which concatenates distinct training examples while still processing inputs as separate batches. It can significantly accelerate inference by [~2x](https://github.com/huggingface/transformers/pull/35861#issue-2807873129) (depending on model and data distribution) and reduce memory-usage if there are examples of varying lengths by avoiding unnecessary compute and memory overhead from padding tokens.
|
||||
|
||||
Padding-free training requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d` packages and the following arguments must be passed to the model in addition to `input_ids` and `labels`.
|
||||
|
||||
- `position_ids: torch.LongTensor`: the position index of each token in each sequence.
|
||||
- `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
|
||||
- Each of the [`FlashAttentionKwargs`]
|
||||
- `cu_seq_lens_q: torch.LongTensor`: the cumulative sequence lengths of all queries.
|
||||
- `cu_seq_lens_k: torch.LongTensor`: the cumulative sequence lengths of all keys.
|
||||
- `max_length_q: int`: the longest query length in the batch.
|
||||
- `max_length_k: int`: the longest key length in the batch.
|
||||
|
||||
The `attention_mask` inputs should not be provided. The [`DataCollatorWithFlattening`] programmatically generates the set of additional arguments above using `return_seq_idx=True` and `return_flash_attn_kwargs=True`. See the [Improving Hugging Face Training Efficiency Through Packing with Flash Attention](https://huggingface.co/blog/packing-with-FA2) blog post for additional information.
|
||||
|
||||
```python
|
||||
from transformers import DataCollatorWithFlattening
|
||||
|
||||
# Example of using padding-free training
|
||||
data_collator = DataCollatorWithFlattening(
|
||||
tokenizer=tokenizer,
|
||||
return_seq_idx=True,
|
||||
return_flash_attn_kwargs=True
|
||||
)
|
||||
```
|
||||
- Don't provide `attention_mask` inputs. The [`DataCollatorWithFlattening`] generates these arguments automatically when you set `return_seq_idx=True` and `return_flash_attn_kwargs=True`. See the [Improving Hugging Face Training Efficiency Through Packing with Flash Attention](https://huggingface.co/blog/packing-with-FA2) blog post for details.
|
||||
|
||||
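If you're fine-tuning with [`Trainer`], pass the collator directly. The snippet below is a minimal sketch; `model` and `tokenized_dataset` are placeholder names assumed to exist already.

```python
# Sketch: plug the padding-free collator into Trainer.
# `model` and `tokenized_dataset` (with input_ids and labels) are placeholders.
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bamba-padding-free"),
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()
```
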
## BambaConfig

@ -9,165 +9,50 @@ Unless required by applicable law or agreed to in writing, software distributed
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
*This model was released on 2023-04-09 and added to Hugging Face Transformers on 2023-07-17 and contributed by [ylacombe](https://huggingface.co/ylacombe) and [sanchit-gandhi](https://github.com/sanchit-gandhi).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
    </div>
</div>

# Bark

[Bark](https://github.com/suno-ai/bark) is a text-to-audio generative model capable of producing realistic speech, music, and sound effects directly from text prompts. It's built on a transformer-based architecture that models audio tokens rather than phonemes, so it captures tone, emotion, and multilingual speech without explicit linguistic preprocessing. Bark uses semantic and coarse acoustic tokens, trained on diverse multilingual datasets, to generate natural prosody and expressive delivery. Its outputs are decoded from discrete audio representations, similar in spirit to models like EnCodec or VALL-E, allowing highly expressive and context-aware audio synthesis.

Bark is made of four main models:

- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of [`BarkSemanticModel`] as input and predicts the first two audio codebooks required by EnCodec.
- [`BarkFineModel`] (the 'fine acoustics' model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.
- [`EncodecModel`]: once all the codebook channels are predicted, Bark uses it to decode the output audio array.

Each of the first three modules supports conditional speaker embeddings to condition the output sound on a specific predefined voice, and each is exposed on [`BarkModel`] as shown in the sketch below.

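A minimal sketch of that structure, assuming the sub-model attribute names used by the current [`BarkModel`] implementation (`semantic`, `coarse_acoustics`, `fine_acoustics`, `codec_model`):

```py
# Sketch: load the composite model and inspect its sub-models.
from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")

print(type(model.semantic).__name__)          # text model
print(type(model.coarse_acoustics).__name__)  # coarse acoustics model
print(type(model.fine_acoustics).__name__)    # fine acoustics model
print(type(model.codec_model).__name__)       # audio codec used for decoding
```
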
<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-to-audio", model="suno/bark-small", dtype="auto")
output = pipeline("Plants create energy through a process known as photosynthesis.")
audio = output["audio"]
```

</hfoption>
<hfoption id="BarkModel">

```py
import torch
from scipy.io.wavfile import write as write_wav
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark", dtype="auto")

inputs = processor("Plants create energy through a process known as photosynthesis.", voice_preset="v2/en_speaker_6")
audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()
sample_rate = model.generation_config.sample_rate
write_wav("bark_generation.wav", sample_rate, audio_array)
```

</hfoption>
</hfoptions>

### Optimizing Bark

Bark can be optimized with just a few extra lines of code, which significantly reduce its memory footprint and accelerate inference.

#### Using half-precision

Speed up inference and cut the memory footprint by about 50% by loading the model in half-precision.

```python
import torch
from transformers import BarkModel
from accelerate import Accelerator

device = Accelerator().device
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16).to(device)
```

#### Using CPU offload

Bark's four sub-models run sequentially during audio generation, so while one sub-model is in use, the others sit idle. On a CUDA GPU or Intel XPU, offload the idle sub-models to the CPU to cut the memory footprint by about 80%. This operation is called *CPU offloading*, and it takes one line of code.

```python
model.enable_cpu_offload()
```

🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)

#### Using Flash Attention 2

Flash Attention 2 is an even faster, optimized version of the previous optimization. First, check whether your hardware is compatible with Flash Attention 2 in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features), then install the latest version.

```bash
pip install -U flash-attn --no-build-isolation
```

To load a model with Flash Attention 2, pass the `attn_implementation="flash_attention_2"` flag to [`~PreTrainedModel.from_pretrained`]. Load the model in half-precision (e.g. `torch.float16`) as well, since it costs almost no audio quality but gives significantly lower memory usage and faster inference.

```python
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
```

The following diagram shows the latency for the native attention implementation (no optimization) against Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1.

<div style="text-align: center">
<img src="https://huggingface.co/datasets/ylacombe/benchmark-comparison/resolve/main/Bark%20Optimization%20Benchmark.png">
</div>

To put this into perspective, on an NVIDIA A100, generating 400 semantic tokens with a batch size of 16 gives 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still finishes 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples generate 17 times faster.

#### Combining optimization techniques

Combine CPU offload, half-precision, and Flash Attention 2 all at once.

```python
import torch
from transformers import BarkModel
from accelerate import Accelerator

device = Accelerator().device

# load in fp16 and use Flash Attention 2
model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2").to(device)

# enable CPU offload
model.enable_cpu_offload()
```

Find out more on inference optimization techniques [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one).

### Usage tips

Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c). These presets are also uploaded to the Hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).

```python
>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")

>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```

Bark can generate highly realistic, **multilingual** speech as well as other audio, including music, background noise, and simple sound effects.

```python
>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的!我会说中文")

>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")

>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```

The model can also produce **nonverbal communication** like laughing, sighing, and crying.

```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```

To save the audio, take the sample rate from the model config and use a scipy utility.

```python
>>> from scipy.io.wavfile import write as write_wav

>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
```

## BarkConfig

@ -220,3 +105,4 @@ To save the audio, simply take the sample rate from the model config and some sc

[[autodoc]] BarkSemanticConfig
    - all

@ -13,22 +13,18 @@ specific language governing permissions and limitations under the License.

rendered properly in your Markdown viewer.

-->
*This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16 and contributed by [sshleifer](https://huggingface.co/sshleifer).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BART

[BART](https://huggingface.co/papers/1910.13461) is a Transformer-based sequence-to-sequence model trained as a denoising autoencoder: text is corrupted with noise and the model learns to reconstruct the original. Its architecture combines a bidirectional encoder like BERT with a left-to-right decoder like GPT, making it a general framework for many pretraining approaches. Using techniques like sentence shuffling and span in-filling, BART achieves strong results on both generation and comprehension tasks, matching RoBERTa on GLUE and SQuAD while setting new state-of-the-art results in summarization, dialogue, and question answering. It also boosts machine translation performance and allows ablation experiments that replicate and compare other pretraining schemes.

Find all the original BART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=bart) organization.

The example below demonstrates how to summarize text with [`Pipeline`] and [`AutoModel`].

<hfoptions id="usage">
<hfoption id="Pipeline">

@ -37,14 +33,8 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="summarization", model="facebook/bart-large-cnn", dtype="auto")
pipeline("The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.")
```

</hfoption>
@ -52,48 +42,30 @@ pipeline("Plants create <mask> through a process known as photosynthesis.

<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

text = """
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930.
"""
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```

</hfoption>
</hfoptions>

## Usage tips

- Pad inputs on the right. BART uses absolute position embeddings.
- The [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) checkpoint lacks `mask_token_id`, so it can't perform mask-filling tasks. Use a checkpoint with a mask token, like [facebook/bart-large](https://huggingface.co/facebook/bart-large), as in the sketch below.
- BART ignores `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] for proper splitting.
- [`BartModel`] creates `decoder_input_ids` automatically if you don't pass them. This differs from other model APIs but helps with mask-filling tasks.
- Model predictions match the original implementation when `forced_bos_token_id=0`. This works only if the text passed to `fairseq.encode` starts with a space.
- Use [`~GenerationMixin.generate`] for conditional generation tasks like summarization.

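The snippet below is a short sketch of the mask-filling note above. It assumes the [facebook/bart-large](https://huggingface.co/facebook/bart-large) checkpoint, which does define a mask token.

```py
# Sketch: fill a <mask> token with a BART checkpoint that has a mask token.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForMaskedLM.from_pretrained("facebook/bart-large")

inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the <mask> token, then the highest-scoring replacement.
mask_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```
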
## BartConfig

@ -134,3 +106,4 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran

[[autodoc]] BartForCausalLM
    - forward

@ -13,25 +13,11 @@ specific language governing permissions and limitations under the License.

rendered properly in your Markdown viewer.

-->
*This model was released on 2020-10-23 and added to Hugging Face Transformers on 2020-11-27 and contributed by [moussakam](https://huggingface.co/moussakam).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BARThez

[BARThez](https://huggingface.co/papers/2010.12321) is the first BART model for the French language, pretrained on a large monolingual French corpus. Unlike BERT-based models like CamemBERT and FlauBERT, BARThez pretrains both an encoder and a decoder, making it well suited for generative tasks. Evaluated on the FLUE benchmark and a new summarization dataset, OrangeSum, BARThez demonstrates strong performance. Continuing the pretraining of multilingual BART on BARThez's corpus yields mBARThez, which outperforms or matches CamemBERT and FlauBERT.

Find all of the original BARThez checkpoints under the [BARThez](https://huggingface.co/collections/dascim/barthez-670920b569a07aa53e3b6887) collection. Refer to the [BART](./bart) docs for more usage examples.

The example below demonstrates how to predict the `<mask>` token with [`Pipeline`] and [`AutoModel`].

<hfoptions id="usage">
<hfoption id="Pipeline">

@ -40,13 +26,8 @@ The example below demonstrates how to predict the `<mask>` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="moussaKam/barthez",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.")
|
||||
pipeline = pipeline("fill-mask", model="moussaKam/barthez", dtype="auto")
|
||||
pipeline("Les plantes créent <mask> grâce à un processus appelé photosynthèse.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -56,32 +37,15 @@ pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynt
|
||||
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("moussaKam/barthez", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez")

inputs = tokenizer("Les plantes créent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt")
outputs = model(**inputs)
mask_token_id = tokenizer.mask_token_id
mask_position = (inputs.input_ids == mask_token_id).nonzero(as_tuple=True)[1]
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
print(f"Predicted word: {predicted_word}")
```

</hfoption>

@ -13,92 +13,47 @@ specific language governing permissions and limitations under the License.

rendered properly in your Markdown viewer.

-->
*This model was released on 2021-09-20 and added to Hugging Face Transformers on 2021-10-18 and contributed by [dqnguyen](https://huggingface.co/dqnguyen).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BARTpho

[BARTpho](https://huggingface.co/papers/2109.09701) introduces two versions, BARTpho_word and BARTpho_syllable, as the first large-scale monolingual sequence-to-sequence models pretrained for Vietnamese. Leveraging the "large" architecture and pretraining scheme of BART, BARTpho excels at generative NLP tasks. Evaluations on Vietnamese text summarization show that BARTpho surpasses mBART and sets a new state of the art. The model is released to support future research and applications in generative Vietnamese NLP.

Find all the original checkpoints under the [VinAI](https://huggingface.co/vinai/models?search=bartpho) organization. The original code can be found [here](https://github.com/VinAIResearch/BARTpho).

The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline("text2text-generation", model="vinai/bartpho-syllable", dtype="auto")
pipeline("Thực vật tạo ra năng lượng thông qua một quá trình được gọi là")
```

</hfoption>

<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import BartForConditionalGeneration, AutoTokenizer
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"vinai/bartpho-word",
|
||||
)
|
||||
model = BartForConditionalGeneration.from_pretrained(
|
||||
"vinai/bartpho-word",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
|
||||
|
||||
text = """
|
||||
Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
|
||||
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
|
||||
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ
|
||||
"""
|
||||
inputs = tokenizer(text, return_tensors="pt").to(model.device)
|
||||
|
||||
outputs = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=20)
|
||||
tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
|
||||
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
|
||||
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ" | \
|
||||
transformers run --task summarization --model vinai/bartpho-word --device 0
|
||||
inputs = tokenizer("Thực vật tạo ra năng lượng thông qua một quá trình được gọi là", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Usage tips

- BARTpho uses BART's large architecture plus an extra layer-normalization layer on the encoder and decoder. Replace BART-specific classes with mBART-specific classes, as in the sketch below.
- This implementation handles tokenization through the `monolingual_vocab_file`. This contains Vietnamese-specific token types from the multilingual vocabulary. For other languages, replace `monolingual_vocab_file` with one specialized for your target language.

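A minimal sketch of the first tip, assuming the BARTpho checkpoint weights map cleanly onto the mBART classes:

```py
# Sketch: use the mBART classes (not the BART ones) with BARTpho.
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")

inputs = tokenizer("Thực vật tạo ra năng lượng thông qua một quá trình được gọi là", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
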
## BartphoTokenizer

@ -13,120 +13,55 @@ specific language governing permissions and limitations under the License.

rendered properly in your Markdown viewer.

-->
*This model was released on 2021-06-15 and added to Hugging Face Transformers on 2021-08-04 and contributed by [nielsr](https://huggingface.co/nielsr).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BEiT

[BEiT: BERT Pre-Training of Image Transformers](https://huggingface.co/papers/2106.08254) introduces a self-supervised vision representation model inspired by BERT. BEiT pre-trains Vision Transformers by predicting visual tokens from masked image patches, and this approach outperforms supervised pre-training. A base-size model reaches 83.2% top-1 accuracy on ImageNet-1K, surpassing DeiT trained from scratch, and a large-size BEiT model achieves 86.3% on ImageNet-1K, even outperforming a ViT-L model pre-trained on ImageNet-22K.

The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).

## Usage tips

- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They outperform both the [original model (ViT)](vit) and [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. Check out demo notebooks for inference and fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (just replace [`ViTImageProcessor`] with [`BeitImageProcessor`] and [`ViTForImageClassification`] with [`BeitForImageClassification`]).
- There's also a demo notebook showing how to combine DALL-E's image tokenizer with BEiT for masked image modeling [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
- BEiT models expect each image to have the same size (resolution), so use [`BeitImageProcessor`] to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k, or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position bias is initialized with the shared relative position bias obtained after pre-training. To pre-train a model from scratch, set either the `use_relative_position_bias` or the `use_shared_relative_position_bias` attribute of [`BeitConfig`] to `True` to add position embeddings, as in the sketch below.

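Here's a short sketch of that last tip. Only the config flags named above are set; everything else stays at its default, and the choice of the masked-image-modeling head is illustrative.

```py
# Sketch: enable relative position biases when pre-training BEiT from scratch.
from transformers import BeitConfig, BeitForMaskedImageModeling

config = BeitConfig(
    use_shared_relative_position_bias=True,  # one bias shared across layers, as in pre-training
    use_mask_token=True,                     # needed for the masked image modeling objective
)
model = BeitForMaskedImageModeling(config)
```
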
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/beit_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> BEiT pre-training. Taken from the <a href="https://huggingface.co/papers/2106.08254">original paper.</a> </small>
|
||||
|
||||
### Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
from transformers import BeitForImageClassification
|
||||
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
|
||||
...
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="image-classification", model="microsoft/beit-base-patch16-224-pt22k", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
|
||||
```
|
||||
|
||||
For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.5.1, OS Ubuntu 20.04) with `float16` and
|
||||
`microsoft/beit-base-patch16-224` model, we saw the following improvements during training and inference:
|
||||
```python
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoImageProcessor, AutoModelForImageClassification
|
||||
|
||||
#### Training
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
| num_training_steps | batch_size | image_size | is_cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|
||||
|--------------------|------------|--------------|---------|----------------------------|---------------------------|-------------|----------------------|--------------------|----------------|
|
||||
| 50 | 2 | (1048, 640) | True | 0.984 | 0.746 | 31.975 | 6738.915 | 4319.886 | 55.998 |
|
||||
image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
|
||||
model = AutoModelForImageClassification.from_pretrained("microsoft/beit-base-patch16-224-pt22k", dtype="auto")
|
||||
|
||||
#### Inference
|
||||
inputs = image_processor(image, return_tensors="pt")
|
||||
|
||||
| Image batch size | Eager (s/iter) | Eager CI, % | Eager memory (MB) | SDPA (s/iter) | SDPA CI, % | SDPA memory (MB) | SDPA speedup | SDPA memory saved (%) |
|
||||
|-------------------:|-----------------:|:--------------|--------------------:|----------------:|:-------------|-------------------:|---------------:|----------------------:|
|
||||
| 1 | 0.012 | ±0.3% | 3.76657e+08 | 0.011 | ±0.5% | 3.75739e+08 | 1.05 | 0.244 |
|
||||
| 4 | 0.013 | ±0.1% | 4.03147e+08 | 0.011 | ±0.2% | 3.90554e+08 | 1.178 | 3.225 |
|
||||
| 16 | 0.045 | ±0.1% | 4.96697e+08 | 0.035 | ±0.1% | 4.51232e+08 | 1.304 | 10.076 |
|
||||
| 32 | 0.088 | ±0.1% | 6.24417e+08 | 0.066 | ±0.1% | 5.33488e+08 | 1.325 | 17.044 |
|
||||
with torch.no_grad():
|
||||
logits = model(**inputs).logits
|
||||
|
||||
## Resources
|
||||
predicted_label = logits.argmax(-1).item()
|
||||
print(model.config.id2label[predicted_label])
|
||||
```
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BEiT.
|
||||
|
||||
<PipelineTag pipeline="image-classification"/>
|
||||
|
||||
- [`BeitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
- See also: [Image classification task guide](../tasks/image_classification)
|
||||
|
||||
**Semantic segmentation**
|
||||
|
||||
- [Semantic segmentation task guide](../tasks/semantic_segmentation)
|
||||
|
||||
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BEiT specific outputs

@ -167,3 +102,4 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] BeitForSemanticSegmentation
    - forward

@ -13,131 +13,46 @@ specific language governing permissions and limitations under the License.

rendered properly in your Markdown viewer.

-->
*This model was released on 2019-07-29 and added to Hugging Face Transformers on 2020-11-16 and contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# BertGeneration

[BertGeneration](https://huggingface.co/papers/1907.12461) leverages pretrained BERT checkpoints for sequence-to-sequence tasks using the [`EncoderDecoderModel`] framework. This approach achieves state-of-the-art results in machine translation, text summarization, sentence splitting, and sentence fusion, demonstrating the value of initializing both the encoder and decoder with pretrained models.

Find all the original BERT checkpoints under the [BERT](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc) collection.

<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text2text-generation",
|
||||
model="google/roberta2roberta_L-24_discofuse",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Plants create energy through ")
|
||||
pipeline = pipeline(task="text2text-generation", model="google/bert_for_seq_generation_L-24_bbc_encoder", dtype="auto")
|
||||
pipeline("Plants generate energy through a process known as ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import EncoderDecoderModel, AutoTokenizer
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
|
||||
model = AutoModelForCausalLM.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
|
||||
|
||||
input_ids = tokenizer(
|
||||
"Plants create energy through ", add_special_tokens=False, return_tensors="pt"
|
||||
).input_ids
|
||||
|
||||
outputs = model.generate(input_ids)
|
||||
inputs = tokenizer("Plants generate energy through a process known as ", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create energy through " | transformers run --task text2text-generation --model "google/roberta2roberta_L-24_discofuse" --device 0
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [BitsAndBytesConfig](../quantization/bitsandbytes) to quantize the weights to 4-bit.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import EncoderDecoderModel, AutoTokenizer, BitsAndBytesConfig
|
||||
|
||||
# Configure 4-bit quantization
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_compute_dtype=torch.float16
|
||||
)
|
||||
|
||||
model = EncoderDecoderModel.from_pretrained(
|
||||
"google/roberta2roberta_L-24_discofuse",
|
||||
quantization_config=quantization_config,
|
||||
dtype="auto"
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
|
||||
|
||||
input_ids = tokenizer(
|
||||
"Plants create energy through ", add_special_tokens=False, return_tensors="pt"
|
||||
).input_ids
|
||||
|
||||
outputs = model.generate(input_ids)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- [`BertGenerationEncoder`] and [`BertGenerationDecoder`] should be used in combination with [`EncoderDecoderModel`] for sequence-to-sequence tasks.
|
||||
|
||||
```python
|
||||
from transformers import BertGenerationEncoder, BertGenerationDecoder, BertTokenizer, EncoderDecoderModel
|
||||
|
||||
# leverage checkpoints for Bert2Bert model
|
||||
# use BERT's cls token as BOS token and sep token as EOS token
|
||||
encoder = BertGenerationEncoder.from_pretrained("google-bert/bert-large-uncased", bos_token_id=101, eos_token_id=102)
|
||||
# add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
|
||||
decoder = BertGenerationDecoder.from_pretrained(
|
||||
"google-bert/bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
|
||||
)
|
||||
bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
|
||||
|
||||
# create tokenizer
|
||||
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased")
|
||||
|
||||
input_ids = tokenizer(
|
||||
"This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
|
||||
).input_ids
|
||||
labels = tokenizer("This is a short summary", return_tensors="pt").input_ids
|
||||
|
||||
# train
|
||||
loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
|
||||
loss.backward()
|
||||
```
|
||||
|
||||
- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
|
||||
- No EOS token should be added to the end of the input for most generation tasks.
|
||||
- Use [`BertGenerationEncoder`] and [`BertGenerationDecoder`] with [`EncoderDecoderModel`] for sequence-to-sequence tasks. A generation sketch follows this list.
|
||||
- Summarization, sentence splitting, sentence fusion, and translation don't require special tokens in the input.
|
||||
- Don't add `EOS` tokens to the end of inputs for most generation tasks.
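
Once the `bert2bert` model from the training sketch above exists, generation needs a decoder start token and a pad token on the generation config. A minimal continuation of that example, reusing BERT's `[CLS]`/`[SEP]` convention:

```python
# continues the bert2bert training example above
bert2bert.generation_config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.generation_config.eos_token_id = tokenizer.sep_token_id
bert2bert.generation_config.pad_token_id = tokenizer.pad_token_id

input_ids = tokenizer(
    "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = bert2bert.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```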
|
||||
|
||||
## BertGenerationConfig
|
||||
|
||||
|
@ -13,73 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-03-24 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2019-03-24 and added to Hugging Face Transformers on 2020-11-16 and contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).*
|
||||
|
||||
# BertJapanese
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[BERTJapanese](https://github.com/cl-tohoku/bert-japanese) is a collection of pretrained BERT models for Japanese, developed at Tohoku University and released on Hugging Face. The models follow the original BERT architecture, with base models (12 layers, 768 hidden units, 12 heads) and large models (24 layers, 1024 hidden units, 16 heads). Training was performed on large-scale Japanese corpora such as Wikipedia and the Japanese portion of Common Crawl, with different tokenization strategies including subword and character-based. Multiple versions exist (v1, v2, v3), improving coverage and accuracy for Japanese natural language processing tasks
|
||||
|
||||
## Overview
|
||||
Run the command below to install the Japanese dependencies.
|
||||
|
||||
The BERT models trained on Japanese text.
|
||||
|
||||
There are models with two different tokenization methods:
|
||||
|
||||
- Tokenize with MeCab and WordPiece. This requires some extra dependencies, [fugashi](https://github.com/polm/fugashi) which is a wrapper around [MeCab](https://taku910.github.io/mecab/).
|
||||
- Tokenize into characters.
|
||||
|
||||
To use *MecabTokenizer*, you should `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install
|
||||
from source) to install dependencies.
|
||||
|
||||
See [details on cl-tohoku repository](https://github.com/cl-tohoku/bert-japanese).
|
||||
|
||||
Example of using a model with MeCab and WordPiece tokenization:
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
>>> from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
|
||||
|
||||
>>> ## Input Japanese Text
|
||||
>>> line = "吾輩は猫である。"
|
||||
|
||||
>>> inputs = tokenizer(line, return_tensors="pt")
|
||||
|
||||
>>> print(tokenizer.decode(inputs["input_ids"][0]))
|
||||
[CLS] 吾輩 は 猫 で ある 。 [SEP]
|
||||
|
||||
>>> outputs = bertjapanese(**inputs)
|
||||
```bash
|
||||
pip install transformers["ja"]
|
||||
```
|
||||
|
||||
Example of using a model with Character tokenization:
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
>>> ## Input Japanese Text
|
||||
>>> line = "吾輩は猫である。"
|
||||
|
||||
>>> inputs = tokenizer(line, return_tensors="pt")
|
||||
|
||||
>>> print(tokenizer.decode(inputs["input_ids"][0]))
|
||||
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]
|
||||
|
||||
>>> outputs = bertjapanese(**inputs)
|
||||
pipeline = pipeline(task="fill-mask", model="tohoku-nlp/bert-base-japanese", dtype="auto")
|
||||
pipeline("植物は[MASK]を光合成と呼ばれる過程を通じて作り出します。")
|
||||
```
|
||||
|
||||
This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
<Tip>
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for
|
||||
API reference information.
|
||||
model = AutoModelForMaskedLM.from_pretrained("tohoku-nlp/bert-base-japanese", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese")
|
||||
|
||||
</Tip>
|
||||
inputs = tokenizer("植物は[MASK]を光合成と呼ばれる過程を通じて作り出します。", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BertJapaneseTokenizer
|
||||
|
||||
|
@ -13,25 +13,17 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2018-10-11 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2018-10-11 and added to Hugging Face Transformers on 2020-11-16 and contributed by [thomwolf](https://huggingface.co/thomwolf).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# BERT
|
||||
|
||||
[BERT](https://huggingface.co/papers/1810.04805) is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding. BERT is also very versatile because its learned language representations can be adapted for other NLP tasks by fine-tuning an additional layer or head.
|
||||
|
||||
You can find all the original BERT checkpoints under the [BERT](https://huggingface.co/collections/google/bert-release-64ff5e7a4be99045d1896dbc) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the BERT models in the right sidebar for more examples of how to apply BERT to different language tasks.
|
||||
|
||||
The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BERT](https://huggingface.co/papers/1810.04805) introduces a bidirectional transformer model for language representation, pre-trained using masked language modeling and next sentence prediction. BERT achieves state-of-the-art results across various NLP tasks by fine-tuning with minimal task-specific modifications, significantly improving benchmarks like GLUE, MultiNLI, and SQuAD.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -40,12 +32,7 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="google-bert/bert-base-uncased",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline = pipeline(task="fill-mask", model="google-bert/bert-base-uncased", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
@ -56,41 +43,23 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google-bert/bert-base-uncased",
|
||||
)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"google-bert/bert-base-uncased",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model google-bert/bert-base-uncased --device 0
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- Inputs should be padded on the right because BERT uses absolute position embeddings.
|
||||
- Pad inputs on the right. BERT uses absolute position embeddings.
|
||||
|
||||
## BertConfig
|
||||
|
||||
@ -109,6 +78,12 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
|
||||
|
||||
[[autodoc]] BertTokenizerFast
|
||||
|
||||
## Bert specific outputs
|
||||
|
||||
[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput
|
||||
|
||||
[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput
|
||||
|
||||
## BertModel
|
||||
|
||||
[[autodoc]] BertModel
|
||||
@ -153,7 +128,3 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
|
||||
|
||||
[[autodoc]] BertForQuestionAnswering
|
||||
- forward
|
||||
|
||||
## Bert specific outputs
|
||||
|
||||
[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput
|
||||
|
@ -13,25 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-05-20 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2020-05-20 and added to Hugging Face Transformers on 2020-11-16 and contributed by [dqnguyen](https://huggingface.co/dqnguyen).*
|
||||
|
||||
# BERTweet
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
|
||||
## BERTweet
|
||||
|
||||
[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it's pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
|
||||
|
||||
You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the [BERT](./bert) docs for more examples of how to apply BERTweet to different language tasks.
|
||||
|
||||
The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BERTweet](https://huggingface.co/papers/2005.10200) is a large-scale pre-trained language model for English Tweets, sharing the architecture of BERT-base and trained using the RoBERTa pre-training procedure. It surpasses strong baselines like RoBERTa-base and XLM-R-base, achieving superior results in Part-of-speech tagging, Named-entity recognition, and text classification tasks.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -40,58 +26,37 @@ The example below demonstrates how to predict the `<mask>` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="vinai/bertweet-base",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline("Plants create <mask> through a process known as photosynthesis.")
|
||||
pipeline = pipeline(task="text-classification", model="vinai/bertweet-base", dtype="auto")
|
||||
result = pipeline("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:")
|
||||
print(f"Label: {result[0]['label']}, Score: {result[0]['score']}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"vinai/bertweet-base",
|
||||
)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"vinai/bertweet-base",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
inputs = tokenizer("Plants create <mask> through a process known as photosynthesis.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create <mask> through a process known as photosynthesis." | transformers run --task fill-mask --model vinai/bertweet-base --device 0
|
||||
inputs = tokenizer("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
predicted_class_id = outputs.logits.argmax(dim=-1).item()
|
||||
label = model.config.id2label[predicted_class_id]
|
||||
print(f"Predicted label: {label}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
|
||||
- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.
|
||||
- Use [`AutoTokenizer`] or [`BertweetTokenizer`]. They come preloaded with custom vocabulary for tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Install the [emoji](https://pypi.org/project/emoji/) library too. See the sketch after this list.
|
||||
- Pad inputs on the right (`padding="max_length"`). BERT uses absolute position embeddings.
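
The tokenizer note above is easiest to see on a raw tweet. A minimal sketch; it assumes the slow [`BertweetTokenizer`] (selected with `use_fast=False`) and its `normalization` flag, which relies on the emoji package to rewrite URLs, handles, and emoji into the special tokens the model saw during pretraining.

```py
# pip install emoji
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False, normalization=True)

tweet = "SC has first two presumptive cases of coronavirus, DHEC confirms https://t.co/abc via @user 😢"
# URLs become HTTPURL, user handles become @USER, and emoji become text aliases
print(tokenizer.tokenize(tweet))
```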
|
||||
|
||||
## BertweetTokenizer
|
||||
|
||||
[[autodoc]] BertweetTokenizer
|
||||
|
||||
|
@ -13,24 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-03-30.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white" >
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-03-30 and contributed by [vasudevgupta](https://huggingface.co/vasudevgupta).*
|
||||
|
||||
# BigBird
|
||||
|
||||
[BigBird](https://huggingface.co/papers/2007.14062) is a transformer model built to handle sequence lengths up to 4096 compared to 512 for [BERT](./bert). Traditional transformers struggle with long inputs because attention gets really expensive as the sequence length grows. BigBird fixes this by using a sparse attention mechanism, which means it doesn’t try to look at everything at once. Instead, it mixes in local attention, random attention, and a few global tokens to process the whole input. This combination gives it the best of both worlds. It keeps the computation efficient while still capturing enough of the sequence to understand it well. Because of this, BigBird is great at tasks involving long documents, like question answering, summarization, and genomic applications.
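
To make that mix concrete, here is a toy sketch of a block-level mask that combines a sliding window, a couple of global blocks, and a few random blocks per row. It is for intuition only; the actual implementation gathers key blocks instead of building a dense mask.

```py
import torch

def toy_block_sparse_mask(seq_len=512, block_size=64, window_blocks=3, num_global_blocks=2, num_random_blocks=3):
    """Block-level attention mask: True means the query block attends to the key block."""
    num_blocks = seq_len // block_size
    mask = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)

    # local attention: each block sees itself and its neighbors
    for i in range(num_blocks):
        lo = max(0, i - window_blocks // 2)
        hi = min(num_blocks, i + window_blocks // 2 + 1)
        mask[i, lo:hi] = True

    # global attention: the first blocks see everything and are seen by everything
    mask[:num_global_blocks, :] = True
    mask[:, :num_global_blocks] = True

    # random attention: each block also samples a few random key blocks
    for i in range(num_blocks):
        mask[i, torch.randperm(num_blocks)[:num_random_blocks]] = True

    return mask

mask = toy_block_sparse_mask()
print(f"{mask.float().mean():.1%} of block pairs are attended instead of 100%")
```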
|
||||
|
||||
You can find all the original BigBird checkpoints under the [Google](https://huggingface.co/google?search_models=bigbird) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the BigBird models in the right sidebar for more examples of how to apply BigBird to different language tasks.
|
||||
|
||||
The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BigBird: Transformers for Longer Sequences](https://huggingface.co/papers/2007.14062) introduces a sparse-attention mechanism that reduces the quadratic dependency on sequence length to linear, enabling handling of much longer sequences compared to models like BERT. BigBird combines sparse, global, and random attention to approximate full attention efficiently. This allows it to process sequences up to 8 times longer on similar hardware, improving performance on long document NLP tasks such as question answering and summarization. Additionally, the model supports novel applications in genomics.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -39,12 +26,7 @@ The example below demonstrates how to predict the `[MASK]` token with [`Pipeline
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="fill-mask",
|
||||
model="google/bigbird-roberta-base",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline = pipeline(task="fill-mask", model="google/bigbird-roberta-base", dtype="auto")
|
||||
pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
@ -55,47 +37,26 @@ pipeline("Plants create [MASK] through a process known as photosynthesis.")
|
||||
import torch
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/bigbird-roberta-base",
|
||||
)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"google/bigbird-roberta-base",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
)
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForMaskedLM.from_pretrained("google/bigbird-roberta-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model google/bigbird-roberta-base --device 0
|
||||
inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- Inputs should be padded on the right because BigBird uses absolute position embeddings.
|
||||
- BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
|
||||
- The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.
|
||||
- The sequence length must be divisible by the block size.
|
||||
|
||||
## Resources
|
||||
|
||||
- Read the [BigBird](https://huggingface.co/blog/big-bird) blog post for more details about how its attention works.
|
||||
- Pad inputs on the right. BigBird uses absolute position embeddings.
|
||||
- BigBird supports `original_full` and `block_sparse` attention. Use `original_full` for sequences under 1024 tokens since sparse patterns don't help much with smaller inputs. See the configuration sketch after this list.
|
||||
- Current implementation uses 3-block window size and 2 global blocks. It only supports ITC-implementation and doesn't support `num_random_blocks=0`.
|
||||
- Sequence length must be divisible by the block size.
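
The configuration sketch referenced above. `attention_type`, `block_size`, and `num_random_blocks` are [`BigBirdConfig`] fields that can be overridden at load time; the values here are illustrative defaults, not tuned settings.

```py
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")

model = AutoModelForMaskedLM.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",  # switch to "original_full" for short inputs
    block_size=64,
    num_random_blocks=3,
)

# BigBird falls back to full attention (with a warning) when the input is too short for the sparse pattern
text = "Plants create [MASK] through a process known as photosynthesis. " * 150
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```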
|
||||
|
||||
## BigBirdConfig
|
||||
|
||||
@ -156,3 +117,4 @@ print(f"The predicted token is: {predicted_token}")
|
||||
|
||||
[[autodoc]] BigBirdForQuestionAnswering
|
||||
- forward
|
||||
|
||||
|
@ -13,26 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-05-07.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2020-07-28 and added to Hugging Face Transformers on 2021-05-07 and contributed by [vasudevgupta](https://huggingface.co/vasudevgupta).*
|
||||
|
||||
# BigBirdPegasus
|
||||
|
||||
[BigBirdPegasus](https://huggingface.co/papers/2007.14062) is an encoder-decoder (sequence-to-sequence) transformer model for long-input summarization. It extends the [BigBird](./big_bird) architecture with an additional pretraining objective borrowed from [Pegasus](./pegasus) called gap sequence generation (GSG). Whole sentences are masked and the model has to fill in the gaps in the document. BigBirdPegasus's ability to keep track of long contexts makes it effective at summarizing lengthy inputs, surpassing the performance of base Pegasus models.
|
||||
|
||||
You can find all the original BigBirdPegasus checkpoints under the [Google](https://huggingface.co/google/models?search=bigbird-pegasus) organization.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta).
|
||||
>
|
||||
> Click on the BigBirdPegasus models in the right sidebar for more examples of how to apply BigBirdPegasus to different language tasks.
|
||||
|
||||
The example below demonstrates how to summarize text with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[BigBird: Transformers for Longer Sequences](https://huggingface.co/papers/2007.14062) introduces a sparse-attention mechanism that reduces the quadratic dependency on sequence length to linear, enabling handling of much longer sequences compared to models like BERT. BigBird combines sparse, global, and random attention to approximate full attention efficiently. This allows it to process sequences up to 8 times longer on similar hardware, improving performance on long document NLP tasks such as question answering and summarization. The model is also a universal approximator of sequence functions and Turing complete, preserving the capabilities of full attention models. Additionally, BigBird explores applications in genomics data.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,16 +26,8 @@ The example below demonstrates how to summarize text with [`Pipeline`], [`AutoMo
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="summarization",
|
||||
model="google/bigbird-pegasus-large-arxiv",
|
||||
dtype=torch.float32,
|
||||
device=0
|
||||
)
|
||||
pipeline("""Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
|
||||
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
|
||||
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle.""")
|
||||
pipeline = pipeline(task="summarization", model="google/bigbird-pegasus-large-arxiv", dtype="auto")
|
||||
pipeline("Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet. Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems. These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -58,82 +35,31 @@ This energy reserve allows them to grow, develop leaves, produce flowers, bear f
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("google/bigbird-pegasus-large-arxiv", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
|
||||
|
||||
input_text = """Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
text="""
|
||||
Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
|
||||
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
|
||||
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers">
|
||||
|
||||
```bash
|
||||
echo -e "Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet. Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts." | transformers run --task summarization --model google/bigbird-pegasus-large-arxiv --device 0
|
||||
"""
|
||||
inputs = tokenizer(text, return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import BitsAndBytesConfig, AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_compute_dtype=torch.bfloat16,
|
||||
bnb_4bit_quant_type="nf4"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/bigbird-pegasus-large-arxiv"
|
||||
)
|
||||
|
||||
input_text = """Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet.
|
||||
Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts. In the presence of light, plants absorb carbon dioxide from the atmosphere through small pores in their leaves called stomata, and take in water from the soil through their root systems.
|
||||
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
|
||||
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle."""
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- BigBirdPegasus also uses the [`PegasusTokenizer`].
|
||||
- Inputs should be padded on the right because BigBird uses absolute position embeddings.
|
||||
- BigBirdPegasus supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
|
||||
- The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.
|
||||
- The sequence length must be divisible by the block size.
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Understanding BigBird's Block Sparse Attention](https://huggingface.co/blog/big-bird) blog post for more details about how BigBird's attention works.
|
||||
- BigBirdPegasus uses [`PegasusTokenizer`].
|
||||
- Pad inputs on the right. BigBird uses absolute position embeddings.
|
||||
- BigBirdPegasus supports `original_full` and `block_sparse` attention. Use `original_full` for sequences under 1024 tokens since sparse patterns don't help much with smaller inputs.
|
||||
- Current implementation uses 3-block window size and 2 global blocks. It only supports ITC-implementation and doesn't support `num_random_blocks=0`.
|
||||
- Sequence length must be divisible by the block size.
|
||||
|
||||
## BigBirdPegasusConfig
|
||||
|
||||
@ -164,3 +90,4 @@ Read the [Understanding BigBird's Block Sparse Attention](https://huggingface.co
|
||||
|
||||
[[autodoc]] BigBirdPegasusForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -13,26 +13,17 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-10-19 and added to Hugging Face Transformers on 2022-12-05.*
|
||||
*This model was released on 2022-10-19 and added to Hugging Face Transformers on 2022-12-05 and contributed by [kamalkraj](https://huggingface.co/kamalkraj).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# BioGPT
|
||||
|
||||
[BioGPT](https://huggingface.co/papers/2210.10341) is a generative Transformer model based on [GPT-2](./gpt2) and pretrained on 15 million PubMed abstracts. It is designed for biomedical language tasks.
|
||||
|
||||
You can find all the original BioGPT checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=biogpt) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the BioGPT models in the right sidebar for more examples of how to apply BioGPT to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate biomedical text with [`Pipeline`], [`AutoModel`], and also from the command line.
|
||||
[BioGPT](https://huggingface.co/papers/2210.10341) is a domain-specific generative Transformer language model designed for biomedical text generation and mining. Trained on 15M PubMed abstracts, BioGPT excels in various biomedical NLP tasks, outperforming previous models. It achieves notable F1 scores of 44.98%, 38.42%, and 40.76% on BC5CDR, KD-DTI, and DDI end-to-end relation extraction tasks, respectively, and sets a new record with 78.2% accuracy on PubMedQA. Additionally, BioGPT demonstrates superior text generation capabilities, producing fluent descriptions for biomedical terms.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,14 +32,8 @@ The example below demonstrates how to generate biomedical text with [`Pipeline`]
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
generator = pipeline(
|
||||
task="text-generation",
|
||||
model="microsoft/biogpt",
|
||||
dtype=torch.float16,
|
||||
device=0,
|
||||
)
|
||||
result = generator("Ibuprofen is best used for", truncation=True, max_length=50, do_sample=True)[0]["generated_text"]
|
||||
print(result)
|
||||
pipeline = pipeline(task="text-generation", model="microsoft/biogpt", dtype="auto")
|
||||
pipeline("Ibuprofen is best used for ")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -58,77 +43,21 @@ print(result)
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"microsoft/biogpt",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
input_text = "Ibuprofen is best used for"
|
||||
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
|
||||
with torch.no_grad():
|
||||
generated_ids = model.generate(**inputs, max_length=50)
|
||||
|
||||
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
|
||||
print(output)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Ibuprofen is best used for" | transformers run --task text-generation --model microsoft/biogpt --device 0
|
||||
inputs = tokenizer("Ibuprofen is best used for ", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
## Usage tips
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bit precision.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
|
||||
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.bfloat16,
|
||||
bnb_4bit_use_double_quant=True
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"microsoft/BioGPT-Large",
|
||||
quantization_config=bnb_config,
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto"
|
||||
)
|
||||
|
||||
input_text = "Ibuprofen is best used for"
|
||||
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
|
||||
with torch.no_grad():
|
||||
generated_ids = model.generate(**inputs, max_length=50)
|
||||
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
|
||||
print(output)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Pad inputs on the right because BioGPT uses absolute position embeddings.
|
||||
- BioGPT can reuse previously computed key-value attention pairs. Access this feature with the [past_key_values](https://huggingface.co/docs/transformers/main/en/model_doc/biogpt#transformers.BioGptModel.forward.past_key_values) parameter in [`BioGPTModel.forward`].
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"microsoft/biogpt",
|
||||
attn_implementation="eager"
|
||||
)
```
|
||||
- Pad inputs on the right. BioGPT uses absolute position embeddings.
|
||||
- BioGPT reuses previously computed key-value attention pairs. Access this feature with the `past_key_values` parameter in [`BioGptModel.forward`]. See the sketch after this list.
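
A minimal sketch of reusing the cache during incremental decoding; the prompt is illustrative, and greedy selection keeps the example short.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

inputs = tokenizer("Ibuprofen is best used for", return_tensors="pt")

# the first forward pass returns the key-value cache
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values

# on the next step, feed only the newly chosen token together with the cache
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
with torch.no_grad():
    out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
print(tokenizer.decode(next_token[0]))
```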
|
||||
|
||||
## BioGptConfig
|
||||
|
||||
@ -148,7 +77,7 @@ print(output)
|
||||
|
||||
[[autodoc]] BioGptForCausalLM
|
||||
- forward
|
||||
|
||||
|
||||
## BioGptForTokenClassification
|
||||
|
||||
[[autodoc]] BioGptForTokenClassification
|
||||
@ -158,3 +87,4 @@ print(output)
|
||||
|
||||
[[autodoc]] BioGptForSequenceClassification
|
||||
- forward
|
||||
|
||||
|
@ -13,43 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-12-24 and added to Hugging Face Transformers on 2022-12-07.*
|
||||
*This model was released on 2019-12-24 and added to Hugging Face Transformers on 2022-12-07 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# Big Transfer (BiT)
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Big Transfer (BiT): General Visual Representation Learning](https://huggingface.co/papers/1912.11370) proposes a method for scaling up pre-training of ResNetv2 architectures. This approach, called Big Transfer (BiT), combines specific components and uses a simple heuristic for transfer learning, achieving strong performance across over 20 datasets. BiT demonstrates robustness across various data regimes, from 1 example per class to 1M total examples. It achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT reaches 76.8% on ILSVRC-2012 with 10 examples per class and 97.0% on CIFAR-10 with 10 examples per class. The paper includes a detailed analysis of the key components contributing to high transfer performance.
|
||||
|
||||
<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="image-classification", model="google/bit-50", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("google/bit-50")
model = AutoModelForImageClassification.from_pretrained("google/bit-50", dtype="auto")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

</hfoption>
</hfoptions>

## Overview

The BiT model was proposed in [Big Transfer (BiT): General Visual Representation Learning](https://huggingface.co/papers/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby.
BiT is a simple recipe for scaling up pre-training of [ResNet](resnet)-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning.

The abstract from the paper is the following:

*Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.*

This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/google-research/big_transfer).

## Usage tips

- BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494), 2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant impact on transfer learning.
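
The usage tip above boils down to two swaps relative to a standard ResNet. Here's an illustrative sketch of a weight-standardized convolution followed by group normalization; it mirrors the idea rather than the exact `BitModel` implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose filters are standardized to zero mean and unit variance before each forward pass."""

    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride, self.padding, self.dilation, self.groups)

block = nn.Sequential(
    WSConv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=64),  # replaces BatchNorm in BiT
    nn.ReLU(),
)
print(block(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 224, 224])
```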

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BiT.

<PipelineTag pipeline="image-classification"/>

- [`BitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
## BitConfig
|
||||
|
||||
@ -74,3 +80,4 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
|
||||
[[autodoc]] BitForImageClassification
|
||||
- forward
|
||||
|
||||
|

@ -17,6 +17,14 @@ rendered properly in your Markdown viewer.

# BitNet

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="microsoft/BitNet-b1.58-3B", dtype="auto")
pipeline("The future of artificial intelligence is")
```

## Overview

Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).

@ -38,22 +46,22 @@ Several versions of the model weights are available on Hugging Face:

### Model Details

* **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
    * Uses Rotary Position Embeddings (RoPE).
    * Uses squared ReLU (ReLU²) activation in FFN layers.
    * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
    * No bias terms in linear or normalization layers.
* **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8).
    * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
    * Activations are quantized to 8-bit integers using absmax quantization (per-token).
    * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
* **Parameters:** ~2 Billion
* **Training Tokens:** 4 Trillion
* **Context Length:** Maximum sequence length of **4096 tokens**.
    * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
* **Training Stages:**
    1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
    2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
    3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
* **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).
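
The W1.58A8 scheme above boils down to two small operations. Here's a minimal numerical sketch of absmean ternary weight quantization and per-token absmax activation quantization; it only illustrates the math and isn't the `BitLinear` implementation shipped with the checkpoints.

```py
import torch

def absmean_ternary(weight: torch.Tensor) -> torch.Tensor:
    # Scale by the mean absolute value, then round each weight to {-1, 0, +1}.
    scale = weight.abs().mean().clamp(min=1e-5)
    return (weight / scale).round().clamp(-1, 1) * scale

def absmax_int8_per_token(x: torch.Tensor) -> torch.Tensor:
    # Quantize each token (last dim) to 8-bit integers using its own absmax.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    return (x / scale).round().clamp(-128, 127) * scale

w = torch.randn(128, 128)
x = torch.randn(4, 128)
y = absmax_int8_per_token(x) @ absmean_ternary(w).T  # fake-quantized matmul for illustration
```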

## Usage tips

@ -13,53 +13,44 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2020-04-28 and added to Hugging Face Transformers on 2021-01-05 and contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).*

# Blenderbot Small

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[Blender](https://huggingface.co/papers/2004.13637) focuses on building open-domain chatbots by emphasizing the importance of various conversational skills beyond just scaling model parameters and data size. The model variants include 90M, 2.7B, and 9.4B parameters, demonstrating that with the right training data and generation strategies, large-scale models can learn to provide engaging talking points, listen, display knowledge, empathy, and personality, while maintaining a consistent persona. Human evaluations indicate that the best models outperform existing approaches in terms of engagingness and humanness in multi-turn dialogues. The paper also analyzes failure cases to highlight the limitations of the work. The authors' code can be found [here](https://github.com/facebookresearch/ParlAI).

Note that [`BlenderbotSmallModel`] and [`BlenderbotSmallForConditionalGeneration`] are only used in combination with the checkpoint [facebook/blenderbot-90M](https://huggingface.co/facebook/blenderbot-90M). Larger Blenderbot checkpoints should instead be used with [`BlenderbotModel`] and [`BlenderbotForConditionalGeneration`].

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="facebook/blenderbot_small-90M", dtype="auto")
pipeline("Plants create energy through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot_small-90M", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot_small-90M")

inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

</hfoption>
</hfoptions>

## Usage tips

- Pad inputs on the right. Blenderbot Small uses absolute position embeddings.
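
Right padding only matters once you batch prompts of different lengths. A minimal sketch using the same checkpoint as the examples above; setting `padding_side` explicitly is a safeguard in case a checkpoint configures it differently.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot_small-90M", padding_side="right")
batch = tokenizer(
    ["Hi, how are you?", "Plants create energy through a process known as photosynthesis."],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # zeros sit at the end of the shorter sequence
```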

## Resources

- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)

## BlenderbotSmallConfig

@ -91,3 +82,4 @@ the left.

[[autodoc]] BlenderbotSmallForCausalLM
    - forward

@ -13,69 +13,46 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2020-04-28 and added to Hugging Face Transformers on 2020-11-16 and contributed by [sshleifer](https://huggingface.co/sshleifer).*

# Blenderbot

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[Blender](https://huggingface.co/papers/2004.13637) focuses on building open-domain chatbots by emphasizing the importance of various conversational skills beyond just scaling model parameters and data size. The model variants include 90M, 2.7B, and 9.4B parameters, demonstrating that with the right training data and generation strategies, large-scale models can learn to provide engaging talking points, listen, display knowledge, empathy, and personality, while maintaining a consistent persona. Human evaluations indicate that the best models outperform existing approaches in terms of engagingness and humanness in multi-turn dialogues. The paper also analyzes failure cases to highlight the limitations of the work.

Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot). The authors' code can be found [here](https://github.com/facebookresearch/ParlAI).

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="facebook/blenderbot-400M-distill", dtype="auto")
pipeline("Plants create energy through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

</hfoption>
</hfoptions>

## Usage tips

- Pad inputs on the right. Blenderbot uses absolute position embeddings.
- Blenderbot uses a standard seq2seq transformer architecture.
- This is the default Blenderbot model class. Smaller checkpoints like `facebook/blenderbot_small_90M` have different architectures and need [`BlenderbotSmall`].

An example:

```python
>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

>>> mname = "facebook/blenderbot-400M-distill"
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
>>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([UTTERANCE], return_tensors="pt")
>>> reply_ids = model.generate(**inputs)
>>> print(tokenizer.batch_decode(reply_ids))
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
```
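
The third tip above matters when you skip the Auto classes and load checkpoints explicitly. A short sketch using checkpoints already mentioned in these docs:

```py
from transformers import BlenderbotForConditionalGeneration, BlenderbotSmallForConditionalGeneration

# Larger checkpoints, such as the 400M distilled one, pair with the Blenderbot classes.
model = BlenderbotForConditionalGeneration.from_pretrained("facebook/blenderbot-400M-distill")

# The 90M checkpoint keeps the older architecture and needs the BlenderbotSmall classes.
small_model = BlenderbotSmallForConditionalGeneration.from_pretrained("facebook/blenderbot_small-90M")
```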

## Resources

- [Causal language modeling task guide](../tasks/language_modeling)
- [Translation task guide](../tasks/translation)
- [Summarization task guide](../tasks/summarization)

## BlenderbotConfig

@ -109,3 +86,4 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an

[[autodoc]] BlenderbotForCausalLM
    - forward

@ -13,49 +13,48 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2023-01-30 and added to Hugging Face Transformers on 2023-02-09 and contributed by [nielsr](https://huggingface.co/nielsr).*

# BLIP-2

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[BLIP-2](https://huggingface.co/papers/2301.12597) bootstraps vision-language pre-training using frozen image encoders and large language models. It employs a lightweight, 12-layer Transformer encoder to bridge the modality gap, achieving state-of-the-art results on various vision-language tasks. Specifically, BLIP-2 surpasses Flamingo by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. The model also demonstrates strong zero-shot image-to-text generation capabilities following natural language instructions.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/blip2_architecture.jpg"
alt="drawing" width="600"/>

<small> BLIP-2 architecture. Taken from the <a href="https://huggingface.co/papers/2301.12597">original paper.</a> </small>

The original code can be found [here](https://github.com/salesforce/LAVIS/tree/5ee63d688ba4cebff63acee04adaef2dee9af207).

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="visual-question-answering", model="Salesforce/blip2-opt-2.7b", dtype="auto")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
pipeline(question="What is shown in this image?", image=url)
```

</hfoption>
<hfoption id="AutoModel">

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = AutoModelForVisualQuestionAnswering.from_pretrained("Salesforce/blip2-opt-2.7b", dtype="auto")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

question = "Question: What is shown in this image? Answer:"
inputs = processor(images=image, text=question, return_tensors="pt")

output = model.generate(**inputs)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```

</hfoption>
</hfoptions>

## Usage tips

- BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, it's recommended to use the [`generate`] method.
- Use [`Blip2Processor`] to prepare images for the model and to decode the predicted token IDs back to text.

> [!NOTE]
> BLIP models after release v4.46 raise warnings about adding `processor.num_query_tokens = {{num_query_tokens}}` and expanding the model embeddings layer to add the special `<image>` token. Add these attributes to the processor if you own the model checkpoint, or open a PR if you don't. With the attributes set, BLIP adds the number of query tokens required per image and expands the text with as many `<image>` placeholders as there are query tokens. That's usually around 500 tokens per image, so make sure the text isn't truncated, otherwise merging the embeddings will fail.
> The attributes can be obtained from the model config as `model.config.num_query_tokens`, and the model embeddings can be expanded by following [this link](https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042).
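
Assuming you own the checkpoint, one way to apply that fix locally looks roughly like this; the linked gist is the authoritative walkthrough, so treat this as a sketch.

```py
from transformers import Blip2ForConditionalGeneration, Blip2Processor

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

# Tell the processor how many query tokens each image expands to.
processor.num_query_tokens = model.config.num_query_tokens

# Register the <image> placeholder and grow the embedding matrix to match.
processor.tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)
```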

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLIP-2.

- Demo notebooks for BLIP-2 for image captioning, visual question answering (VQA) and chat-like conversations can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BLIP-2).

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## Blip2Config

@ -109,3 +108,4 @@ If you're interested in submitting a resource to be included here, please feel f

## Blip2VisionModelWithProjection

[[autodoc]] Blip2VisionModelWithProjection

@ -13,77 +13,49 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2022-01-28 and added to Hugging Face Transformers on 2022-12-21 and contributed by [ybelkada](https://huggingface.co/ybelkada).*

# BLIP

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://huggingface.co/papers/2201.12086) proposes a new VLP framework that excels in both vision-language understanding and generation tasks. BLIP enhances the use of noisy web data through a bootstrapping process involving synthetic caption generation and noise filtering. This approach leads to state-of-the-art results in image-text retrieval, image captioning, and visual question answering, with notable improvements in recall@1, CIDEr, and VQA scores. Additionally, BLIP demonstrates strong generalization to video-language tasks in a zero-shot setting.

All the original BLIP checkpoints are available under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.

The example below demonstrates how to run visual question answering with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="visual-question-answering", model="Salesforce/blip-vqa-base", dtype="auto")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
pipeline(question="What is shown in this image?", image=url)
```

</hfoption>
<hfoption id="AutoModel">

```py
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering

processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = AutoModelForVisualQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base", dtype="auto")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

question = "What is shown in this image?"
inputs = processor(images=image, text=question, return_tensors="pt")

output = model.generate(**inputs)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```

</hfoption>
</hfoptions>

## Resources

Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) to learn how to fine-tune BLIP for image captioning on a custom dataset.
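
For captioning at inference time (as opposed to fine-tuning), a minimal sketch follows the same processor-and-generate pattern as the VQA examples above; it uses the public `Salesforce/blip-image-captioning-base` checkpoint rather than the VQA one.

```py
import requests
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", dtype="auto")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# No text prompt means unconditional captioning; pass text= to prefix the caption.
inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```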

## BlipConfig

[[autodoc]] BlipConfig

@ -124,11 +96,6 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam

[[autodoc]] BlipTextModel
    - forward

## BlipTextLMHeadModel

[[autodoc]] BlipTextLMHeadModel
    - forward

## BlipVisionModel

[[autodoc]] BlipVisionModel

@ -148,3 +115,9 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam

[[autodoc]] BlipForQuestionAnswering
    - forward

## BlipTextLMHeadModel

[[autodoc]] BlipTextLMHeadModel
    - forward

@ -17,46 +17,36 @@ rendered properly in your Markdown viewer.

# BLOOM

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[BLOOM](https://huggingface.co/papers/2211.05100) is a 176-billion parameter open-access large language model built collaboratively by hundreds of researchers to promote wider accessibility of LLM technology. It is a decoder-only Transformer trained on the ROOTS corpus, which includes text from hundreds of sources across 46 natural and 13 programming languages. BLOOM demonstrates competitive performance across diverse benchmarks, with further gains achieved through multitask prompted finetuning. The model and code are publicly released under the Responsible AI License to support open research and applications.

BLOOM was developed through the [BigScience Workshop](https://bigscience.huggingface.co/), an open science initiative where researchers pool their time and resources to achieve a higher collective impact. Its architecture is essentially similar to GPT-3 (an auto-regressive model for next-token prediction). Several smaller versions of the model have been trained on the same dataset. BLOOM is available in the following versions:

- [bloom-560m](https://huggingface.co/bigscience/bloom-560m)
- [bloom-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [bloom-1b7](https://huggingface.co/bigscience/bloom-1b7)
- [bloom-3b](https://huggingface.co/bigscience/bloom-3b)
- [bloom-7b1](https://huggingface.co/bigscience/bloom-7b1)
- [bloom](https://huggingface.co/bigscience/bloom) (176B parameters)

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="bigscience/bloom-560m", dtype="auto")
pipeline("Plants create energy through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```

</hfoption>
</hfoptions>

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLOOM. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

<PipelineTag pipeline="text-generation"/>

- [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).

See also:

- [Causal language modeling task guide](../tasks/language_modeling)
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)

⚡️ Inference

- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).
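
For checkpoints that don't fit on a single GPU, a minimal starting point is to let Accelerate shard the weights across available devices. Treat this as a sketch rather than a tuned inference setup; the blogs above cover the optimized paths.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires the Accelerate library; device_map="auto" splits layers across available
# GPUs and spills the remainder to CPU when needed.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1", device_map="auto", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")

inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```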

⚙️ Training

- A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).

## BloomConfig

@ -92,3 +82,4 @@ See also:

[[autodoc]] BloomForQuestionAnswering
    - forward

@ -13,13 +13,11 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2025-10-07 and contributed by [itazap](https://huggingface.co/itazap).*

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC
">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

@ -27,62 +25,36 @@ rendered properly in your Markdown viewer.

# Byte Latent Transformer (BLT)

[Byte Latent Transformer](https://huggingface.co/papers/2412.09871) is a byte-level LLM architecture that matches tokenization-based LLM performance at scale. It encodes bytes into dynamically sized patches based on entropy, optimizing compute and model capacity where data complexity is higher. This approach improves inference efficiency and robustness, with the first flop-controlled scaling study up to 8B parameters and 4T training bytes. BLT demonstrates better scaling than tokenization-based models by dynamically selecting long patches for predictable data, enhancing reasoning and long-tail generalization.

## Usage tips

- **Dual Model Architecture**: BLT consists of two separately trained models:
    - **Patcher (Entropy Model)**: A smaller transformer model that predicts byte-level entropy to determine patch boundaries and segment input.
    - **Main Transformer Model**: The primary model that processes the patches through a Local Encoder, Global Transformer, and Local Decoder.

- **Dynamic Patching**: The model uses entropy-based dynamic patching where:
    - High-entropy regions (complex data) get shorter patches with more computational attention.
    - Low-entropy regions (predictable data) get longer patches for efficiency.
    - This allows the model to allocate compute where it's most needed.

- **Local Encoder**: Processes byte sequences with cross-attention to patch embeddings.
- **Global Transformer**: Processes patch-level representations with full attention across patches.
- **Local Decoder**: Generates output with cross-attention back to the original byte sequence.

- **Byte-Level Tokenizer**: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID, so there is no vocabulary to learn (a minimal sketch of this mapping follows the list).
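
Here's that sketch. The offset is hypothetical and only exists to leave room for special tokens such as BOS/EOS/PAD, so check the checkpoint's tokenizer for the real values.

```py
text = "hello 🤗"

OFFSET = 4  # hypothetical offset reserving IDs for special tokens
token_ids = [b + OFFSET for b in text.encode("utf-8")]

print(token_ids)  # one ID per byte; multi-byte characters span several IDs
print(bytes(i - OFFSET for i in token_ids).decode("utf-8"))  # round-trips back to the text
```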

The model can be loaded with either the [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="itazap/blt-1b-hf", dtype="auto")
pipeline("Plants generate energy through a process known as ")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("itazap/blt-1b-hf", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")

inputs = tokenizer("Plants generate energy through a process known as ", return_tensors='pt', return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

</hfoption>
</hfoptions>

The original code can be found [here](https://github.com/facebookresearch/blt).

## BltConfig

@ -95,3 +67,4 @@ The original code can be found [here](<https://github.com/facebookresearch/blt>)

[[autodoc]] BltForCausalLM
    - forward

@ -13,48 +13,49 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2020-10-20 and added to Hugging Face Transformers on 2023-06-20 and contributed by [stefan-it](https://huggingface.co/stefan-it).*

> [!WARNING]
> This model is in maintenance mode only, we do not accept any new PRs changing its code.
>
> If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. You can do so by running the following command: `pip install -U transformers==4.30.0`.

# BORT

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[BORT](https://huggingface.co/papers/2010.10499) extracts an optimal subset of architectural parameters from BERT, significantly reducing its size to 5.5% of BERT-large's effective size and 16% of its net size. BORT can be pretrained in 288 GPU hours, which is 1.2% of the time required for RoBERTa-large and 33% of BERT-large. It is 7.9x faster on a CPU and outperforms other compressed and some non-compressed variants, achieving performance improvements of 0.3% to 31% on various NLU benchmarks.

The original code can be found [here](https://github.com/alexa/bort/).

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="fill-mask", model="amazon/bort", dtype="auto")
pipeline("Plants create [MASK] through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("amazon/bort", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("amazon/bort")

inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
outputs = model(**inputs)
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
print(f"Predicted word: {predicted_word}")
```

</hfoption>
</hfoptions>

## Usage tips

- BORT's model architecture is based on BERT. Refer to [BERT's documentation page](bert) for the model's API reference and usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer. Check RoBERTa's documentation for API reference and usage examples.
- BORT requires a specific fine-tuning algorithm called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology), which isn't open-sourced yet. Implementing it would be very useful for the community and would make BORT fine-tuning work.
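
To see the split described in the first two tips (BERT-style model, RoBERTa-style tokenizer), load the checkpoint with the Auto classes and inspect what they resolve to. This is only a quick check; the resolved class names and mask token come from the checkpoint itself.

```py
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amazon/bort")
model = AutoModelForMaskedLM.from_pretrained("amazon/bort")

# Print the concrete classes and the mask token this checkpoint actually uses.
print(type(model).__name__, type(tokenizer).__name__)
print(tokenizer.mask_token)
```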

@ -13,124 +13,44 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2022-06-17 and added to Hugging Face Transformers on 2023-01-25 and contributed by [anahita-b](https://huggingface.co/anahita-b), [Tile](https://huggingface.co/Tile), and [shaoyent](https://huggingface.co/shaoyent).*

# BridgeTower

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[BridgeTower](https://huggingface.co/papers/2206.08657) introduces bridge layers connecting the top layers of uni-modal encoders to each layer of the cross-modal encoder, enabling effective bottom-up cross-modal alignment and fusion. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various vision-language tasks, outperforming previous models with similar pre-training data and minimal additional parameters and computational costs. When scaled, it surpasses models trained on much larger datasets. The paper was accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/bridgetower_architecture%20.jpg"
alt="drawing" width="600"/>

<small> BridgeTower architecture. Taken from the <a href="https://huggingface.co/papers/2206.08657">original paper.</a> </small>

The original code can be found [here](https://github.com/microsoft/BridgeTower).

<hfoptions id="usage">
<hfoption id="BridgeTowerForContrastiveLearning">

```py
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, BridgeTowerForContrastiveLearning

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["An image of a cat walking in the snow", "A football player scoring a goal"]

processor = AutoProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc", dtype="auto")

scores = dict()
for text in texts:
    # prepare inputs
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    # get the similarity score by computing cosine similarity between the embeddings
    score = torch.cosine_similarity(outputs.image_embeds, outputs.text_embeds, dim=1).item()
    scores[text] = score
    print(f"Text: '{text}' - Score: {score:.4f}")

best_text = max(scores, key=scores.get)
print(f"\nBest matching text: '{best_text}' with score: {scores[best_text]:.4f}")
```

</hfoption>
</hfoptions>

## Usage tips and examples

BridgeTower consists of a visual encoder, a textual encoder, and a cross-modal encoder with multiple lightweight bridge layers. The goal of this approach is to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder. In principle, any visual, textual, or cross-modal encoder can be used in the proposed architecture.

The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImageProcessor`] into a single instance to encode the text and prepare the images.

The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].

```python
>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
>>> import requests
>>> from PIL import Image

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]

>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
>>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")

>>> # forward pass
>>> scores = dict()
>>> for text in texts:
...     # prepare inputs
...     encoding = processor(image, text, return_tensors="pt")
...     outputs = model(**encoding)
...     scores[text] = outputs.logits[0, 1].item()
```

The following example shows how to run masked language modeling using [`BridgeTowerProcessor`] and [`BridgeTowerForMaskedLM`].

```python
>>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> text = "a <mask> looking out of the window"

>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
>>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")

>>> # prepare inputs
>>> encoding = processor(image, text, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**encoding)

>>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())

>>> print(results)
.a cat looking out of the window.
```

Tips:

- This implementation of BridgeTower uses [`RobertaTokenizer`] to generate text embeddings and OpenAI's CLIP/ViT model to compute visual embeddings.
- Checkpoints for the pre-trained [bridgeTower-base](https://huggingface.co/BridgeTower/bridgetower-base) and [bridgetower masked language modeling and image text matching](https://huggingface.co/BridgeTower/bridgetower-base-itm-mlm) models are released.
- Please refer to [Table 5](https://huggingface.co/papers/2206.08657) for BridgeTower's performance on image retrieval and other downstream tasks.
- The PyTorch version of this model is only available in torch 1.10 and higher.

## BridgeTowerConfig

@ -178,3 +98,4 @@ Tips:

[[autodoc]] BridgeTowerForImageAndTextRetrieval
    - forward
@ -9,83 +9,38 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
*This model was released on 2021-08-10 and added to Hugging Face Transformers on 2023-09-15.*
|
||||
*This model was released on 2021-08-10 and added to Hugging Face Transformers on 2023-09-15 and contributed by [jinho8345](https://huggingface.co/jinho8345).*
|
||||
|
||||
# BROS
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[BROS](https://huggingface.co/papers/2108.04539) is a pre-trained language model designed for key information extraction (KIE) from document images by focusing on the spatial relationships of text rather than visual features. It encodes the relative 2D positions of text elements and uses an area-masking pre-training strategy to learn spatial-textual dependencies from unlabeled documents. Unlike vision-text models, BROS effectively integrates text and layout information alone, achieving competitive or superior results on major KIE benchmarks (FUNSD, SROIE*, CORD, SciTSR). The model also addresses two key challenges in KIE—handling incorrect text order and learning efficiently with limited labeled data.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="BrosForTokenClassification">
|
||||
|
||||
The BROS model was proposed in [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://huggingface.co/papers/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park.
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoProcessor, AutoModelForTokenClassification
|
||||
|
||||
BROS stands for *BERT Relying On Spatiality*. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encodes relative spatial information instead of absolute spatial information.
|
||||
processor = AutoProcessor.from_pretrained("jinho8345/bros-base-uncased")
|
||||
model = AutoModelForTokenClassification.from_pretrained("jinho8345/bros-base-uncased", dtype="auto")
|
||||
|
||||
It is pre-trained with two objectives: a token-masked language modeling objective (TMLM) used in BERT, and a novel area-masked language modeling objective (AMLM).
|
||||
In TMLM, tokens are randomly masked, and the model predicts the masked tokens using spatial information and other unmasked tokens.
|
||||
AMLM is a 2D version of TMLM. It randomly masks text tokens and predicts with the same information as TMLM, but it masks text blocks (areas).
|
||||
text = "Plants create energy through a process known as photosynthesis."
|
||||
encoding = processor.tokenizer(text, add_special_tokens=False, return_tensors="pt")
|
||||
bbox = torch.tensor([[[0, 0, 1, 1]]]).repeat(1, encoding["input_ids"].shape[-1], 1)
|
||||
encoding["bbox"] = bbox
|
||||
|
||||
`BrosForTokenClassification` has a simple linear layer on top of BrosModel. It predicts the label of each token.
|
||||
`BrosSpadeEEForTokenClassification` has an `initial_token_classifier` and `subsequent_token_classifier` on top of BrosModel. `initial_token_classifier` predicts the first token of each entity, and `subsequent_token_classifier` predicts the next token within an entity. `BrosSpadeELForTokenClassification` has an `entity_linker` on top of BrosModel. `entity_linker` predicts the relation between two entities.
|
||||
outputs = model(**encoding)
|
||||
predictions = torch.argmax(outputs.logits, dim=-1)
|
||||
tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
|
||||
|
||||
`BrosForTokenClassification` and `BrosSpadeEEForTokenClassification` essentially perform the same job. However, `BrosForTokenClassification` assumes input tokens are perfectly serialized (which is a very challenging task since they exist in a 2D space), while `BrosSpadeEEForTokenClassification` allows for more flexibility in handling serialization errors because it predicts the next connection token from one token.
|
||||
|
||||
`BrosSpadeELForTokenClassification` performs the intra-entity linking task. It predicts the relation from one token (of one entity) to another token (of another entity) if the two entities share some relation.
|
||||
|
||||
BROS achieves comparable or better results on Key Information Extraction (KIE) benchmarks such as FUNSD, SROIE, CORD, and SciTSR, without relying on explicit visual features.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Key information extraction (KIE) from document images requires understanding the contextual and spatial semantics of texts in two-dimensional (2D) space. Many recent studies try to solve the task by developing pre-trained language models focusing on combining visual features from document images with texts and their layout. On the other hand, this paper tackles the problem by going back to the basic: effective combination of text and layout. Specifically, we propose a pre-trained language model, named BROS (BERT Relying On Spatiality), that encodes relative positions of texts in 2D space and learns from unlabeled documents with area-masking strategy. With this optimized training scheme for understanding texts in 2D space, BROS shows comparable or better performance compared to previous methods on four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without relying on visual features. This paper also reveals two real-world challenges in KIE tasks-(1) minimizing the error from incorrect text ordering and (2) efficient learning from fewer downstream examples-and demonstrates the superiority of BROS over previous methods.*
|
||||
|
||||
This model was contributed by [jinho8345](https://huggingface.co/jinho8345). The original code can be found [here](https://github.com/clovaai/bros).
|
||||
|
||||
## Usage tips and examples
|
||||
|
||||
- [`~transformers.BrosModel.forward`] requires `input_ids` and `bbox` (bounding box). Each bounding box should be in (x0, y0, x1, y1) format (top-left corner, bottom-right corner). Obtaining bounding boxes depends on an external OCR system. The `x` coordinate should be normalized by the document image width, and the `y` coordinate should be normalized by the document image height.
|
||||
|
||||
```python
|
||||
def expand_and_normalize_bbox(bboxes, doc_width, doc_height):
    # here, bboxes are a numpy array

    # Normalize bbox -> 0 ~ 1
    bboxes[:, [0, 2]] = bboxes[:, [0, 2]] / doc_width
    bboxes[:, [1, 3]] = bboxes[:, [1, 3]] / doc_height
|
||||
print("Token predictions:")
|
||||
for token, pred in zip(tokens, predictions[0]):
|
||||
print(f"'{token}' -> Class {pred.item()}")
|
||||
```
|
||||
|
||||
- [`~transformers.BrosForTokenClassification.forward`], [`~transformers.BrosSpadeEEForTokenClassification.forward`], and [`~transformers.BrosSpadeELForTokenClassification.forward`] require not only `input_ids` and `bbox` but also `box_first_token_mask` for loss calculation. This mask filters out non-first tokens of each box. Obtain it by saving the start token indices of bounding boxes when creating `input_ids` from words. Build `box_first_token_mask` with the following code; a hypothetical call is shown after the function.
|
||||
|
||||
```python
|
||||
import itertools

import numpy as np


def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
|
||||
|
||||
box_first_token_mask = np.zeros(max_seq_length, dtype=np.bool_)
|
||||
|
||||
# encode(tokenize) each word from words (list[str])
|
||||
input_ids_list: list[list[int]] = [tokenizer.encode(e, add_special_tokens=False) for e in words]
|
||||
|
||||
# get the length of each box
|
||||
tokens_length_list: list[int] = [len(l) for l in input_ids_list]
|
||||
|
||||
box_end_token_indices = np.array(list(itertools.accumulate(tokens_length_list)))
|
||||
box_start_token_indices = box_end_token_indices - np.array(tokens_length_list)
|
||||
|
||||
# filter out the indices that are out of max_seq_length
|
||||
box_end_token_indices = box_end_token_indices[box_end_token_indices < max_seq_length - 1]
|
||||
if len(box_start_token_indices) > len(box_end_token_indices):
|
||||
box_start_token_indices = box_start_token_indices[: len(box_end_token_indices)]
|
||||
|
||||
# set box_start_token_indices to True
|
||||
box_first_token_mask[box_start_token_indices] = True
|
||||
|
||||
return box_first_token_mask
|
||||
|
||||
```
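A hypothetical call, assuming `words` and their normalized bounding boxes came from an external OCR step; the values are made up for illustration.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jinho8345/bros-base-uncased")

# hypothetical OCR output: words and their normalized (x0, y0, x1, y1) boxes
words = ["Invoice", "No.", "12345", "Date:", "2021-08-10"]
bboxes = [
    [0.10, 0.10, 0.30, 0.15],
    [0.31, 0.10, 0.35, 0.15],
    [0.36, 0.10, 0.50, 0.15],
    [0.10, 0.20, 0.20, 0.25],
    [0.21, 0.20, 0.40, 0.25],
]

box_first_token_mask = make_box_first_token_mask(bboxes, words, tokenizer)
```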
|
||||
|
||||
## Resources
|
||||
|
||||
- Demo scripts can be found [here](https://github.com/clovaai/bros).
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## BrosConfig
|
||||
|
||||
@ -115,3 +70,4 @@ def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
|
||||
|
||||
[[autodoc]] BrosSpadeELForTokenClassification
|
||||
- forward
|
||||
|
||||
|
@ -13,127 +13,49 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-05-28 and added to Hugging Face Transformers on 2021-06-01.*
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2021-05-28 and added to Hugging Face Transformers on 2021-06-01 and contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).*
|
||||
|
||||
# ByT5
|
||||
|
||||
[ByT5](https://huggingface.co/papers/2105.13626) is a tokenizer-free version of the [T5](./t5) model designed to work directly on raw UTF-8 bytes. It can process any language, is more robust to noise like typos, and is simpler to use because it doesn't require a preprocessing pipeline.
|
||||
|
||||
You can find all the original ByT5 checkpoints under the [Google](https://huggingface.co/google?search_models=byt5) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the [T5](./t5) docs for more examples of how to apply ByT5 to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://huggingface.co/papers/2105.13626) explores the use of standard Transformer architectures to process byte sequences directly, eliminating the need for tokenization. This approach offers benefits such as language-agnostic processing, robustness to noise, and reduced preprocessing complexity. The study demonstrates that byte-level models can compete with token-level models in terms of parameter count, training computational cost, and inference speed. Additionally, byte-level models show superior performance on tasks sensitive to spelling and pronunciation. The paper introduces a new set of pre-trained byte-level Transformer models based on the T5 architecture.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="text2text-generation",
|
||||
model="google/byt5-small",
|
||||
dtype=torch.float16,
|
||||
device=0
|
||||
)
|
||||
pipeline("translate English to French: The weather is nice today")
|
||||
pipeline = pipeline(task="text2text-generation", model="google/byt5-small", dtype="auto")
|
||||
pipeline("translate English to French: Plants generate energy through a process known as photosynthesis.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
"google/byt5-small"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/byt5-small",
|
||||
dtype=torch.float16,
|
||||
device_map="auto"
|
||||
)
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
|
||||
|
||||
input_ids = tokenizer("summarize: Photosynthesis is the process by which plants, algae, and some bacteria convert light energy into chemical energy.", return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
inputs = tokenizer("translate English to French: Plants generate energy through a process known as photosynthesis.", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers">
|
||||
|
||||
```bash
|
||||
echo -e "translate English to French: Life is beautiful." | transformers run --task text2text-generation --model google/byt5-small --device 0
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Quantization
|
||||
## Usage tips
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
|
||||
|
||||
```python
|
||||
# pip install torchao
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
|
||||
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained(
|
||||
"google/byt5-xl",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/byt5-xl")
|
||||
input_ids = tokenizer("translate English to French: The weather is nice today.", return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- It is recommended to use the tokenizer for batched inference and training.
|
||||
- The example below shows how to use the model without a tokenizer.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForSeq2SeqLM
|
||||
|
||||
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")
|
||||
|
||||
num_special_tokens = 3
|
||||
|
||||
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens
|
||||
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens
|
||||
loss = model(input_ids, labels=labels).loss
|
||||
loss.item()
|
||||
```
|
||||
|
||||
- ByT5 uses the top byte values (258, 257, etc.) for masking instead of sentinel tokens like `{extra_id_0}`.
|
||||
|
||||
```python
|
||||
# Example: character-level denoising with mask tokens
# (reuses `model` from the previous example; load the matching tokenizer here)
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

input_ids = tokenizer("The dog chases a ball in the park.").input_ids
masked_input = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
output = model.generate(masked_input, max_length=100)
|
||||
```
|
||||
- Use the tokenizer for batched inference and training.
|
||||
- ByT5 uses top byte values (258, 257, etc.) for masking instead of sentinel tokens like `{extra_id_0}`. Regular characters map to their UTF-8 byte values plus 3, as the sketch below shows.
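The sketch below shows the byte-to-id mapping: each UTF-8 byte is shifted by 3 to make room for the pad, end-of-sequence, and unknown special tokens.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

text = "hello"
print(tokenizer(text).input_ids)              # each byte + 3, followed by the eos token (1)
print([b + 3 for b in text.encode("utf-8")])  # [107, 104, 111, 111, 114]
```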
|
||||
|
||||
## ByT5Tokenizer
|
||||
|
||||
[[autodoc]] ByT5Tokenizer
|
||||
|
||||
See [`ByT5Tokenizer`] for all details.
|
||||
|
||||
|
@ -13,108 +13,50 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16 and contributed by [almanach](https://huggingface.co/almanach).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# CamemBERT
|
||||
|
||||
[CamemBERT](https://huggingface.co/papers/1911.03894) is a language model based on [RoBERTa](./roberta), but trained specifically on French text from the OSCAR dataset, making it more effective for French language tasks.
|
||||
|
||||
What sets CamemBERT apart is that it learned from a large, high-quality collection of French data rather than a mix of many languages. This helps it understand French better than many multilingual models.
|
||||
|
||||
Common applications of CamemBERT include masked language modeling (fill-mask prediction), text classification (sentiment analysis), token classification (entity recognition), and sentence pair classification (entailment tasks).
|
||||
|
||||
You can find all the original CamemBERT checkpoints under the [ALMAnaCH](https://huggingface.co/almanach/models?search=camembert) organization.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by the [ALMAnaCH (Inria)](https://huggingface.co/almanach) team.
|
||||
>
|
||||
> Click on the CamemBERT models in the right sidebar for more examples of how to apply CamemBERT to different NLP tasks.
|
||||
|
||||
The examples below demonstrate how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[CamemBERT: a Tasty French Language Model](https://huggingface.co/papers/1911.03894) is a French version of the BERT model, trained on 138GB of French text. It addresses the limitation of existing models that are either English-centric or multilingual, offering improved performance in French-specific tasks such as part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. The pretrained CamemBERT model is released to encourage further research and applications in French NLP.
|
||||
|
||||
<hfoptions id="usage">
|
||||
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline("fill-mask", model="camembert-base", dtype=torch.float16, device=0)
|
||||
pipeline("Le camembert est un délicieux fromage <mask>.")
|
||||
pipeline = pipeline(task="fill-mask", model="almanach/camembert-base", dtype="auto")
|
||||
pipeline("Les plantes créent <mask> grâce à un processus appelé photosynthèse.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForMaskedLM
|
||||
from transformers import AutoModelForMaskedLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
|
||||
model = AutoModelForMaskedLM.from_pretrained("camembert-base", dtype="auto", device_map="auto", attn_implementation="sdpa")
|
||||
inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to(model.device)
|
||||
model = AutoModelForMaskedLM.from_pretrained("almanach/camembert-base", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-base")
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
inputs = tokenizer("Les plantes créent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
mask_token_id = tokenizer.mask_token_id
|
||||
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
|
||||
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
|
||||
print(f"Predicted word: {predicted_word}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Le camembert est un délicieux fromage <mask>." | transformers run --task fill-mask --model camembert-base --device 0
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing weights in lower precision. Refer to the [Quantization](../quantization/overview) overview for available options.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) quantization to quantize the weights to 8-bits.
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForMaskedLM, BitsAndBytesConfig
|
||||
import torch
|
||||
|
||||
quant_config = BitsAndBytesConfig(load_in_8bit=True)
|
||||
model = AutoModelForMaskedLM.from_pretrained(
|
||||
"almanach/camembert-large",
|
||||
quantization_config=quant_config,
|
||||
device_map="auto"
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-large")
|
||||
|
||||
inputs = tokenizer("Le camembert est un délicieux fromage <mask>.", return_tensors="pt").to(model.device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
predictions = outputs.logits
|
||||
|
||||
masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
|
||||
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
|
||||
predicted_token = tokenizer.decode(predicted_token_id)
|
||||
|
||||
print(f"The predicted token is: {predicted_token}")
|
||||
```
|
||||
|
||||
## CamembertConfig
|
||||
|
||||
[[autodoc]] CamembertConfig
|
||||
@ -158,3 +100,4 @@ print(f"The predicted token is: {predicted_token}")
|
||||
## CamembertForQuestionAnswering
|
||||
|
||||
[[autodoc]] CamembertForQuestionAnswering
|
||||
|
||||
|
@ -13,24 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-03-11 and added to Hugging Face Transformers on 2021-06-30.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2021-03-11 and added to Hugging Face Transformers on 2021-06-30 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# CANINE
|
||||
|
||||
[CANINE](https://huggingface.co/papers/2103.06874) is a tokenization-free Transformer. It skips the usual step of splitting text into subwords or wordpieces and processes text character by character. That means it works directly with raw Unicode, making it especially useful for languages with complex or inconsistent tokenization rules and even noisy inputs like typos. Since working with characters means handling longer sequences, CANINE uses a smart trick. The model compresses the input early on (called downsampling) so the transformer doesn't have to process every character individually. This keeps things fast and efficient.
|
||||
|
||||
You can find all the original CANINE checkpoints under the [Google](https://huggingface.co/google?search_models=canine) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the CANINE models in the right sidebar for more examples of how to apply CANINE to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate embeddings with [`Pipeline`], [`AutoModel`], and from the command line.
|
||||
[CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://huggingface.co/papers/2103.06874) presents CANINE, a neural encoder that processes text directly at the Unicode character level without explicit tokenization or vocabulary. It addresses the challenges of varying language suitability and vocabulary limitations by using a downsampling strategy to manage longer sequences and a deep Transformer stack to capture context. CANINE achieves a 2.8 F1 score improvement on TyDi QA compared to a similar mBERT model, despite having 28% fewer parameters.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -39,13 +26,8 @@ The example below demonstrates how to generate embeddings with [`Pipeline`], [`A
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(
|
||||
task="feature-extraction",
|
||||
model="google/canine-c",
|
||||
device=0,
|
||||
)
|
||||
|
||||
pipeline("Plant create energy through a process known as photosynthesis.")
|
||||
pipeline = pipeline(task="text-classification", model="google/canine-s", dtype="auto")
|
||||
pipeline("Plants are amazing because they can create energy from the sun.")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -53,41 +35,25 @@ pipeline("Plant create energy through a process known as photosynthesis.")
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModel
|
||||
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
||||
|
||||
model = AutoModel.from_pretrained("google/canine-c")
|
||||
model = AutoModelForSequenceClassification.from_pretrained("google/canine-s", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/canine-s")
|
||||
|
||||
text = "Plant create energy through a process known as photosynthesis."
|
||||
input_ids = torch.tensor([[ord(char) for char in text]])
|
||||
|
||||
outputs = model(input_ids)
|
||||
pooled_output = outputs.pooler_output
|
||||
sequence_output = outputs.last_hidden_state
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
echo -e "Plant create energy through a process known as photosynthesis." | transformers run --task feature-extraction --model google/canine-c --device 0
|
||||
inputs = tokenizer("Plants are amazing because they can create energy from the sun.", return_tensors="pt")
|
||||
outputs = model(**inputs)
|
||||
predicted_class_id = outputs.logits.argmax(dim=-1).item()
|
||||
label = model.config.id2label[predicted_class_id]
|
||||
print(f"Predicted label: {label}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
## Usage tips
|
||||
|
||||
- CANINE skips tokenization entirely; it works directly on raw characters, not subwords. Use it with or without a tokenizer. For batched inference and training, use the tokenizer to pad and truncate all sequences to the same length.
|
||||
|
||||
```py
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("google/canine-c")
|
||||
inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
|
||||
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
|
||||
```
|
||||
|
||||
- CANINE is primarily designed to be fine-tuned on a downstream task. The pretrained model can be used for either masked language modeling or next sentence prediction.
|
||||
- CANINE skips tokenization entirely. It works directly on raw characters, not subwords. Use it with or without a tokenizer (a tokenizer-free sketch follows these notes). For batched inference and training, use the tokenizer to pad and truncate all sequences to the same length.
|
||||
- CANINE is designed for fine-tuning on downstream tasks. The pretrained model handles masked language modeling or next sentence prediction.
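A minimal tokenizer-free sketch; CANINE consumes raw Unicode code points directly.

```py
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("google/canine-c")

text = "Plants create energy through photosynthesis."
input_ids = torch.tensor([[ord(char) for char in text]])

outputs = model(input_ids)
print(outputs.last_hidden_state.shape)  # one hidden state per character
```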
|
||||
|
||||
## CanineConfig
|
||||
|
||||
@ -128,3 +94,4 @@ echo -e "Plant create energy through a process known as photosynthesis." | trans
|
||||
|
||||
[[autodoc]] CanineForQuestionAnswering
|
||||
- forward
|
||||
|
||||
|
@ -13,163 +13,54 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-05-16 and added to Hugging Face Transformers on 2024-07-17.*
|
||||
*This model was released on 2024-05-16 and added to Hugging Face Transformers on 2024-07-17 and contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Chameleon
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://huggingface.co/papers/2405.09818v1) is a Vision-Language Model that uses vector quantization to tokenize images, enabling it to generate multimodal output. It handles images and texts in any sequence, including interleaved formats, and produces textual responses. Chameleon demonstrates superior performance in image captioning, outperforms Llama-2 in text-only tasks, and is competitive with Mixtral 8x7B and Gemini-Pro. It also performs non-trivial image generation and matches or exceeds the performance of larger models like Gemini Pro and GPT-4V in long-form mixed-modal generation tasks.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://huggingface.co/papers/2405.09818) by the Meta AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including an interleaved format, and generates textual responses. The image generation module is not released yet.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training
|
||||
approach from inception, an alignment recipe, and an architectural parameterization tailored for the
|
||||
early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range
|
||||
of tasks, including visual question answering, image captioning, text generation, image generation, and
|
||||
long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including
|
||||
state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while
|
||||
being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image
|
||||
generation, all in a single model. It also matches or exceeds the performance of much larger models,
|
||||
including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal
|
||||
generation evaluation, where either the prompt or outputs contain mixed sequences of both images and
|
||||
text. Chameleon marks a significant step forward in unified modeling of full multimodal documents*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/chameleon_arch.png"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Chameleon incorporates a vector quantizer module to transform images into discrete tokens. That also enables image generation using an auto-regressive transformer. Taken from the <a href="https://huggingface.co/papers/2405.09818">original paper.</a> </small>
|
||||
|
||||
This model was contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
|
||||
The original code can be found [here](https://github.com/facebookresearch/chameleon).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- Use `padding_side="left"` when computing batched generation because it leads to more accurate results. Set `processor.tokenizer.padding_side = "left"` before generating (see the sketch after these tips).
|
||||
|
||||
- Note that Chameleon was tuned for safety alignment. If the model is refusing to answer, consider asking a more concrete question, instead of an open question.
|
||||
|
||||
- Chameleon generates in chat format which means that the generated text will always be the "assistant's turn". You can enable a text completion generation by passing `return_for_text_completion=True` when calling the processor.
|
||||
|
||||
> [!NOTE]
|
||||
> The Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. Instead of adding a new token, it reuses one of the reserved tokens: `<reserved08707>`. Add `<image>` to your prompt where the image should be embedded for correct generation.
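A minimal sketch of left-padded batched generation with text-only prompts, assuming access to the gated checkpoint; the prompts are placeholders.

```py
import torch
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")

# left padding gives more accurate results for batched generation
processor.tokenizer.padding_side = "left"

prompts = ["Describe a cat in one sentence.", "Explain photosynthesis in one sentence."]
inputs = processor(text=prompts, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(output, skip_special_tokens=True))
```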
|
||||
|
||||
## Usage example
|
||||
|
||||
### Single image inference
|
||||
|
||||
Chameleon is a gated model so make sure to have access and login to Hugging Face Hub using a token.
|
||||
Here's how to load the model and perform inference in half-precision (`torch.bfloat16`):
|
||||
|
||||
```python
|
||||
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="image-to-text", model="facebook/chameleon-7b", dtype="auto")
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", text="What is shown in this image? <image>"
|
||||
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="ChameleonForConditionalGeneration">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from PIL import Image
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, ChameleonForConditionalGeneration
|
||||
|
||||
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")
|
||||
processor = AutoProcessor.from_pretrained("facebook/chameleon-7b")
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype="auto")
|
||||
|
||||
# prepare image and text prompt
|
||||
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
prompt = "What do you see in this image?<image>"
|
||||
prompt = "What is shown in this image?<image>"
|
||||
|
||||
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
|
||||
|
||||
# autoregressively complete prompt
|
||||
inputs = processor(images=image, text=prompt, return_tensors="pt").to(torch.bfloat16)
|
||||
output = model.generate(**inputs, max_new_tokens=50)
|
||||
print(processor.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
### Multi image inference
|
||||
|
||||
Chameleon can perform inference with multiple images as input, where the images belong either to the same prompt or to different prompts (in batched inference). Here's how:
|
||||
|
||||
```python
|
||||
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
|
||||
import torch
|
||||
from PIL import Image
|
||||
import requests
|
||||
|
||||
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
|
||||
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto")
|
||||
|
||||
# Get three different images
|
||||
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
|
||||
image_stop = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image_cats = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
|
||||
image_snowman = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
# Prepare a batched prompt, where the first one is a multi-image prompt and the second is not
|
||||
prompts = [
|
||||
"What do these images have in common?<image><image>",
|
||||
"<image>What is shown in this image?"
|
||||
]
|
||||
|
||||
# Feed images in the order they appear in the text prompt
# Each "<image>" token consumes one image, leaving the rest for subsequent "<image>" tokens
|
||||
inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(device=model.device, dtype=torch.bfloat16)
|
||||
|
||||
# Generate
|
||||
generate_ids = model.generate(**inputs, max_new_tokens=50)
|
||||
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
|
||||
```
|
||||
|
||||
## Model optimization
|
||||
|
||||
### Quantization using Bitsandbytes
|
||||
|
||||
The model can be loaded in 8-bit or 4-bit precision, greatly reducing memory requirements while maintaining the performance of the original model. First install bitsandbytes (`pip install bitsandbytes`) and make sure to have access to a GPU or accelerator supported by the library.
|
||||
|
||||
<Tip>
|
||||
|
||||
bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
|
||||
|
||||
We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
|
||||
|
||||
</Tip>
|
||||
|
||||
Simply change the snippet above with:
|
||||
|
||||
```python
|
||||
from transformers import ChameleonForConditionalGeneration, BitsAndBytesConfig
|
||||
|
||||
# specify how to quantize the model
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.bfloat16,
|
||||
)
|
||||
|
||||
model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", quantization_config=quantization_config, device_map="auto")
|
||||
```
|
||||
|
||||
### Use Flash Attention 2 and SDPA to further speed up generation

The model supports both Flash Attention 2 and PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA). SDPA is the default when you load the model. To switch to Flash Attention 2, first install flash-attn; refer to the [original repository](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then change the snippet above to:
|
||||
|
||||
```python
|
||||
from transformers import ChameleonForConditionalGeneration
|
||||
|
||||
model_id = "facebook/chameleon-7b"
|
||||
model = ChameleonForConditionalGeneration.from_pretrained(
|
||||
model_id,
|
||||
dtype=torch.bfloat16,
|
||||
attn_implementation="flash_attention_2"
|
||||
).to(0)
|
||||
```
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ChameleonConfig
|
||||
|
||||
@ -207,3 +98,4 @@ model = ChameleonForConditionalGeneration.from_pretrained(
|
||||
|
||||
[[autodoc]] ChameleonForConditionalGeneration
|
||||
- forward
|
||||
|
||||
|
@ -13,65 +13,41 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-11-02 and added to Hugging Face Transformers on 2022-12-01.*
|
||||
*This model was released on 2022-11-02 and added to Hugging Face Transformers on 2022-12-01 and contributed by [OFA-Sys](https://huggingface.co/OFA-Sys).*
|
||||
|
||||
# Chinese-CLIP
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[Chinese-CLIP](https://huggingface.co/papers/2211.01335) constructs a large-scale dataset of Chinese image-text pairs and pretrains models of varying sizes, from 77 to 958 million parameters. It employs a two-stage pretraining method, initially freezing the image encoder before optimizing all parameters. Experiments show superior performance on MUGE, Flickr30K-CN, and COCO-CN for zero-shot learning and finetuning, and competitive results in zero-shot image classification on the ELEVATER benchmark.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="ChineseCLIPModel">
|
||||
|
||||
The Chinese-CLIP model was proposed in [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://huggingface.co/papers/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
|
||||
Chinese-CLIP is an implementation of CLIP (Radford et al., 2021) on a large-scale dataset of Chinese image-text pairs. It is capable of performing cross-modal retrieval and also serving as a vision backbone for vision tasks like zero-shot image classification and open-domain object detection. The original Chinese-CLIP code is released [at this link](https://github.com/OFA-Sys/Chinese-CLIP).
|
||||
```py
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, ChineseCLIPModel
|
||||
|
||||
The abstract from the paper is the following:
|
||||
model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
|
||||
|
||||
*The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). Our codes, pretrained models, and demos have been released.*
|
||||
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
# Squirtle, Bulbasaur, Charmander, Pikachu in English
|
||||
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
|
||||
|
||||
The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys).
|
||||
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image
|
||||
probs = logits_per_image.softmax(dim=1)
|
||||
|
||||
## Usage example
|
||||
|
||||
The code snippet below shows how to compute image & text features and similarities:
|
||||
|
||||
```python
|
||||
>>> from PIL import Image
|
||||
>>> import requests
|
||||
>>> from transformers import ChineseCLIPProcessor, ChineseCLIPModel
|
||||
|
||||
>>> model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
|
||||
>>> processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
|
||||
|
||||
>>> url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
>>> # Squirtle, Bulbasaur, Charmander, Pikachu in English
|
||||
>>> texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
|
||||
|
||||
>>> # compute image feature
|
||||
>>> inputs = processor(images=image, return_tensors="pt")
|
||||
>>> image_features = model.get_image_features(**inputs)
|
||||
>>> image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True) # normalize
|
||||
|
||||
>>> # compute text features
|
||||
>>> inputs = processor(text=texts, padding=True, return_tensors="pt")
|
||||
>>> text_features = model.get_text_features(**inputs)
|
||||
>>> text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True) # normalize
|
||||
|
||||
>>> # compute image-text similarity scores
|
||||
>>> inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
|
||||
>>> outputs = model(**inputs)
|
||||
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
||||
>>> probs = logits_per_image.softmax(dim=1) # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]
|
||||
print("Text-image similarity probabilities:")
|
||||
for i, (text, prob) in enumerate(zip(texts, probs[0])):
|
||||
print(f"'{text}' -> {prob.item():.4f} ({prob.item()*100:.1f}%)")
|
||||
```
|
||||
|
||||
Currently, the following scales of pretrained Chinese-CLIP models are available on the 🤗 Hub (a pipeline sketch follows the list):
|
||||
|
||||
- [OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16)
|
||||
- [OFA-Sys/chinese-clip-vit-large-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14)
|
||||
- [OFA-Sys/chinese-clip-vit-large-patch14-336px](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px)
|
||||
- [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14)
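A sketch of the same labels through the zero-shot image classification pipeline; `hypothesis_template="{}"` keeps the prompt fully in Chinese.

```py
from transformers import pipeline

classifier = pipeline(task="zero-shot-image-classification", model="OFA-Sys/chinese-clip-vit-base-patch16")
classifier(
    "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg",
    candidate_labels=["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"],
    hypothesis_template="{}",
)
```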
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## ChineseCLIPConfig
|
||||
|
||||
@ -115,3 +91,4 @@ Currently, following scales of pretrained Chinese-CLIP models are available on
|
||||
|
||||
[[autodoc]] ChineseCLIPVisionModel
|
||||
- forward
|
||||
|
||||
|
@ -13,48 +13,35 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-02-16.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2022-11-12 and added to Hugging Face Transformers on 2023-02-16 and contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).*
|
||||
|
||||
# CLAP
|
||||
|
||||
[CLAP (Contrastive Language-Audio Pretraining)](https://huggingface.co/papers/2211.06687) is a multimodal model that combines audio data with natural language descriptions through contrastive learning.
|
||||
|
||||
It incorporates feature fusion and keyword-to-caption augmentation to process variable-length audio inputs and to improve performance. CLAP doesn't require task-specific training data and can learn meaningful audio representations through natural language.
|
||||
|
||||
You can find all the original CLAP checkpoints under the [CLAP](https://huggingface.co/collections/laion/clap-contrastive-language-audio-pretraining-65415c0b18373b607262a490) collection.
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).
|
||||
>
|
||||
> Click on the CLAP models in the right sidebar for more examples of how to apply CLAP to different audio retrieval and classification tasks.
|
||||
|
||||
The example below demonstrates how to extract text embeddings with the [`AutoModel`] class.
|
||||
[CLAP](https://huggingface.co/papers/2211.06687) is a neural network trained on a large dataset of audio-text pairs to develop a multimodal representation. It uses a SWINTransformer for audio feature extraction from log-Mel spectrograms and a RoBERTa model for text feature extraction. Both feature sets are projected into a shared latent space, where their similarity is measured using a dot product. The model incorporates feature fusion and keyword-to-caption augmentation to handle variable audio lengths and improve performance. Evaluations across text-to-audio retrieval, zero-shot audio classification, and supervised audio classification show that CLAP achieves superior results in text-to-audio retrieval and state-of-the-art performance in zero-shot audio classification, comparable to non-zero-shot models.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="AutoModel">
|
||||
<hfoption id="ClapModel">
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
```py
|
||||
from datasets import load_dataset
|
||||
from transformers import AutoProcessor, ClapModel
|
||||
|
||||
model = AutoModel.from_pretrained("laion/clap-htsat-unfused", dtype=torch.float16, device_map="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("laion/clap-htsat-unfused")
|
||||
dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example")
|
||||
audio_sample = dataset["train"]["audio"][0]["array"]
|
||||
|
||||
texts = ["the sound of a cat", "the sound of a dog", "music playing"]
|
||||
model = ClapModel.from_pretrained("laion/clap-htsat-unfused", dtype="auto")
|
||||
processor = AutoProcessor.from_pretrained("laion/clap-htsat-unfused")
|
||||
|
||||
inputs = tokenizer(texts, padding=True, return_tensors="pt").to(model.device)
|
||||
input_text = ["Sound of a dog", "Sound of vacuum cleaner"]
|
||||
|
||||
with torch.no_grad():
|
||||
text_features = model.get_text_features(**inputs)
|
||||
inputs = processor(text=input_text, audios=audio_sample, return_tensors="pt", padding=True)
|
||||
|
||||
print(f"Text embeddings shape: {text_features.shape}")
|
||||
print(f"Text embeddings: {text_features}")
|
||||
outputs = model(**inputs)
|
||||
logits_per_audio = outputs.logits_per_audio
|
||||
probs = logits_per_audio.softmax(dim=-1)
|
||||
|
||||
for i, prob in enumerate(probs[0]):
|
||||
print(f"{input_text[i]}: {prob.item():.3f}")
|
||||
```
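The same comparison, sketched with the zero-shot audio classification pipeline.

```py
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example")
audio_sample = dataset["train"]["audio"][0]["array"]

classifier = pipeline(task="zero-shot-audio-classification", model="laion/clap-htsat-unfused")
classifier(audio_sample, candidate_labels=["Sound of a dog", "Sound of vacuum cleaner"])
```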
|
||||
|
||||
</hfoption>
|
||||
@ -106,3 +93,4 @@ print(f"Text embeddings: {text_features}")
|
||||
|
||||
[[autodoc]] ClapAudioModelWithProjection
|
||||
- forward
|
||||
|
||||
|
@ -13,11 +13,10 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-02-26 and added to Hugging Face Transformers on 2021-05-12.*
|
||||
*This model was released on 2021-02-26 and added to Hugging Face Transformers on 2021-05-12 and contributed by [valhalla](https://huggingface.co/valhalla).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
@ -25,14 +24,7 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# CLIP
|
||||
|
||||
[CLIP](https://huggingface.co/papers/2103.00020) is a multimodal vision and language model motivated by overcoming the fixed number of object categories used when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining on this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and text encoder to get visual features and text features. Both features are projected to a latent space with the same number of dimensions, and their dot product gives a similarity score.
|
||||
|
||||
You can find all the original CLIP checkpoints under the [OpenAI](https://huggingface.co/openai?search_models=clip) organization.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks.
|
||||
|
||||
The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
[CLIP](https://huggingface.co/papers/2103.00020) is a neural network trained on 400 million (image, text) pairs from the internet. It learns to predict which caption corresponds to which image, enabling zero-shot transfer to various computer vision tasks. Benchmarked on over 30 datasets, CLIP demonstrates competitive performance without task-specific training, matching ResNet-50's accuracy on ImageNet zero-shot without using its training examples.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -41,49 +33,49 @@ The example below demonstrates how to calculate similarity scores between multip
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
clip = pipeline(
|
||||
task="zero-shot-image-classification",
|
||||
model="openai/clip-vit-base-patch32",
|
||||
dtype=torch.bfloat16,
|
||||
device=0
|
||||
)
|
||||
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
|
||||
clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
|
||||
pipeline = pipeline(task="zero-shot-image-classification", model="openai/clip-vit-base-patch32", dtype="auto")
|
||||
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
|
||||
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", candidate_labels=candidate_labels)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
|
||||
import requests
|
||||
import torch
|
||||
import requests
|
||||
from PIL import Image
|
||||
from transformers import AutoProcessor, AutoModel
|
||||
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
|
||||
|
||||
model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", dtype=torch.bfloat16, attn_implementation="sdpa")
|
||||
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
||||
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-base-patch32", dtype="auto")
|
||||
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
|
||||
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
|
||||
image = requests.get(url, stream=True)
|
||||
inputs = Image.open(image.raw).convert("RGB")
|
||||
|
||||
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
|
||||
image_inputs = processor(images=inputs, return_tensors="pt").to(model.device)
|
||||
with torch.no_grad():
|
||||
image_embeds = model.get_image_features(**image_inputs)
|
||||
|
||||
outputs = model(**inputs)
|
||||
logits_per_image = outputs.logits_per_image
|
||||
probs = logits_per_image.softmax(dim=1)
|
||||
most_likely_idx = probs.argmax(dim=1).item()
|
||||
most_likely_label = labels[most_likely_idx]
|
||||
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
|
||||
candidate_labels = ["a photo of a dog", "a photo of a cat", "a photo of a person"]
|
||||
text_inputs = processor(text=candidate_labels, padding=True, return_tensors="pt").to(model.device)
|
||||
with torch.no_grad():
|
||||
text_embeds = model.get_text_features(**text_inputs)
|
||||
|
||||
image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
|
||||
text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
|
||||
|
||||
logits = (image_embeds @ text_embeds.T) * 100.0
|
||||
probs = logits.softmax(dim=-1).cpu().squeeze()
|
||||
|
||||
for label, score in zip(candidate_labels, probs):
|
||||
print(f"{label:20s} → {score.item():.4f}")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
|
||||
- Use [`CLIPImageProcessor`] to resize (or rescale) and normalize images for the model, as shown in the sketch below.
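A small sketch of the preprocessing step on its own.

```py
import requests
from PIL import Image
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])
```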
|
||||
|
||||
## CLIPConfig
|
||||
|
||||
[[autodoc]] CLIPConfig
|
||||
@ -153,3 +145,4 @@ print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_
|
||||
|
||||
[[autodoc]] CLIPForImageClassification
|
||||
- forward
|
||||
|
||||
|
@ -13,61 +13,41 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2021-12-18 and added to Hugging Face Transformers on 2022-11-08.*
|
||||
*This model was released on 2021-12-18 and added to Hugging Face Transformers on 2022-11-08 and contributed by [nielsr](https://huggingface.co/nielsr).*
|
||||
|
||||
# CLIPSeg
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CLIPSeg](https://huggingface.co/papers/2112.10003) extends the CLIP model with a transformer-based decoder to enable zero-shot and one-shot image segmentation using arbitrary text or image prompts. This unified model can handle referring expression segmentation, zero-shot segmentation, and one-shot segmentation tasks. Trained on an extended PhraseCut dataset, CLIPSeg generates binary segmentation maps based on free-text or image queries, demonstrating adaptability to various binary segmentation tasks involving affordances or properties.
|
||||
|
||||
## Overview

The CLIPSeg model was proposed in [Image Segmentation Using Text and Image Prompts](https://huggingface.co/papers/2112.10003) by Timo Lüddecke and Alexander Ecker. CLIPSeg adds a minimal decoder on top of a frozen [CLIP](clip) model for zero-shot and one-shot image segmentation.

The abstract from the paper is the following:

*Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/clipseg_architecture.png" alt="drawing" width="600"/>

<small> CLIPSeg overview. Taken from the <a href="https://huggingface.co/papers/2112.10003">original paper.</a> </small>

<hfoptions id="usage">
<hfoption id="CLIPSegModel">

```py
import torch
from transformers import AutoProcessor, CLIPSegModel
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegModel.from_pretrained("CIDAS/clipseg-rd64-refined", dtype="auto")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(
    text=texts, images=image, return_tensors="pt", padding=True
)

with torch.inference_mode():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

print("Text-image similarity probabilities:")
for text, prob in zip(texts, probs[0]):
    print(f"'{text}' -> {prob.item():.4f} ({prob.item()*100:.1f}%)")
```

</hfoption>
</hfoptions>
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/timojl/clipseg).
|
||||
|
||||
## Usage tips
|
||||
|
||||
- [`CLIPSegForImageSegmentation`] adds a decoder on top of [`CLIPSegModel`]. The latter is identical to [`CLIPModel`].
- [`CLIPSegForImageSegmentation`] can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text (provided to the model as `input_ids`) or an image (provided to the model as `conditional_pixel_values`). One can also provide custom conditional embeddings (provided to the model as `conditional_embeddings`). See the sketch after this list.
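
A minimal sketch of text-prompted segmentation with [`CLIPSegForImageSegmentation`], assuming the `CIDAS/clipseg-rd64-refined` checkpoint; the prompts and the sigmoid post-processing are illustrative rather than prescriptive.

```py
import torch
from transformers import AutoProcessor, CLIPSegForImageSegmentation
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
prompts = ["a cat", "a blanket"]

# One text prompt per copy of the image
inputs = processor(text=prompts, images=[image] * len(prompts), padding=True, return_tensors="pt")

with torch.inference_mode():
    outputs = model(**inputs)

# outputs.logits is one low-resolution mask per prompt; sigmoid turns it into per-pixel scores
masks = outputs.logits.sigmoid()
print(masks.shape)
```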
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIPSeg. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
<PipelineTag pipeline="image-segmentation"/>
|
||||
|
||||
- A notebook that illustrates [zero-shot image segmentation with CLIPSeg](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/CLIPSeg/Zero_shot_image_segmentation_with_CLIPSeg.ipynb).
|
||||
|
||||
|
||||
## CLIPSegConfig
|
||||
|
||||
@ -106,3 +86,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
|
||||
[[autodoc]] CLIPSegForImageSegmentation
|
||||
- forward
|
||||
|
||||
|
@ -13,63 +13,36 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2023-05-12 and added to Hugging Face Transformers on 2023-11-10 and contributed by [susnato](https://huggingface.co/susnato).*
|
||||
|
||||
# CLVP
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CLVP](https://huggingface.co/papers/2305.07243) applies advancements from image generation, specifically autoregressive transformers and DDPMs, to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system.
|
||||
|
||||
## Overview

The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in [Better speech synthesis through scaling](https://huggingface.co/papers/2305.07243) by James Betker.

The abstract from the paper is the following:

*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.*

This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/neonbjb/tortoise-tts).

<hfoptions id="usage">
<hfoption id="ClvpModelForConditionalGeneration">

```py
import datasets
import torch
from transformers import AutoProcessor, ClvpModelForConditionalGeneration

text = "Plants create energy through a process known as photosynthesis."

ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050))
sample = ds[0]["audio"]

processor = AutoProcessor.from_pretrained("susnato/clvp_dev")
model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev", dtype="auto")

processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt")
outputs = model(**processor_output)
```

</hfoption>
</hfoptions>

## Usage tips

1. CLVP is an integral part of the Tortoise TTS model.
2. CLVP can be used to compare different generated speech candidates with the provided text, and the best speech tokens are forwarded to the diffusion model.
3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for Tortoise usage.
4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz, contrary to other audio models which expect 16 kHz.

## Brief explanation

- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.
- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio.
- The [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates.
- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]), which converts it into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space.
- At the end, each speech vector is compared with the text vector to find the speech vector most similar to the text vector.
- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method.

Example:

```python
>>> import datasets
>>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration

>>> # Define the text and load the audio (an example from the Hugging Face Hub using the `datasets` library).
>>> text = "This is an example text."

>>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050))
>>> sample = ds[0]["audio"]

>>> # Define processor and model.
>>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev")
>>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev")

>>> # Generate processor output and model output.
>>> processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt")
>>> generated_output = model.generate(**processor_output)
```
|
||||
|
||||
## ClvpConfig
|
||||
|
||||
[[autodoc]] ClvpConfig
|
||||
@ -122,3 +95,4 @@ Example :
|
||||
## ClvpDecoder
|
||||
|
||||
[[autodoc]] ClvpDecoder
|
||||
|
||||
|
@ -13,24 +13,11 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2023-08-24 and added to Hugging Face Transformers on 2023-08-25.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
*This model was released on 2023-04-27 and added to Hugging Face Transformers on 2023-08-25 and contributed by [ArthurZ](https://huggingface.co/ArthurZ).*
|
||||
|
||||
# CodeLlama
|
||||
|
||||
[Code Llama](https://huggingface.co/papers/2308.12950) is a specialized family of large language models based on [Llama 2](./llama2) for coding tasks. It comes in different flavors - general code, Python-specific, and instruction-following variant - all available in 7B, 13B, 34B, and 70B parameters. Code Llama models can generate, explain, and even fill in missing parts of your code (called "infilling"). It can also handle very long contexts with stable generation up to 100k tokens, even though it was trained on sequences of 16K tokens.
|
||||
|
||||
You can find all the original Code Llama checkpoints under the [Code Llama](https://huggingface.co/collections/meta-llama/code-llama-family-661da32d0a9d678b6f55b933) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Code Llama models in the right sidebar for more examples of how to apply Code Llama to different coding tasks.
|
||||
|
||||
The example below demonstrates how to generate code with [`Pipeline`], or the [`AutoModel`], and from the command line.
|
||||
[CodeLlama](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) is a family of large language models for code, built on Llama 2, offering state-of-the-art performance among open models. It includes foundation models, Python specializations, and instruction-following models in 7B, 13B, and 34B parameter sizes. These models support infilling, handle large input contexts, and perform zero-shot instruction following for programming tasks. Trained on sequences of 16k tokens, they show improvements with inputs up to 100k tokens. The 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama achieves top scores on HumanEval and MBPP benchmarks, with Code Llama - Python 7B outperforming Llama 2 70B on these tasks. All models outperform other publicly available models on MultiPL-E. Code Llama is released under a permissive license for both research and commercial use.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
@ -39,20 +26,8 @@ The example below demonstrates how to generate code with [`Pipeline`], or the [`
|
||||
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="meta-llama/CodeLlama-7b-hf", dtype="auto")

# basic code generation
result = pipeline("# Function to calculate the factorial of a number\ndef factorial(n):", max_new_tokens=256)
print(result[0]['generated_text'])

# infilling
infill_result = pipeline("def remove_non_ascii(s: str) -> str:\n    \"\"\" <FILL_ME>\n    return result", max_new_tokens=200)
print(infill_result[0]['generated_text'])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
@ -62,107 +37,24 @@ print(infill_result[0]['generated_text'])
|
||||
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/CodeLlama-7b-hf",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

# basic code generation
prompt = "# Function to calculate the factorial of a number\ndef factorial(n):"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **input_ids,
    max_new_tokens=256,
    cache_implementation="static"
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# infilling
infill_prompt = "def remove_non_ascii(s: str) -> str:\n    \"\"\" <FILL_ME>\n    return result"
input_ids = tokenizer(infill_prompt, return_tensors="pt").to(model.device)

filled_output = model.generate(**input_ids, max_new_tokens=200)
filled_text = tokenizer.decode(filled_output[0], skip_special_tokens=True)
print(filled_text)
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
echo -e "# Function to calculate the factorial of a number\ndef factorial(n):" | transformers run --task text-generation --model meta-llama/CodeLlama-7b-hf --device 0
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.
|
||||
|
||||
```py
|
||||
# pip install bitsandbytes
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, CodeLlamaTokenizer, BitsAndBytesConfig
|
||||
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
|
||||
tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-34b-hf")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/CodeLlama-34b-hf",
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto",
|
||||
quantization_config=bnb_config
|
||||
)
|
||||
|
||||
prompt = "# Write a Python function to check if a string is a palindrome\ndef is_palindrome(s):"
|
||||
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||||
|
||||
output = model.generate(**input_ids, max_new_tokens=200, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
|
||||
|
||||
```py
|
||||
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
|
||||
|
||||
visualizer = AttentionMaskVisualizer("meta-llama/CodeLlama-7b-hf")
|
||||
visualizer("""def func(a, b):
|
||||
return a + b""")
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/codellama-attn-mask.png"/>
|
||||
</div>
|
||||
|
||||
## Notes
|
||||
|
||||
- Infilling is only available in the 7B and 13B base models, and not in the Python, Instruct, 34B, or 70B models.
|
||||
- Use the `<FILL_ME>` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself.
|
||||
|
||||
```py
|
||||
from transformers import LlamaForCausalLM, CodeLlamaTokenizer
|
||||
|
||||
tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")
|
||||
model = LlamaForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-hf")
|
||||
PROMPT = '''def remove_non_ascii(s: str) -> str:
|
||||
""" <FILL_ME>
|
||||
return result
|
||||
'''
|
||||
input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"]
|
||||
generated_ids = model.generate(input_ids, max_new_tokens=128)
|
||||
|
||||
filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0]
|
||||
print(PROMPT.replace("<FILL_ME>", filling))
|
||||
```
|
||||
|
||||
- Use `bfloat16` for further training or fine-tuning and `float16` for inference.
|
||||
- The `BOS` character is not used for infilling when encoding the prefix or suffix, but only at the beginning of each prompt.
- The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of a word (for example, "Banana"), the tokenizer doesn't prepend the prefix space to the string (see the sketch below).
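
A small sketch of that decoding behavior, assuming the `meta-llama/CodeLlama-7b-hf` tokenizer; the prompt text is incidental.

```py
from transformers import CodeLlamaTokenizer

tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")

ids = tokenizer("Banana", add_special_tokens=False)["input_ids"]
# No prefix space is prepended when the first token starts the word
print(repr(tokenizer.decode(ids)))  # expected: 'Banana'
```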
|
||||
|
||||
## CodeLlamaTokenizer
|
||||
|
||||
@ -180,3 +72,4 @@ visualizer("""def func(a, b):
|
||||
- create_token_type_ids_from_sequences
|
||||
- update_post_processor
|
||||
- save_vocabulary
|
||||
|
||||
|
@ -13,61 +13,40 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2022-03-25 and added to Hugging Face Transformers on 2022-06-24 and contributed by [rooa](https://huggingface.co/rooa).*
|
||||
|
||||
# CodeGen
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CodeGen](https://huggingface.co/papers/2203.13474) is an autoregressive language model designed for program synthesis through a conversational paradigm. Trained on diverse datasets including The Pile, BigQuery, and BigPython, CodeGen addresses challenges in program synthesis by treating it as a sequence prediction problem where specifications are expressed in natural language. The model demonstrates conversational capabilities and outperforms OpenAI's Codex on the HumanEval benchmark. A multi-turn programming benchmark (MTPB) was developed to evaluate the model's conversational program synthesis abilities.
|
||||
|
||||
## Overview

The CodeGen model was proposed in [A Conversational Paradigm for Program Synthesis](https://huggingface.co/papers/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.

CodeGen is an autoregressive language model for program synthesis trained sequentially on [The Pile](https://pile.eleuther.ai/), BigQuery, and BigPython.

The abstract from the paper is the following:

*Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI's Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: [this https URL](https://github.com/salesforce/codegen).*

This model was contributed by [Hiroaki Hayashi](https://huggingface.co/rooa). The original code can be found [here](https://github.com/salesforce/codegen).

## Checkpoint Naming

* CodeGen model [checkpoints](https://huggingface.co/models?other=codegen) are available on different pre-training data with variable sizes.
* The format is `Salesforce/codegen-{size}-{data}`, where
  * `size`: `350M`, `2B`, `6B`, `16B`
  * `data`:
    * `nl`: pre-trained on the Pile
    * `multi`: initialized with `nl`, then further pre-trained on multiple programming languages data
    * `mono`: initialized with `multi`, then further pre-trained on Python data
* For example, `Salesforce/codegen-350M-mono` offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python. A sketch of picking a checkpoint by size and data follows the usage examples below.

## Usage example

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> checkpoint = "Salesforce/codegen-350M-mono"
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)

>>> text = "def hello_world():"

>>> completion = model.generate(**tokenizer(text, return_tensors="pt"))

>>> print(tokenizer.decode(completion[0]))
def hello_world():
    print("Hello World")

hello_world()
```

## Resources

- [Causal language modeling task guide](../tasks/language_modeling)

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="Salesforce/codegen-350M-mono", dtype="auto")
pipeline("def fibonacci(n):")
```

</hfoption>
<hfoption id="AutoModel">
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono", dtype="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
|
||||
|
||||
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
|
||||
outputs = model.generate(**inputs, max_length=50)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
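
A minimal sketch of selecting a CodeGen checkpoint by the `{size}` and `{data}` suffixes described in the checkpoint naming section above; the specific combination here is only an example.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint names follow Salesforce/codegen-{size}-{data}
size = "2B"    # one of: 350M, 2B, 6B, 16B
data = "mono"  # one of: nl, multi, mono
checkpoint = f"Salesforce/codegen-{size}-{data}"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```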
|
||||
|
||||
## CodeGenConfig
|
||||
|
||||
@ -93,3 +72,4 @@ hello_world()
|
||||
|
||||
[[autodoc]] CodeGenForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -1,4 +1,5 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
@ -8,122 +9,57 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-03-12 and added to Hugging Face Transformers on 2024-03-15 and contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [ahmetustun](https://huggingface.co/ahmetustun).*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Command-R
|
||||
|
||||
Cohere [Command-R](https://cohere.com/blog/command-r) is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens.
|
||||
|
||||
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
|
||||
[Command-R](https://huggingface.co/papers/2310.06664) is a language model engineered for high-throughput, low-latency retrieval-augmented generation (RAG) and tool use at enterprise scale. It supports a 128,000-token context window, enabling it to reason over very long documents or dialogues, and integrates with external APIs/tools to automate multi-step tasks. The model is optimized for production usage (with strong performance per compute), and fine-tuning of Command R is emphasized as a cost-efficient way to specialize it further.
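
One way to exercise the grounded (RAG) generation described above is through the tokenizer's chat template. The `documents` argument and the `"rag"` template name follow the conventions documented for the Command-R checkpoints; treat this as a sketch and check the model card if the signature differs.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")

conversation = [{"role": "user", "content": "Which planet is known as the Red Planet?"}]
documents = [
    {"title": "Mars", "text": "Mars is often called the Red Planet because of its reddish appearance."},
    {"title": "Venus", "text": "Venus is the second planet from the Sun."},
]

# Render a grounded-generation prompt; pass the ids to model.generate() as in the examples below
input_ids = tokenizer.apply_chat_template(
    conversation,
    documents=documents,
    chat_template="rag",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
print(tokenizer.decode(input_ids[0])[:500])
```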
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="CohereLabs/c4ai-command-r-v01", dtype="auto")
pipeline("Plants create energy through a process known as photosynthesis.")
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")
model = AutoModelForCausalLM.from_pretrained("CohereLabs/c4ai-command-r-v01", dtype=torch.float16, device_map="auto", attn_implementation="sdpa")

# format message with the Command-R chat template
messages = [{"role": "user", "content": "How do plants make energy?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
# pip install -U flash-attn --no-build-isolation
|
||||
transformers chat CohereForAI/c4ai-command-r-v01 --dtype auto --attn_implementation flash_attention_2
|
||||
outputs = model.generate(input_ids, max_new_tokens=100, do_sample=True, temperature=0.3,)
|
||||
print(tokenizer.decode(outputs[0]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
|
||||
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
|
||||
model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01", dtype=torch.float16, device_map="auto", quantization_config=bnb_config, attn_implementation="sdpa")
|
||||
|
||||
# format message with the Command-R chat template
|
||||
messages = [{"role": "user", "content": "How do plants make energy?"}]
|
||||
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
output = model.generate(
|
||||
input_ids,
|
||||
max_new_tokens=100,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
cache_implementation="static",
|
||||
)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
|
||||
|
||||
```py
|
||||
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
|
||||
|
||||
visualizer = AttentionMaskVisualizer("CohereForAI/c4ai-command-r-v01")
|
||||
visualizer("Plants create energy through a process known as")
|
||||
```
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/cohere-attn-mask.png"/>
|
||||
</div>
|
||||
|
||||
## Notes
|
||||
|
||||
- Don't use the `dtype` parameter in [`~AutoModel.from_pretrained`] with FlashAttention-2. It only supports `fp16` or `bf16`. Use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set `fp16` or `bf16` to `True` with [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
|
||||
|
||||
## CohereConfig
|
||||
|
||||
@ -147,3 +83,4 @@ visualizer("Plants create energy through a process known as")
|
||||
|
||||
[[autodoc]] CohereForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -1,4 +1,5 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
@ -8,121 +9,49 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2024-12-13.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# Command R7B
|
||||
|
||||
[Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B billion parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k. The model features three layers with sliding window attention and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
|
||||
|
||||
This model is optimized for speed, cost-performance, and compute resources.
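
To see how a given checkpoint configures that attention layout, one option is to print its configuration; the exact attribute names for the sliding-window size and the layer interleaving pattern aren't guaranteed here and may change between releases.

```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
# Look for sliding-window and positional-embedding related fields in the printed config
print(config)
```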
|
||||
|
||||
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
|
||||
|
||||
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class, and from the command line.
|
||||
[Command R7B](https://cohere.com/blog/command-r7b) is Cohere's smallest model in the R series, optimized for speed, efficiency, and high-quality outputs on commodity GPUs and edge devices. It has 7 billion parameters and is fine-tuned for retrieval-augmented generation (RAG), enabling strong grounding in enterprise data while maintaining low latency. The model is designed to balance cost and performance, making it accessible for real-world applications like search, summarization, and knowledge management. R7B continues the R-series focus on practical deployment, emphasizing scalability and adaptability for business use cases.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="CohereLabs/c4ai-command-r7b-12-2024", dtype="auto")

messages = [
    {"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"},
]
pipeline(messages)
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/c4ai-command-r7b-12-2024",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

# format message with the Command-R chat template
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    cache_implementation="static",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="transformers CLI">
|
||||
|
||||
```bash
|
||||
# pip install -U flash-attn --no-build-isolation
|
||||
transformers chat CohereLabs/c4ai-command-r7b-12-2024 --dtype auto --attn_implementation flash_attention_2
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview.md) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to 4-bits.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
|
||||
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"CohereLabs/c4ai-command-r7b-12-2024",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
quantization_config=bnb_config,
|
||||
attn_implementation="sdpa"
|
||||
)
|
||||
|
||||
# format message with the Command-R chat template
|
||||
messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
|
||||
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
|
||||
output = model.generate(
|
||||
input_ids,
|
||||
max_new_tokens=100,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
cache_implementation="static",
|
||||
)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
|
||||
```
|
||||
|
||||
## Cohere2Config
|
||||
@ -138,3 +67,4 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
|
||||
[[autodoc]] Cohere2ForCausalLM
|
||||
- forward
|
||||
|
||||
|
@ -15,103 +15,65 @@ rendered properly in your Markdown viewer.
|
||||
-->
|
||||
*This model was released on 2025-07-31 and added to Hugging Face Transformers on 2025-07-31.*
|
||||
|
||||
# Command A Vision
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Overview
|
||||
# Command A Vision
|
||||
|
||||
Command A Vision ([blog post](https://cohere.com/blog/command-a-vision)) is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
|
||||
|
||||
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
|
||||
|
||||
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
|
||||
|
||||
## Usage tips
|
||||
|
||||
The model and image processor can be loaded as follows:
|
||||
[Command A Vision](https://cohere.com/blog/command-a-vision) is a state-of-the-art multimodal generative model optimized for enterprise use, excelling in both visual and text-based tasks. It outperforms leading models like GPT-4.1 and Llama 4 Maverick on benchmarks involving charts, diagrams, documents, and real-world imagery. The model features advanced document OCR with structured JSON outputs, strong scene understanding, and multilingual reasoning across industries such as finance, healthcare, and manufacturing. Designed for secure, efficient deployment, it runs on as little as one H100 or two A100 GPUs, enabling scalable on-premise or private enterprise applications.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
pipeline = pipeline(task="image-text-to-text", model="CohereLabs/command-a-vision-07-2025", dtype="auto")
|
||||
messages = [
|
||||
{"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
|
||||
{"type": "text", "text": "What is shown in this image?"},
|
||||
]},
|
||||
]
|
||||
pipeline(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-vision-07-2025"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", dtype=torch.float16
)

# Format message with the Command-A-Vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

print(processor.tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
```python
|
||||
from transformers import pipeline
|
||||
|
||||
pipe = pipeline(model="CohereLabs/command-a-vision-07-2025", task="image-text-to-text", device_map="auto")
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo=",
|
||||
},
|
||||
{"type": "text", "text": "Where was this taken ?"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
|
||||
|
||||
print(outputs)
|
||||
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
@ -1,4 +1,5 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
@ -8,50 +9,28 @@ Unless required by applicable law or agreed to in writing, software distributed
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
-->
|
||||
*This model was released on 2024-06-27 and added to Hugging Face Transformers on 2024-12-17.*
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
</div>
|
||||
-->
|
||||
*This model was released on 2024-06-27 and added to Hugging Face Transformers on 2024-12-17 and contributed by [tonywu71](https://huggingface.co/tonywu71) and [yonigozlan](https://huggingface.co/yonigozlan).*
|
||||
|
||||
# ColPali
|
||||
|
||||
[ColPali](https://huggingface.co/papers/2407.01449) is a model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColPali treats each page as an image. It uses [Paligemma-3B](./paligemma) to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
|
||||
|
||||
This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) (ILLUIN Technology) and [@yonigozlan](https://huggingface.co/yonigozlan) (HuggingFace).
|
||||
|
||||
You can find all the original ColPali checkpoints under Vidore's [Hf-native ColVision Models](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.
|
||||
|
||||
> [!TIP]
|
||||
> Click on the ColPali models in the right sidebar for more examples of how to use ColPali for image retrieval.
|
||||
[ColPali](https://huggingface.co/papers/2407.01449) is a retrieval model designed for visually rich documents that processes document pages as images rather than relying solely on text. It builds on recent vision-language models to generate high-quality contextualized embeddings that capture both textual and visual information. Using a late interaction matching mechanism, ColPali achieves faster and more accurate document retrieval compared to existing systems. The model is evaluated on the new Visual Document Retrieval Benchmark (ViDoRe), which spans diverse domains, languages, and retrieval settings.
|
||||
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="image retrieval">
|
||||
<hfoption id="ColPaliForRetrieval">
|
||||
|
||||
```python
|
||||
```py
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import ColPaliForRetrieval, ColPaliProcessor

# Load the model and the processor
model_name = "vidore/colpali-v1.3-hf"

model = ColPaliForRetrieval.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", "xpu", or "mps" for Apple Silicon
)
processor = ColPaliProcessor.from_pretrained(model_name)
|
||||
|
||||
# The document page screenshots from your corpus
|
||||
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
|
||||
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
|
||||
|
||||
@ -60,103 +39,26 @@ images = [
|
||||
Image.open(requests.get(url2, stream=True).raw),
|
||||
]
|
||||
|
||||
# The queries you want to retrieve documents for
|
||||
queries = [
|
||||
"When was the United States Declaration of Independence proclaimed?",
|
||||
"Who printed the edition of Romeo and Juliet?",
|
||||
]
|
||||
|
||||
# Process the inputs
|
||||
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)
|
||||
|
||||
# Forward pass
|
||||
with torch.no_grad():
|
||||
image_embeddings = model(**inputs_images).embeddings
|
||||
query_embeddings = model(**inputs_text).embeddings
|
||||
|
||||
# Score the queries against the images
|
||||
scores = processor.score_retrieval(query_embeddings, image_embeddings)
|
||||
|
||||
print("Retrieval scores (query x image):")
|
||||
print(scores)
|
||||
```
|
||||
|
||||
If you have issues loading the images with PIL, use the following code to create dummy images:
|
||||
|
||||
```python
|
||||
images = [
|
||||
Image.new("RGB", (128, 128), color="white"),
|
||||
Image.new("RGB", (64, 32), color="black"),
|
||||
]
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4.
|
||||
|
||||
```python
|
||||
import requests
|
||||
import torch
|
||||
from PIL import Image
|
||||
|
||||
from transformers import BitsAndBytesConfig, ColPaliForRetrieval, ColPaliProcessor
|
||||
|
||||
|
||||
model_name = "vidore/colpali-v1.3-hf"
|
||||
|
||||
# 4-bit quantization configuration
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_use_double_quant=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.float16,
|
||||
)
|
||||
|
||||
model = ColPaliForRetrieval.from_pretrained(
|
||||
model_name,
|
||||
quantization_config=bnb_config,
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
processor = ColPaliProcessor.from_pretrained(model_name)
|
||||
|
||||
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
|
||||
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
|
||||
|
||||
images = [
|
||||
Image.open(requests.get(url1, stream=True).raw),
|
||||
Image.open(requests.get(url2, stream=True).raw),
|
||||
]
|
||||
|
||||
queries = [
|
||||
"When was the United States Declaration of Independence proclaimed?",
|
||||
"Who printed the edition of Romeo and Juliet?",
|
||||
]
|
||||
|
||||
# Process the inputs
|
||||
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
|
||||
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)
|
||||
|
||||
# Forward pass
|
||||
with torch.no_grad():
|
||||
image_embeddings = model(**inputs_images).embeddings
|
||||
query_embeddings = model(**inputs_text).embeddings
|
||||
|
||||
# Score the queries against the images
|
||||
scores = processor.score_retrieval(query_embeddings, image_embeddings)
|
||||
|
||||
print("Retrieval scores (query x image):")
|
||||
print(scores)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- [`~ColPaliProcessor.score_retrieval`] returns a 2D tensor where the first dimension is the number of queries and the second dimension is the number of images. A higher score indicates more similarity between the query and image.
|
||||
|
||||
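
The following is a minimal sketch, reusing the `scores`, `queries`, and `images` variables from the example above, of how to read off the best-matching page per query. The `argmax` call and the printed format are illustrative choices, not part of the ColPali API.

```py
# Pick the highest-scoring image for each query
best_pages = scores.argmax(dim=1)

for i, query in enumerate(queries):
    page_idx = best_pages[i].item()
    print(f"{query} -> image {page_idx} (score {scores[i, page_idx].item():.2f})")
```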

## ColPaliConfig

[[autodoc]] ColPaliConfig

*This model was released on 2024-06-27 and added to Hugging Face Transformers on 2025-06-02 and contributed by [tonywu71](https://huggingface.co/tonywu71) and [yonigozlan](https://huggingface.co/yonigozlan).*

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

# ColQwen2

[ColQwen2](https://huggingface.co/papers/2407.01449) is a variant of the [ColPali](./colpali) model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the [Qwen2-VL](./qwen2_vl) backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.

Find all the original ColPali and ColQwen2 checkpoints under Vidore's [Hf-native ColVision Models](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.

> [!TIP]
> Click on the ColQwen2 models in the right sidebar for more examples of how to use ColQwen2 for image retrieval.

<hfoptions id="usage">
<hfoption id="ColQwen2ForRetrieval">

```python
import requests
import torch
from PIL import Image
from transformers import ColQwen2ForRetrieval, AutoProcessor

model = ColQwen2ForRetrieval.from_pretrained("vidore/colqwen2-v1.0-hf", dtype="auto")
processor = AutoProcessor.from_pretrained("vidore/colqwen2-v1.0-hf")

# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images).to(model.device)
inputs_text = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```

If PIL fails to load the images, create dummy images instead:

```python
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4.

```python
import requests
import torch
from PIL import Image
from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor
from accelerate import Accelerator

model_name = "vidore/colqwen2-v1.0-hf"
device = Accelerator().device

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```

## Notes

- [`~ColQwen2Processor.score_retrieval`] returns a 2D tensor where the first dimension is the number of queries and the second dimension is the number of images. A higher score indicates more similarity between the query and image.
- Unlike ColPali, ColQwen2 supports arbitrary image resolutions and aspect ratios, which means images are not resized into fixed-size squares. This preserves more of the original input signal.
- Larger input images generate longer multi-vector embeddings, allowing users to adjust image resolution to balance performance and memory usage, as shown in the sketch after this list.
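
The sketch below, which reuses `images`, `processor`, `model`, and `torch` from the first example, downscales pages before processing to shorten the embeddings. The 768-pixel budget is an illustrative assumption, not a recommended value.

```py
max_side = 768  # assumed budget; lower it to save memory, raise it for retrieval quality

small_images = []
for image in images:
    scale = max_side / max(image.size)
    if scale < 1:
        image = image.resize((int(image.width * scale), int(image.height * scale)))
    small_images.append(image)

inputs_images = processor(images=small_images).to(model.device)
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings  # fewer vectors per page than at full resolution
```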

## ColQwen2Config

[[autodoc]] ColQwen2Config

*This model was released on 2021-08-13 and added to Hugging Face Transformers on 2022-09-22 and contributed by [DepuMeng](https://huggingface.co/DepuMeng).*

# Conditional DETR

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[Conditional DETR](https://huggingface.co/papers/2108.06152) addresses slow training convergence in DETR by introducing a conditional cross-attention mechanism. This mechanism allows the decoder to learn a conditional spatial query, enabling each cross-attention head to focus on distinct regions such as object extremities or internal regions. This approach reduces reliance on high-quality content embeddings, simplifying training and achieving up to 10× faster convergence for stronger backbones. The original code can be found [here](https://github.com/Atten4Vis/ConditionalDETR).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/conditional_detr_curve.jpg"
alt="drawing" width="600"/>

<small> Conditional DETR shows much faster convergence compared to the original DETR. Taken from the <a href="https://huggingface.co/papers/2108.06152">original paper</a>.</small>

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="object-detection", model="microsoft/conditional-detr-resnet-50", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")
model = AutoModelForObjectDetection.from_pretrained("microsoft/conditional-detr-resnet-50", dtype="auto")

inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```

</hfoption>
</hfoptions>
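
To visualize the detections, the minimal sketch below reuses `image`, `results`, and `model` from the AutoModel example and draws the predicted boxes with Pillow. The color, line width, and output filename are illustrative choices.

```py
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    x_min, y_min, x_max, y_max = box.tolist()
    draw.rectangle((x_min, y_min, x_max, y_max), outline="red", width=3)
    draw.text((x_min, y_min), f"{model.config.id2label[label.item()]}: {score.item():.2f}", fill="red")

image.save("detections.png")
```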

## Resources

- Scripts for finetuning [`ConditionalDetrForObjectDetection`] with [`Trainer`] or [Accelerate](https://huggingface.co/docs/accelerate/index) can be found [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/object-detection).
- See also: [Object detection task guide](../tasks/object_detection).

## ConditionalDetrConfig

[[autodoc]] ConditionalDetrConfig

## ConditionalDetrImageProcessor

[[autodoc]] ConditionalDetrImageProcessor
    - preprocess
    - post_process_object_detection
    - post_process_instance_segmentation
    - post_process_semantic_segmentation
    - post_process_panoptic_segmentation

## ConditionalDetrImageProcessorFast

[[autodoc]] ConditionalDetrImageProcessorFast

## ConditionalDetrForSegmentation

[[autodoc]] ConditionalDetrForSegmentation
    - forward

*This model was released on 2020-08-06 and added to Hugging Face Transformers on 2021-01-27 and contributed by [abhishek](https://huggingface.co/abhishek).*

# ConvBERT

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://huggingface.co/papers/2008.02496) proposes a novel span-based dynamic convolution to enhance BERT by replacing some self-attention heads with convolution heads, forming a mixed attention block. This design improves efficiency in learning both global and local contexts. ConvBERT outperforms BERT and its variants in various tasks, achieving an 86.4 GLUE score with less training cost and fewer parameters.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="fill-mask", model="YituTech/conv-bert-base", dtype="auto")
pipeline("Plants create [MASK] through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("YituTech/conv-bert-base", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("YituTech/conv-bert-base")

inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt")
outputs = model(**inputs)
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_word = tokenizer.decode(outputs.logits[0, mask_position].argmax(dim=-1))
print(f"Predicted word: {predicted_word}")
```

</hfoption>
</hfoptions>

## Usage tips

ConvBERT training tips are similar to those of BERT. Refer to the [BERT documentation](bert) for usage tips. The original implementation is available at https://github.com/yitu-opensource/ConvBert.

## Resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Multiple choice task guide](../tasks/multiple_choice)

## ConvBertConfig

[[autodoc]] ConvBertConfig

## ConvBertForQuestionAnswering

[[autodoc]] ConvBertForQuestionAnswering
    - forward

*This model was released on 2022-01-10 and added to Hugging Face Transformers on 2022-02-07 and contributed by [nielsr](https://huggingface.co/nielsr).*

# ConvNeXT

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[ConvNeXT](https://huggingface.co/papers/2201.03545) reexamines the design spaces of ConvNets and explores the potential of pure ConvNet architectures inspired by Vision Transformers. By modernizing a standard ResNet, the model identifies key components that enhance performance. ConvNeXT achieves competitive accuracy and scalability, reaching 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while retaining the simplicity and efficiency of traditional ConvNets. The original code can be found [here](https://github.com/facebookresearch/ConvNeXt).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnext_architecture.jpg"
alt="drawing" width="600"/>

<small> ConvNeXT architecture. Taken from the <a href="https://huggingface.co/papers/2201.03545">original paper</a>.</small>

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="image-classification", model="facebook/convnext-tiny-224", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
model = AutoModelForImageClassification.from_pretrained("facebook/convnext-tiny-224", dtype="auto")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

</hfoption>
</hfoptions>

## Resources

- [`ConvNextForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)

## ConvNextConfig

[[autodoc]] ConvNextConfig

## ConvNextForImageClassification

[[autodoc]] ConvNextForImageClassification
    - forward

*This model was released on 2023-01-02 and added to Hugging Face Transformers on 2023-03-14 and contributed by [adirik](https://huggingface.co/adirik).*

# ConvNeXt V2

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[ConvNeXt V2](https://huggingface.co/papers/2301.00808) is a fully convolutional model inspired by Vision Transformers and built upon ConvNeXt. It integrates a novel Global Response Normalization (GRN) layer to enhance inter-channel feature competition and a fully convolutional masked autoencoder framework. This co-design improves performance on various recognition tasks, including ImageNet classification, COCO detection, and ADE20K segmentation. Pre-trained ConvNeXt V2 models range from an efficient 3.7M-parameter Atto model achieving 76.7% top-1 accuracy on ImageNet to a 650M Huge model with 88.9% accuracy using only public training data. The original code can be found [here](https://github.com/facebookresearch/ConvNeXt-V2).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnextv2_architecture.png"
alt="drawing" width="600"/>

<small> ConvNeXt V2 architecture. Taken from the <a href="https://huggingface.co/papers/2301.00808">original paper</a>.</small>

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="image-classification", model="facebook/convnextv2-tiny-1k-224", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-1k-224")
model = AutoModelForImageClassification.from_pretrained("facebook/convnextv2-tiny-1k-224", dtype="auto")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

</hfoption>
</hfoptions>

## Resources

- [`ConvNextV2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).

## ConvNextV2Config

[[autodoc]] ConvNextV2Config

## ConvNextV2ForImageClassification

[[autodoc]] ConvNextV2ForImageClassification
    - forward

*This model was released on 2020-12-01 and added to Hugging Face Transformers on 2021-04-10 and contributed by [canwenxu](https://huggingface.co/canwenxu).*

# CPM

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface.co/papers/2012.00413) is the largest Chinese pre-trained language model with 2.6 billion parameters and 100GB of Chinese training data. It facilitates various downstream NLP tasks including conversation, essay generation, cloze test, and language understanding. Extensive experiments show that CPM performs strongly in few-shot and zero-shot learning settings. Its architecture mirrors GPT-2, with the primary difference being the tokenization method.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="TsinghuaAI/CPM-Generate", dtype="auto")
pipeline("植物通过光合作用产生能量。")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("TsinghuaAI/CPM-Generate", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")

inputs = tokenizer("植物通过光合作用产生能量。", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

</hfoption>
</hfoptions>

CPM's architecture is the same as GPT-2 except for the tokenization method. Refer to the [GPT-2 documentation](gpt2) for API reference information. The original implementation is available at https://github.com/TsinghuaAI/CPM-Generate.

## CpmTokenizer

[[autodoc]] CpmTokenizer

## CpmTokenizerFast

[[autodoc]] CpmTokenizerFast

*This model was released on 2022-09-16 and added to Hugging Face Transformers on 2023-04-12 and contributed by [openbmb](https://huggingface.co/openbmb).*

# CPMAnt

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[CPM-Ant](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live) is developed from CPM-Live, an open-source framework for training and serving large language models. It supports distributed training across multiple GPUs and nodes with model, data, and pipeline parallelism, enabling efficient scaling to billions of parameters. The framework provides features like dynamic micro-batching, mixed precision training, and checkpointing for fault tolerance. It also includes APIs for interactive inference, making it practical for both research and real-world deployment of large Transformer-based models.

CPM-Ant itself is an open-source Chinese pre-trained language model (PLM) with 10B parameters and the first milestone of the live training process of CPM-Live. The training process is cost-effective and environment-friendly, and CPM-Ant achieves promising results with delta tuning on the CUGE benchmark. Besides the full model, various compressed versions are available for different hardware configurations. [See more](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live)

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="openbmb/cpm-ant-10b", dtype="auto")
pipeline("植物通过光合作用产生能量。")
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("openbmb/cpm-ant-10b", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("openbmb/cpm-ant-10b")

inputs = tokenizer("植物通过光合作用产生能量。", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

</hfoption>
</hfoptions>

## Resources

- A tutorial on [CPM-Live](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live).

## CpmAntConfig

[[autodoc]] CpmAntConfig

## CpmAntModel

[[autodoc]] CpmAntModel
    - all

## CpmAntForCausalLM

[[autodoc]] CpmAntForCausalLM
    - all

*This model was released on 2025-02-27 and added to Hugging Face Transformers on 2025-05-07 and contributed by [eustlb](https://huggingface.co/eustlb).*

# CSM

[CSM](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) is an end-to-end multimodal transformer system that generates contextually appropriate, high-fidelity speech by interleaving text and audio tokens. It operates directly on Residual Vector Quantization (RVQ) audio tokens and splits processing into two transformers: a large multimodal backbone that predicts the zeroth codebook and a lightweight audio decoder that handles the remaining codebooks for real-time generation. This structure allows CSM to capture conversational context while maintaining low latency. To train efficiently, it uses a compute amortization technique that trains the audio decoder on only a small random subset of frames, preserving quality while dramatically reducing memory and compute costs.

The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model [released by Sesame](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice). It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.

CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model [Mimi](./mimi), introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/csm_architecture.png"/>
</div>

The original csm-1b checkpoint is available under the [Sesame](https://huggingface.co/sesame/csm-1b) organization on Hugging Face. The original code can be found [here](https://github.com/SesameAILabs/csm).

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-to-audio", model="sesame/csm-1b", dtype="auto")
output = pipeline("Plants create energy through a process known as photosynthesis.")
audio = output["audio"]
```

</hfoption>
<hfoption id="CsmForConditionalGeneration">

```py
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("sesame/csm-1b")
model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b", dtype="auto")

# prepare the inputs
text = "[0]Plants generate energy through a process known as photosynthesis."  # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(model.device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Plants generate energy through a process known as photosynthesis."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(model.device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
```

</hfoption>
</hfoptions>

## Usage tips

### With conversational context

CSM generates speech given a conversation, allowing consistency in the voices and content-aware generation:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from accelerate import Accelerator
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = Accelerator().device

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(model.device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```

### Batched inference

CSM supports batched inference:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from accelerate import Accelerator
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = Accelerator().device

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# here a batch with two prompts
conversation = [
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
                {"type": "audio", "path": ds[0]["audio"]["array"]},
            ],
        },
        {
            "role": f"{ds[1]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[1]["text"]},
            ],
        },
    ],
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(model.device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
```

### Making the model go brrr

CSM supports full-graph compilation with CUDA graphs:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset

model_id = "sesame/csm-1b"
device = "cuda"

# set logs to ensure no recompilation and graph breaks
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# use static cache, enabling automatically torch compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250  # big enough to avoid recompilation
model.generation_config.max_new_tokens = None  # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"

# generation kwargs
gen_kwargs = {
    "do_sample": False,
    "depth_decoder_do_sample": False,
    "temperature": 1.0,
    "depth_decoder_temperature": 1.0,
}

# Define a timing context manager
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")

conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[1]["text"]},
            {"type": "audio", "path": ds[1]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
        ],
    },
]

padded_inputs_1 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(model.device)

print("\n" + "=" * 50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("=" * 50)

print("\n" + "=" * 50)
print("Second generation - fast!")
with TimerContext("Second generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("=" * 50)

# now with different inputs
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
            {"type": "audio", "path": ds[2]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[3]["text"]},
            {"type": "audio", "path": ds[3]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[4]["text"]},
        ],
    },
]
padded_inputs_2 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(model.device)

print("\n" + "=" * 50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
    _ = model.generate(**padded_inputs_2, **gen_kwargs)
print("=" * 50)
```

### Training

The CSM Transformers integration supports training:

```python
from transformers import CsmForConditionalGeneration, AutoProcessor
from accelerate import Accelerator
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = Accelerator().device

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model.train()
model.codec_model.eval()

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    output_labels=True,
).to(model.device)

out = model(**inputs)
out.loss.backward()
```

## CsmConfig

[[autodoc]] CsmConfig

## CsmProcessor

[[autodoc]] CsmProcessor
    - __call__
@ -13,52 +13,47 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
*This model was released on 2019-09-11 and added to Hugging Face Transformers on 2020-11-16.*
|
||||
*This model was released on 2019-09-11 and added to Hugging Face Transformers on 2020-11-16 and contributed by [keskarnitishr](https://huggingface.co/keskarnitishr).*
|
||||
|
||||
# CTRL
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
</div>
|
||||
[CTRL](https://huggingface.co/papers/1909.05858) is a 1.63 billion-parameter conditional transformer language model designed to generate text based on control codes. These codes guide the style, content, and task-specific behavior of the generated text, leveraging unsupervised learning while offering explicit control. The model can also predict the most likely data sources for a given sequence, enabling model-based source attribution.
|
||||
|
||||
## Overview
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://huggingface.co/papers/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
|
||||
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
|
||||
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
|
||||
```py
|
||||
import torch
|
||||
from transformers import pipeline
|
||||
|
||||
The abstract from the paper is the following:
|
||||
pipeline = pipeline(task="text-classification", model="salesforce/ctrl", dtype="auto")
|
||||
pipeline("Plants are amazing because they can create energy from the sun.")
|
||||
```

*Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution.*

</hfoption>
<hfoption id="AutoModel">

This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found [here](https://github.com/salesforce/ctrl).

```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("Salesforce/ctrl", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")

inputs = tokenizer("Plants are amazing because they can create energy from the sun.", return_tensors="pt")
outputs = model(**inputs)
predicted_class_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_class_id]
print(f"Predicted label: {label}")
```

</hfoption>
</hfoptions>
## Usage tips

- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences, or links to produce coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for more information, and see the generation sketch after this list.
- CTRL is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather than the left.
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text, as can be observed in the *run_generation.py* example script.
- The PyTorch models can take `past_key_values` as input, which is the previously computed key/value attention pairs. Using `past_key_values` prevents the model from re-computing values it has already computed during text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward) method for more information on this argument.
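
Here's a minimal generation sketch that puts these tips together. The control code, prompt, and sampling settings are illustrative choices, not values taken from the original paper.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
model = AutoModelForCausalLM.from_pretrained("Salesforce/ctrl", dtype="auto")

# Start the prompt with a control code ("Links" here) so CTRL knows which style to generate in.
inputs = tokenizer("Links Plants create energy from the sun", return_tensors="pt")

# use_cache=True reuses past_key_values internally instead of re-computing them at every step.
output_ids = model.generate(**inputs, max_new_tokens=40, repetition_penalty=1.2, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```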

## Resources

- [Text classification task guide](../tasks/sequence_classification)
- [Causal language modeling task guide](../tasks/language_modeling)
- CTRL uses control codes to generate text. Start generations with specific words, sentences, or links to generate coherent text. Check the [original implementation](https://github.com/salesforce/ctrl) for details.
- Pad inputs on the right. CTRL uses absolute position embeddings.
- PyTorch models accept `past_key_values` as input, the previously computed key/value attention pairs. Passing `past_key_values` avoids re-computing those values during text generation, as the sketch below shows. See the [`~CTRLModel.forward`] method for usage details.
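
For reference, this is what that cache reuse looks like in a raw forward pass. A minimal sketch using the causal LM head; the short prompt and greedy next-token choice are for illustration only.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
model = AutoModelForCausalLM.from_pretrained("Salesforce/ctrl", dtype="auto")

inputs = tokenizer("Wikipedia Solar energy", return_tensors="pt")

with torch.no_grad():
    # First pass returns past_key_values for the whole prompt.
    out = model(**inputs, use_cache=True)
    next_token = out.logits[:, -1:].argmax(dim=-1)
    # Second pass feeds only the new token plus the cached pairs instead of the full sequence.
    out = model(input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)

print(out.logits.shape)  # logits for the single new position
```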

## CTRLConfig

@ -83,3 +78,4 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis

[[autodoc]] CTRLForSequenceClassification
    - forward

@ -13,26 +13,11 @@ specific language governing permissions and limitations under the License.
rendered properly in your Markdown viewer.

-->
*This model was released on 2021-03-29 and added to Hugging Face Transformers on 2022-05-18.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>
*This model was released on 2021-03-29 and added to Hugging Face Transformers on 2022-05-18 and contributed by [anugunj](https://huggingface.co/anugunj).*

# Convolutional Vision Transformer (CvT)

[Convolutional Vision Transformer (CvT)](https://huggingface.co/papers/2103.15808) is a model that combines the strengths of convolutional neural networks (CNNs) and Vision Transformers for computer vision tasks. It introduces convolutional layers into the vision transformer architecture, allowing it to capture local patterns in images while maintaining the global context provided by self-attention mechanisms.

You can find all the CvT checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=cvt) organization.

> [!TIP]
> This model was contributed by [anugunj](https://huggingface.co/anugunj).
>
> Click on the CvT models in the right sidebar for more examples of how to apply CvT to different computer vision tasks.

The example below demonstrates how to classify an image with [`Pipeline`] or the [`AutoModel`] class.
[Convolutional vision Transformer (CvT)](https://huggingface.co/papers/2103.15808) enhances the Vision Transformer (ViT) through the integration of convolutions, combining the strengths of both architectures. Key modifications include a hierarchical Transformer with a convolutional token embedding and a convolutional Transformer block with a convolutional projection. These enhancements introduce CNN properties like shift, scale, and distortion invariance while retaining Transformer benefits such as dynamic attention and global context. CvT achieves state-of-the-art performance on ImageNet-1k with fewer parameters and lower FLOPs, even when pretrained on larger datasets like ImageNet-22k. Notably, positional encoding can be omitted in CvT, simplifying the design for high-resolution vision tasks.
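
To see the hierarchical design described above in concrete terms, inspect the per-stage settings in the model's config. A minimal sketch; the attribute names follow the current `CvtConfig` and should be treated as assumptions if your transformers version differs.

```py
from transformers import CvtConfig

config = CvtConfig.from_pretrained("microsoft/cvt-13")

# Each list has one entry per stage of the hierarchy.
print(config.depth)         # transformer blocks per stage, e.g. [1, 2, 10] for CvT-13
print(config.embed_dim)     # width of the convolutional token embedding per stage
print(config.patch_sizes)   # kernel size of the convolutional token embedding per stage
print(config.patch_stride)  # stride of the convolutional token embedding per stage
```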

<hfoptions id="usage">
<hfoption id="Pipeline">

@ -41,51 +26,37 @@ The example below demonstrates how to classify an image with [`Pipeline`] or the
import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-classification",
    model="microsoft/cvt-13",
    dtype=torch.float16,
    device=0
)
pipeline = pipeline(task="image-classification", model="microsoft/cvt-13", dtype="auto")
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```

</hfoption>
<hfoption id="AutoModel">

```py
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForImageClassification, AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/cvt-13",
    dtype=torch.float16,
    device_map="auto"
)
from transformers import AutoImageProcessor, AutoModelForImageClassification

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to(model.device)

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = AutoModelForImageClassification.from_pretrained("microsoft/cvt-13", dtype="auto")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()
    logits = model(**inputs).logits

class_labels = model.config.id2label
predicted_class_label = class_labels[predicted_class_id]
print(f"The predicted class label is: {predicted_class_label}")
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

</hfoption>
</hfoptions>

## Resources

Refer to this set of ViT [notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) for examples of inference and fine-tuning on custom datasets. Replace [`ViTFeatureExtractor`] and [`ViTForImageClassification`] in these notebooks with [`AutoImageProcessor`] and [`CvtForImageClassification`].
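
Here's a sketch of that swap when setting up fine-tuning. The three labels below are placeholders for your own dataset.

```py
from transformers import AutoImageProcessor, CvtForImageClassification

# Hypothetical label set; replace it with the classes from your dataset.
id2label = {0: "cat", 1: "dog", 2: "bird"}
label2id = {label: idx for idx, label in id2label.items()}

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained(
    "microsoft/cvt-13",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initializes the classification head for the new label count
)
# From here, follow the ViT notebooks: build a dataset that applies `image_processor`
# to each image and pass `model` to `Trainer` as usual.
```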

## CvtConfig

[[autodoc]] CvtConfig

@ -99,3 +70,4 @@ Refer to this set of ViT [notebooks](https://github.com/NielsRogge/Transformers-

[[autodoc]] CvtForImageClassification
    - forward

@ -15,7 +15,6 @@ limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.

-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-10-09.*

# Code World Model (CWM)