Compare commits


66 Commits

Author SHA1 Message Date
52289f4dff Merge branch 'main' into cleanup_tok_save 2025-10-16 11:36:34 +02:00
a59124e27e Add missing dates to docs (#41576)
add dates
2025-10-16 09:32:28 +00:00
81f97b17d2 Remove randomly added script (#41650)
remove
2025-10-16 11:23:53 +02:00
c0a5cf19ad Fix tokenization test (#41649)
fix
2025-10-16 11:14:20 +02:00
3ef6f2c415 Allow passing tp_plan in from_pretrained directly (#41435)
* start

* allow passing it

* fix plans

* fix

* fix

* style

* style

* fix

* add_test

* oupsi indent

* fix

* fix

* fix for CI without accelerator

* fix import
2025-10-16 11:12:07 +02:00
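For context, a minimal usage sketch of the feature above, assuming `tp_plan` is now accepted directly by `from_pretrained` as the title states (the checkpoint id and launch command are illustrative):

```python
# Run under torchrun so each rank loads its shard, e.g.:
#   torchrun --nproc-per-node 4 demo.py
from transformers import AutoModelForCausalLM

# "auto" requests the model's predefined tensor-parallel plan; per-module
# dicts may also be accepted per this PR, but that shape is an assumption.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # illustrative checkpoint
    tp_plan="auto",
)
```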
59efd86da2 Add aux loss for GLM-4.5V (#41564)
* add aux

* update

* update config to text_config

* use qwen data class to avoid repetition

* format

* update

* use 1e-4

* update

* update for remove init

* Apply style fixes

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
2025-10-16 09:04:21 +00:00
7b7d17f9bf 🚨 [v5] Toggle the serialization format in processors (#41474)
* toggle the serialization

* prob this fixes it

* fix tests

* typo

* delete legacy save entirely

* remove extra nesting in if

* revert test and serialize a public attr instead of private
2025-10-16 10:19:22 +02:00
e20df45bf6 Add Backbone API fine-tuning tutorial (#41590)
---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-10-15 18:42:32 +02:00
19df66dcba Update executorch.md (#41582)
* Update executorch.md

* Update executorch.md

* Update executorch.md

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-10-15 09:01:46 -07:00
9f71e3a604 [docs] Duplicate entry (#41591)
fix
2025-10-15 17:02:36 +02:00
bc9900562d Fix quantization base class (#41613)
* fix

* fix

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-10-15 16:58:17 +02:00
72fd67929b Remove deprecated code (#41616)
remove

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-10-15 16:57:52 +02:00
da382917aa Remove the head masking block in some vision models (#41620)
* old

* new

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-15 15:51:01 +02:00
313afcc468 [chat template] update when "push_to_hub" (#39815)
* update templates push to hub

* revert jinja suffix and move it to processor file
2025-10-15 13:49:59 +00:00
7bba4d1202 Fix video processing channel format (#41603)
fix
2025-10-15 15:48:01 +02:00
ab92534377 enable sdpa enable_gqa logic for Ascend NPU (#41601)
* enable gqa logic for Ascend NPU

* remove redundant comments

* fix comments about Ascend NPU

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
2025-10-15 13:45:28 +00:00
56a727dde5 Add fast path for bidirectional mask creation to fix regression (#41586)
* fixed performance regression

* also fixed the older_torch function

* Update src/transformers/masking_utils.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* fix

* more general

* fix slicing

* fix data dependent

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-10-15 15:30:39 +02:00
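The core idea of the fast path can be sketched as follows (an illustration, not the verbatim `masking_utils` code): a fully bidirectional pattern needs no per-position index grid, only a broadcast of the 2D padding mask.

```python
import torch

def bidirectional_mask_fast(padding_mask: torch.Tensor) -> torch.Tensor:
    # padding_mask: (batch, kv_len), True for real tokens. Every query may
    # attend to every non-padded key, so the (batch, 1, q_len, kv_len) mask
    # is just a broadcast of the padding mask over the query dimension.
    return padding_mask[:, None, None, :]
```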
dc6fdeb705 Update a dataset repo link (#41618)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-15 14:41:38 +02:00
3953b65440 Reinstate early CUDA init fix (#41617)
* Reinstate early CUDA init fix

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* Delay import further

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-15 14:41:10 +02:00
96d245a83d torch 2.9 don't ❤️ torchcodec 💔 (#41610)
pin

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-15 14:34:00 +02:00
bb0c3af995 More markdown file fixes (#41599)
* Format markdown files

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Format markdown files

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Format markdown files

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-15 12:29:27 +00:00
70e871959c Fix trainer simple tests (#41449)
* fix

* fix ray

* train to tune

* breaking changes wrt generation config

* Fix !

* fix

* fix

* fix deepspeed !

* fix

* fix

* fix

* improve logic

* revert and fix

* revert comment

* oups

* revert change

* fix

* style

* typo in comment

---------

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-10-15 14:09:00 +02:00
c4210796e0 Import expand_device_map instead of redefining it (#41608)
remove it
2025-10-15 14:00:09 +02:00
fcd1ccdb78 [Docs] Fix changed references (#41614)
* fix

* fix

* other ln
2025-10-15 13:59:13 +02:00
2b2c20f315 Update issue template (#41573)
* update

* fix
2025-10-15 13:54:37 +02:00
e2122c4bcb remove ray_scope and check_quantized_param (#41587)
remove
2025-10-15 13:10:35 +02:00
e89cef6625 fix some case failures caused by "torch.compile recompiled part of th… (#41558)
* fix some case failures caused by "`torch.compile` recompiled part of the forward pass" in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-10-15 10:45:29 +00:00
26b7f66850 Add logits_to_keep to many older CausalLM models (#41335)
* Add logits_to_keep to CausalLM models

* Skip failing test for git model

* Remove unused return_dict from kosmos2 signature

* Revert BlipForQuestionAnswering
2025-10-15 11:56:01 +02:00
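For context, `logits_to_keep` restricts the LM head to the last N positions. A hedged example (the checkpoint id is illustrative; whether a specific older model accepts the kwarg depends on this PR's coverage):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")  # illustrative
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
inputs = tok("The quick brown fox", return_tensors="pt")
# Compute logits only for the final position -- saves memory on long inputs.
out = model(**inputs, logits_to_keep=1)
print(out.logits.shape)  # expected: (batch, 1, vocab_size)
```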
5db730786d [device_map] Accelerate loading by computing device_map much faster (#41548)
* start

* add the important fix

* continue

* big cleanup

* type hints

* add method

* fix typehints

* typehints

* fix

* oupsi

* remove space

* improve function

* CI
2025-10-15 11:18:57 +02:00
13a35a5057 Enable non-streaming mode in transformers serve (#41446)
* Enable non-streaming in transformers serve

Remove typos

Remove typos

Remove typos

* Fix tests

* Arthur review
2025-10-15 09:37:26 +02:00
94df0e6560 Benchmark overhaul (#41408)
* Big refactor, still classes to move around and script to re-complexify

* Move to streamer, isolate benches, propagate num tokens

* Some refacto

* Added compile mode to name

* Re-order

* Move to dt_tokens

* Better format

* Fix and disable use_cache by default

* Fixed compile and SDPA backend default

* Refactor results format

* Added default compile mode

* Always use cache

* Fixed cache and added flex

* Plan for missing modules

* Experiments: no cg and shuffle

* Disable compile for FA

* Remove wall time, add sweep mode, get git commit

* Review compliance, start

* Apply suggestions from code review

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Update benchmark_v2/framework/benchmark_runner.py

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>

* Disable workflow

* Pretty print

* Added some pretty names to have pretty logs

* Review n2 compliance (end?)

* Style and end of PR

---------

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
2025-10-14 21:41:43 +02:00
9e4199ede3 Gemma3 fixes (#41572)
* Multiple device error fix

* FA2 equivalence fix

* Move the train fwd in cfg test

* Style

* Added comment

* Made the comment more clear
2025-10-14 18:33:27 +02:00
4c8d293599 Fix typesetting and content of llm_tutorial_optimization.md (#41172)
* Fix typesetting of llm_tutorial_optimization

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Fix errors

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-14 08:40:26 -07:00
a99b1be3c7 Revert some breaking changes bnb (#41581)
fix
2025-10-14 16:28:16 +02:00
82cae9eb52 Add __iter__ to DynamicCache (#41569)
* Add __iter__ to DynamicCache

* Fix tests that use ddp init
2025-10-14 16:16:32 +02:00
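A hedged sketch of what `__iter__` enables; that the yielded items unpack as legacy-style (key, value) pairs is an assumption based on the old tuple cache format:

```python
from transformers import DynamicCache

cache = DynamicCache()
# ... a forward pass with use_cache=True would populate it ...
for layer in cache:  # iterable thanks to this PR
    key_states, value_states = layer  # assumed per-layer (K, V) layout
```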
4fad35ee4a [VisionEncoderDecoderModel] Update loss function (#40863)
Update loss function
2025-10-14 16:03:00 +02:00
ae6f6cc3e0 Revert "add rmsnorm kernels support for Intel XPU" (#41579)
Revert "add rmsnorm kernels support for Intel XPU (#41563)"

This reverts commit fd787c5f6d667d3e00def70f588972af4437f631.
2025-10-14 15:49:33 +02:00
fd787c5f6d add rmsnorm kernels support for Intel XPU (#41563)
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
2025-10-14 13:26:09 +00:00
4e4f2af586 Add conditional checks to _check_and_adjust_attn_implementation() (#41542) 2025-10-14 13:00:07 +00:00
3648fde486 Add DINOv3Backbone for ConvNext variant (#40651)
---------

Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
2025-10-14 14:57:04 +02:00
abf5b57a68 delete some tokenizer tests using pickle (#41514)
* hate pickle

* hate pickle

* hate pickle

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-14 14:50:51 +02:00
8fe4db5399 [kernels] rm mra kernels (#41507)
* fix modeling

* remove kernel

* fix style
2025-10-14 13:34:04 +02:00
c620c38bb0 [Qwen3VLMoe] Fixed: Expected self.dtype to be equal to src.dtype - routing_weights casting (#41420)
* Fixed Expected self.dtype to be equal to src.dtype on eval

* Fixed Expected self.dtype to be equal to src.dtype on eval

* Fixed Expected self.dtype to be equal to src.dtype on eval

* generated modeling_qwen3_vl_moe.py file

* Fixed Ernie_4_5_MoE router casting

* Fixed routing_weights dtype casting (ernie4_5_moe, hunyuan_v1_moe, qwen2_moe, qwen3_moe, qwen3_next,qwen3_omni_moe)

* rollback hunyuan_v1_moe changes

---------

Co-authored-by: Daniel Oliveira <daniel-oliveira-11@hotmail.com>
Co-authored-by: Daniel Oliveira <36623265+daniel3303@users.noreply.github.com>
2025-10-14 13:14:49 +02:00
0798797ec9 Fix an import error with PreTrainedModel (#41571) 2025-10-14 13:13:37 +02:00
0566b6f5bd Patch MistralCommonTokenizer (#41439)
* Fix token_to_id and add add_generation_prompt

* Fix spm download

* Refactor spm

* Try another possibly non-gated spm

* Improve get_vocab

* lint

* Improve get_vocab

* Add warn to piece_to_id

* Improve from_pretrained raise and revert model spm

* Revert fast
2025-10-14 11:13:19 +00:00
b3e3c3dc93 [Qwen3VL] fix device mismatch error for FSDP2 training (#41536)
For FSDP2, parameters might be on a meta device, and the weight.device attribute may
not accurately reflect where the actual computation will happen during forward passes.

```log
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 776, in forward
    pos_embeds = self.fast_pos_embed_interpolate(grid_thw)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 745, in fast_pos_embed_interpolate
    pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
           ^^^^^^^
  File "torch/nn/modules/module.py", line 1827, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/nn/modules/sparse.py", line 192, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "torch/nn/functional.py", line 2546, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)
```
https://github.com/volcengine/verl/pull/3686#issuecomment-3380981817

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-14 10:28:25 +00:00
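The fix pattern, sketched below (not the verbatim patch): derive the target device from a tensor that actually participates in the computation instead of `weight.device`, which can be `meta` or stale under FSDP2.

```python
import torch

def make_pos_index(indices: list[int], reference: torch.Tensor) -> torch.Tensor:
    # Illustrative helper: placing the index tensor on reference.device keeps
    # F.embedding's index on the same device as the materialized weights.
    return torch.tensor(indices, device=reference.device, dtype=torch.long)
```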
b84c0b31c6 Remove references to AutoModelForVision2Seq (#41513)
* Since Vision2Seq is deprecated, remove it from pipelines and docstrings

* Catch some more references
2025-10-13 17:00:07 +01:00
1ee3b288a6 [from_pretrained] Small refactor from_pretrained: move around unrelated stuff (#41445)
* drafts

* up

* simplify modeling utils

* more simplifications

* type kwargs

* up

* move more accelerate related stuff

* safeguarding?

* nits

* remove func when func is NOPE

* more

* nits

* styling

* yups

* up

* ups

* revert

* protect trainer utils import

* fix doc

* Update src/transformers/integrations/peft.py

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>

* review

* update

* ?

* fixx

* update

* super small update

* ups

* style

* this is stupid

* 🤦 well this was the issue

* small nit

* fix

* nit

* damn the missing return

* one last stupid fix

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-10-13 16:33:32 +02:00
cad74496ca [model] Add VideoLLaMA3 implementation (#40499)
* Add VideoLLaMA3 implementation

* Run style fix

* Switch to modular

* Fix config and smart_resize

* Fix

* Fix

* Fix style

* Fix

* Ruff fix

* Rename

* Rename

* Fix

* Clean

* Fix consistency

* Add doc

* Fix

* Fix

* Fix doc

* Update generated code

* remove test_initialization

* fix tests

* simplify

* tests

* Add VideoLlama3IntegrationTest

* replace asserts

* fix tests

---------

Co-authored-by: steven-ccq <55176896+steven-ccq@users.noreply.github.com>
Co-authored-by: steven-ccq <1456320989@qq.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-10-13 15:54:34 +02:00
3813a8e3a1 Add VideoMAE video processor (#41534)
* Add video processor for VideoMAE

* Document VideoMAE video processor

* Add regression tests for VideoMAE video processor

* refactor: Use direct batch key access for pixel_values_videos

* test: add parity test for VideoMAEVideoProcessor vs VideoMAEImageProcessor

* docs(videomae): update model docstring example to demonstrate VideoMAEVideoProcessor (TorchCodec-based decoding and sampling)
2025-10-13 15:42:27 +02:00
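A hedged usage sketch of the new processor (the class name follows the PR description; the checkpoint id and output shape are assumptions):

```python
import torch
from transformers import VideoMAEVideoProcessor

processor = VideoMAEVideoProcessor.from_pretrained("MCG-NJU/videomae-base")
video = torch.randint(0, 256, (16, 3, 224, 224), dtype=torch.uint8)  # T, C, H, W
inputs = processor(videos=[video], return_tensors="pt")
print(inputs["pixel_values_videos"].shape)  # e.g. (1, 16, 3, 224, 224)
```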
66d8d7a077 Fixed typos and formatting (#34215)
#hacktoberfest
2025-10-13 13:38:06 +00:00
d621be8286 🚨 [v5] generate delegates default cache initialization to the model (#41505) 2025-10-13 13:20:48 +01:00
d7c9fbdb64 Enable modular files from other libraries (#41372)
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-10-13 13:48:32 +02:00
41e763decd Add AMD developer cloud support (#41126)
* Add AMD developer cloud support

* Add AMD remote svg link.

* Update notebooks/README.md

Co-authored-by: pagezyhf <165770107+pagezyhf@users.noreply.github.com>

---------

Co-authored-by: Rémi Ouazan <83456801+remi-or@users.noreply.github.com>
Co-authored-by: pagezyhf <165770107+pagezyhf@users.noreply.github.com>
2025-10-13 12:17:24 +02:00
cf1e9834ec Restore cuda graphs to continuous batching (#41421)
* Type hints and small fixes

* Remove unusued params

* Made slice inputs the default

* ruffed

* Updated some var name and moved index slicing

* Logging arg in example

* Added some padding debug var and reformat out cg

* First working CG, fixe size

* Working flexible CG

* CG are compatible with all implementations

* Fixed CG API

* Update example

* Documentation

* Fix padding tokens in FA

* Review compliance

* Better doc around weird bug

* Style

* Fix for sliding with CG
2025-10-13 11:57:56 +02:00
6c901bdc0e [SAM] Fix typing hints (#41506)
fix
2025-10-13 11:52:00 +02:00
58f9e13313 Fixed type hints in function definitions (#41525)
* Explicitly annotate default None parameters as Optional

* make style.

* make style.

* Fixed check_copies.

* fix consistency.
2025-10-13 11:48:37 +02:00
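The pattern this PR applies across the codebase, in one line: parameters that default to `None` get an explicit `Optional` annotation.

```python
from typing import Optional

# Before: def resize(size: int = None) -> None: ...
def resize(size: Optional[int] = None) -> None: ...
```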
eb28242251 Add MLlama fast image processor (#41391)
* Merge conflict

* add fast processor

* add fast processor

* make style

* add new convert rgb

* use nested group by shape in mllama fast, add support for multiple inputs in group by shape

* refactor after review

---------

Co-authored-by: Vincent <phamvinh257@gmail.com>
2025-10-13 09:16:05 +00:00
65cb8fac6d [Qwen3VL] fix: hidden_states in place modification error (#41535)
```
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 941, in forward
    hidden_states = self._deepstack_process(
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 960, in _deepstack_process
    hidden_states[visual_pos_masks, :] = local_this
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-13 10:50:14 +02:00
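A sketch of the usual fix for this autograd error (not necessarily the exact patch): clone before the masked in-place write, so the custom Function's output is no longer mutated as a view.

```python
import torch

def deepstack_merge(
    hidden_states: torch.Tensor, visual_pos_masks: torch.Tensor, local_this: torch.Tensor
) -> torch.Tensor:
    hidden_states = hidden_states.clone()  # break the view before writing
    hidden_states[visual_pos_masks, :] = local_this
    return hidden_states
```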
3927ffed31 [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation (#41373)
* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-10-10 22:27:01 +02:00
7164924a7e Fix LaTeX typesetting in documentation (#41177)
Fix LaTeX typesetting in documentation

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
2025-10-10 08:54:27 -07:00
26a5368c44 Allow optuna's catch kwargs passthrough (#41496)
* allow optuna's catch kwargs passthrough

* apply ruff formatting

---------

Co-authored-by: nicha <nicha.api@nectec.or.th>
2025-10-10 13:58:07 +00:00
feca4f3de7 remove tpu_num_cores (#41383)
* remove-tpu-num-cores

* fix

* let's remove it

* style

* Update examples/legacy/seq2seq/finetune_tpu.sh

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

---------

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
2025-10-10 15:53:28 +02:00
c6042a4169 Remove outdated flags (#41512)
remove flags
2025-10-10 14:34:47 +02:00
dfd4121cd4 add Trainer import to .md in appropriate cell block for training.ipynb transformers_doc (#41484)
add Trainer import to .md in appropriate cell block for docs
2025-10-10 12:04:07 +00:00
a06992cbc5 draft removal of special and added 2025-09-11 16:27:32 +02:00
365 changed files with 13563 additions and 6157 deletions

@@ -48,18 +48,17 @@ body:
- continuous batching: @remi-or @ArthurZucker @McPatate
- pipelines: @Rocketknight1
- tokenizers: @ArthurZucker and @itazap
- trainer: @zach-huggingface @SunMarc
- trainer: @SunMarc
- attention: @vasqu @ArthurZucker @CyrilVallez
- model loading (from pretrained, etc): @CyrilVallez
- distributed: @3outeille @ArthurZucker @S1ro1
- distributed: @3outeille @ArthurZucker
- CIs: @ydshieh
Integrations:
- deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
- ray/raytune: @richardliaw, @amogkam
- Big Model Inference: @SunMarc
- quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
- quantization: @SunMarc @MekkCyber
- kernels: @MekkCyber @drbh
- peft: @BenjaminBossan @githubnemo

@@ -51,18 +51,17 @@ Library:
- continuous batching: @remi-or @ArthurZucker @McPatate
- pipelines: @Rocketknight1
- tokenizers: @ArthurZucker and @itazap
- trainer: @zach-huggingface @SunMarc
- trainer: @SunMarc
- attention: @vasqu @ArthurZucker @CyrilVallez
- model loading (from pretrained, etc): @CyrilVallez
- distributed: @3outeille @ArthurZucker @S1ro1
- distributed: @3outeille @ArthurZucker
- CIs: @ydshieh
Integrations:
- deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
- ray/raytune: @richardliaw, @amogkam
- Big Model Inference: @SunMarc
- quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
- quantization: @SunMarc @MekkCyber
- kernels: @MekkCyber @drbh
- peft: @BenjaminBossan @githubnemo

@@ -1,10 +1,7 @@
name: Self-hosted runner (benchmark)
on:
push:
branches: [main]
pull_request:
types: [ opened, labeled, reopened, synchronize ]
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}

@@ -1,35 +1,7 @@
name: Benchmark v2 Framework
on:
workflow_call:
inputs:
runner:
description: 'GH Actions runner group to use'
required: true
type: string
container_image:
description: 'Docker image to use'
required: true
type: string
container_options:
description: 'Container options to use'
required: true
type: string
commit_sha:
description: 'Commit SHA to benchmark'
required: false
type: string
default: ''
run_id:
description: 'Custom run ID for organizing results (auto-generated if not provided)'
required: false
type: string
default: ''
benchmark_repo_id:
description: 'HuggingFace Dataset to upload results to (e.g., "org/benchmark-results")'
required: false
type: string
default: ''
workflow_dispatch:
env:
HF_HOME: /mnt/cache
@@ -82,4 +54,4 @@ jobs:
--token '${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}' \
--log-level INFO
env:
HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
HF_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}

@@ -1,11 +1,7 @@
name: Benchmark v2 Scheduled Runner - A10 Single-GPU
on:
schedule:
# Run daily at 16:30 UTC
- cron: "30 16 * * *"
pull_request:
types: [ opened, labeled, reopened, synchronize ]
workflow_dispatch:
jobs:
benchmark-v2-default:
@@ -18,4 +14,4 @@ jobs:
commit_sha: ${{ github.sha }}
run_id: ${{ github.run_id }}
benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
secrets: inherit
secrets: inherit

@@ -1,11 +1,7 @@
name: Benchmark v2 Scheduled Runner - MI325 Single-GPU
on:
schedule:
# Run daily at 16:30 UTC
- cron: "30 16 * * *"
pull_request:
types: [ opened, labeled, reopened, synchronize ]
workflow_dispatch:
jobs:
benchmark-v2-default:
@@ -18,4 +14,4 @@ jobs:
commit_sha: ${{ github.sha }}
run_id: ${{ github.run_id }}
benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
secrets: inherit
secrets: inherit

@@ -1 +1,2 @@
benchmark_results/
benchmark_results/
benchmark_results_profiles/

@@ -1 +0,0 @@
# Benchmark implementations directory

@@ -1,165 +0,0 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from typing import Any
import torch
from benchmark_framework import ModelBenchmark
os.environ["TOKENIZERS_PARALLELISM"] = "1"
torch.set_float32_matmul_precision("high")
class LLaMABenchmark(ModelBenchmark):
"""Simplified LLaMA model benchmark implementation using the ModelBenchmark base class."""
def __init__(self, logger: logging.Logger):
super().__init__(logger)
self._default_prompt = "Why dogs are so cute?" # Custom prompt for LLaMA
def get_scenario_configs(self) -> list[dict[str, Any]]:
"""
Get LLaMA-specific scenario configurations.
Returns:
List of scenario configuration dictionaries
"""
return [
# Eager variants
{"variant": "eager", "compile_mode": None, "use_cache": True, "description": "Eager execution with cache"},
# Compiled variants
{
"variant": "compiled",
"compile_mode": "max-autotune",
"use_cache": True,
"description": "Compiled with max autotune",
},
# Kernelized variant (if available)
{
"variant": "kernelized",
"compile_mode": "max-autotune",
"use_cache": True,
"description": "Kernelized execution",
},
]
def _is_kernelization_available(self) -> bool:
"""Check if kernelization is available for LLaMA."""
try:
from kernels import Mode, kernelize # noqa: F401
return True
except ImportError:
self.logger.debug("Kernelization not available: kernels module not found")
return False
def get_default_generation_config(self) -> dict[str, Any]:
"""Get LLaMA-specific generation configuration."""
return {
"do_sample": False,
"top_p": 1.0,
"temperature": 1.0,
"repetition_penalty": 1.0,
"max_new_tokens": None, # Will be set per scenario
}
def get_model_init_kwargs(self, config) -> dict[str, Any]:
"""Get LLaMA-specific model initialization kwargs."""
return {
"torch_dtype": getattr(torch, config.torch_dtype),
"attn_implementation": config.attn_implementation,
"use_cache": True,
}
def get_default_torch_dtype(self) -> str:
"""Get default torch dtype for LLaMA."""
return "float16" # LLaMA works well with float16
def get_default_device(self) -> str:
"""Get default device for LLaMA."""
return "cuda" # LLaMA prefers CUDA
def run_llama(logger, output_dir, **kwargs):
"""
Run LLaMA benchmark with the given configuration.
Args:
logger: Logger instance
output_dir: Output directory for results
**kwargs: Additional configuration options
Returns:
Path to output file if successful
"""
from benchmark_framework import BenchmarkRunner
# Extract parameters with defaults
model_id = kwargs.get("model_id", "meta-llama/Llama-2-7b-hf")
warmup_iterations = kwargs.get("warmup_iterations", 3)
measurement_iterations = kwargs.get("measurement_iterations", 5)
num_tokens_to_generate = kwargs.get("num_tokens_to_generate", 100)
include_sdpa_variants = kwargs.get("include_sdpa_variants", True)
device = kwargs.get("device", "cuda")
torch_dtype = kwargs.get("torch_dtype", "float16")
batch_size = kwargs.get("batch_size", 1)
commit_id = kwargs.get("commit_id")
logger.info(f"Starting LLaMA benchmark for model: {model_id}")
logger.info(
f"Configuration: warmup={warmup_iterations}, measurement={measurement_iterations}, tokens={num_tokens_to_generate}"
)
try:
# Create benchmark instance
benchmark = LLaMABenchmark(logger)
# Create scenarios
scenarios = benchmark.create_scenarios(
model_id=model_id,
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
num_tokens_to_generate=num_tokens_to_generate,
include_sdpa_variants=include_sdpa_variants,
device=device,
torch_dtype=torch_dtype,
batch_size=batch_size,
)
logger.info(f"Created {len(scenarios)} benchmark scenarios")
# Create runner and execute benchmarks
runner = BenchmarkRunner(logger, output_dir)
results = runner.run_benchmark(benchmark, scenarios, commit_id=commit_id)
if not results:
logger.warning("No successful benchmark results")
return None
# Save results
model_name = model_id.split("/")[-1] # Extract model name from ID
output_file = runner.save_results(model_name, results)
logger.info(f"LLaMA benchmark completed successfully. Results saved to: {output_file}")
return output_file
except Exception as e:
logger.error(f"LLaMA benchmark failed: {e}")
import traceback
logger.debug(traceback.format_exc())
raise

(File diff suppressed because it is too large.)

@@ -0,0 +1,218 @@
import hashlib
import json
import logging
from typing import Any, Optional
KERNELIZATION_AVAILABLE = False
try:
from kernels import Mode, kernelize # noqa: F401
KERNELIZATION_AVAILABLE = True
except ImportError:
pass
logger = logging.getLogger(__name__)
class BenchmarkConfig:
"""Configuration for a single benchmark scenario."""
def __init__(
self,
warmup_iterations: int = 5,
measurement_iterations: int = 20,
gpu_monitoring: bool = False, # False by default because it slows down the benchmark by a lot
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
attn_implementation: str = "eager",
sdpa_backend: Optional[str] = None,
compile_mode: Optional[str] = None,
compile_options: Optional[dict[str, Any]] = None,
kernelize: bool = False,
name: Optional[str] = None,
skip_validity_check: bool = False,
) -> None:
# Benchmark parameters
self.warmup_iterations = warmup_iterations
self.measurement_iterations = measurement_iterations
self.gpu_monitoring = gpu_monitoring
# Input parameters
self.batch_size = batch_size
self.sequence_length = sequence_length
self.num_tokens_to_generate = num_tokens_to_generate
# Generation parameters
self.attn_implementation = attn_implementation
self.sdpa_backend = sdpa_backend
# Optimization parameters
self.compile_mode = compile_mode
self.compile_options = compile_options if compile_options is not None else {}
self.kernelize = kernelize
# Constant parameters
self.dtype = "torch.bfloat16"
self.device = "cuda"
self.check_validity(skip_validity_check)
self.name = name if name is not None else self.infer_name()
def check_validity(self, skip_validity_check: bool = False) -> None:
if skip_validity_check:
return
# Flash attention does not support compile mode, so we turn it off # FIXME: it would be better to support it
is_fa = self.attn_implementation == "flash_attention_2"
is_fa |= self.attn_implementation == "sdpa" and self.sdpa_backend == "flash_attention"
if is_fa:
logger.warning("Flash attention does not support compile mode. Turning off compile mode.")
self.compile_mode = None
@property
def hash(self) -> str:
return hashlib.sha256(json.dumps(self.to_dict()).encode()).hexdigest()
def infer_name(self, compact: bool = True) -> str:
"""Infer a human-readable name for the benchmark config, either compact or verbose."""
if compact:
iter_str = f"w{self.warmup_iterations}_i{self.measurement_iterations}"
gpu_monitor_str = "monitored" if self.gpu_monitoring else "unmonitored"
dimensions_str = f"b{self.batch_size}_s{self.sequence_length}_n{self.num_tokens_to_generate}"
attn_code = self.attn_implementation
attn_code += f"_{self.sdpa_backend}" if self.attn_implementation == "sdpa" else ""
compile_str = f"compiled_{self.compile_mode}" if self.compile_mode is not None else "uncompiled"
kernelize_str = "kernelized" if self.kernelize else "unkernelized"
sep = "-"
else:
iter_str = f"{self.warmup_iterations} warmup, {self.measurement_iterations} iterations"
gpu_monitor_str = ("with" if self.gpu_monitoring else "no") + " GPU monitoring"
dimensions_str = f"batch size {self.batch_size}, sequence length {self.sequence_length}, {self.num_tokens_to_generate} generated tokens"
attn_code = f"{self.attn_implementation} attention"
attn_code += f" with {self.sdpa_backend} backend" if self.attn_implementation == "sdpa" else ""
compile_str = "compiled" if self.compile_mode is not None else "not compiled"
kernelize_str = "kernelized" if self.kernelize else "not kernelized"
sep = ", "
return sep.join([iter_str, gpu_monitor_str, dimensions_str, attn_code, compile_str, kernelize_str])
def to_dict(self) -> dict[str, Any]:
return {
"name": self.name,
"warmup_iterations": self.warmup_iterations,
"measurement_iterations": self.measurement_iterations,
"gpu_monitoring": self.gpu_monitoring,
"batch_size": self.batch_size,
"sequence_length": self.sequence_length,
"num_tokens_to_generate": self.num_tokens_to_generate,
"attn_implementation": self.attn_implementation,
"sdpa_backend": self.sdpa_backend,
"compile_mode": self.compile_mode,
"compile_options": self.compile_options,
"kernelize": self.kernelize,
}
@classmethod
def from_dict(cls, data: dict[str, Any], skip_validity_check: bool = False) -> "BenchmarkConfig":
return cls(
warmup_iterations=data.get("warmup_iterations", 5),
measurement_iterations=data.get("measurement_iterations", 20),
gpu_monitoring=data.get("gpu_monitoring", False),
batch_size=data.get("batch_size", 1),
sequence_length=data.get("sequence_length", 128),
num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
attn_implementation=data.get("attn_implementation", "eager"),
sdpa_backend=data.get("sdpa_backend"),
compile_mode=data.get("compile_mode"),
compile_options=data.get("compile_options"),
kernelize=data.get("kernelize", False),
name=data.get("name"),
skip_validity_check=skip_validity_check,
)
def cross_generate_configs(
attn_impl_and_sdpa_backend: list[tuple[str, Optional[str]]],
compiled_mode: list[Optional[str]],
kernelized: list[bool],
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = False, # this slows down the benchmark by a lot so we disable it by default
) -> list[BenchmarkConfig]:
# Create kwargs common to all configs
kwargs = {
"warmup_iterations": warmup_iterations,
"measurement_iterations": measurement_iterations,
"batch_size": batch_size,
"sequence_length": sequence_length,
"num_tokens_to_generate": num_tokens_to_generate,
"gpu_monitoring": gpu_monitoring,
}
# Cross-generate all combinations of attn_implementation, compiled_mode, and kernelized
configs = []
for attn_implementation, sdpa_backend in list(dict.fromkeys(attn_impl_and_sdpa_backend)):
for cm in list(dict.fromkeys(compiled_mode)):
for kernelize_on in list(dict.fromkeys(kernelized)):
config = BenchmarkConfig(
attn_implementation=attn_implementation,
sdpa_backend=sdpa_backend,
compile_mode=cm,
kernelize=kernelize_on,
**kwargs,
)
configs.append(config)
return configs
def generate_all_configs(
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = False,
) -> list[BenchmarkConfig]:
all_attn_implementations = [
("flash_attention_2", None),
("eager", None),
("sdpa", "math"),
("sdpa", "flash_attention"),
("flex_attention", None),
]
return cross_generate_configs(
attn_impl_and_sdpa_backend=all_attn_implementations,
compiled_mode=[None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"],
kernelized=[False, KERNELIZATION_AVAILABLE],
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
batch_size=batch_size,
sequence_length=sequence_length,
num_tokens_to_generate=num_tokens_to_generate,
gpu_monitoring=gpu_monitoring,
)
def generate_default_configs(
warmup_iterations: int = 5,
measurement_iterations: int = 20,
batch_size: int = 1,
sequence_length: int = 128,
num_tokens_to_generate: int = 128,
gpu_monitoring: bool = False,
) -> list[BenchmarkConfig]:
all_attn_implementations = [
("flash_attention_2", None),
("eager", None),
("sdpa", "math"),
("sdpa", "flash_attention"), # note: this one can fail with compile because of attn mask
]
return cross_generate_configs(
attn_impl_and_sdpa_backend=all_attn_implementations,
compiled_mode=[None, "max-autotune"],
kernelized=[False, KERNELIZATION_AVAILABLE],
warmup_iterations=warmup_iterations,
measurement_iterations=measurement_iterations,
batch_size=batch_size,
sequence_length=sequence_length,
num_tokens_to_generate=num_tokens_to_generate,
gpu_monitoring=gpu_monitoring,
)
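A minimal sketch of driving the helpers above, assuming the module is importable as `framework.benchmark_config` (the path used by `run_benchmarks.py` further down):

```python
from framework.benchmark_config import generate_default_configs

configs = generate_default_configs(warmup_iterations=2, measurement_iterations=5)
for cfg in configs:
    print(cfg.hash[:8], cfg.name)  # the hash deduplicates repeated scenarios
```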

@@ -0,0 +1,388 @@
import gc
import json
import logging
import os
import pathlib
import re
import time
from contextlib import nullcontext
from datetime import datetime
from queue import Queue
from typing import Any, Optional
import torch
from tqdm import trange
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
CompileConfig,
GenerationConfig,
GenerationMixin,
)
from transformers.generation.streamers import BaseStreamer
from .benchmark_config import BenchmarkConfig
from .data_classes import BenchmarkMetadata, BenchmarkResult, GPURawMetrics, pretty_print_dict
from .hardware_metrics import GPUMonitor
try:
from kernels import Mode, kernelize # noqa: F401
except ImportError:
kernelize = None
Mode = None
DEFAULT_PROMPT = "\n".join([
"The French Revolution was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799.",
"Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse.",
"It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.",
"Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614.",
"The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June.",
"The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.",
"The next three years were dominated by a struggle for political control.",
"King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792.",
"As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.",
"After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical Jacobins led by Maximilien Robespierre.",
"About 16,000 people were sentenced by the Revolutionary Tribunal and executed in the Reign of Terror, which ended in July 1794 with the Thermidorian Reaction.",
"Weakened by external threats and internal opposition, the Committee of Public Safety was replaced in November 1795 by the Directory.",
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
]) # fmt: skip
def compact_json_numeric_arrays(data: dict):
# Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
pattern = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"
def replace_numeric_array(match):
# Get the array content
content = match.group(1)
# Remove extra whitespace but keep commas
compact_content = re.sub(r"\s+", " ", content).strip()
return f"[{compact_content}]"
return re.sub(pattern, replace_numeric_array, json.dumps(data, indent=4, default=str), flags=re.DOTALL)
def get_git_revision() -> str:
base_path = pathlib.Path(__file__).parent.parent.parent
git_dir = base_path / ".git"
with (git_dir / "HEAD").open("r") as head:
ref = head.readline().split(" ")[-1].strip()
with (git_dir / ref).open("r") as git_hash:
return git_hash.readline().strip()
def get_sdpa_backend(backend_name: Optional[str]) -> Optional[torch.nn.attention.SDPBackend]:
"""Get the SDPA backend enum from string name."""
if backend_name is None:
return None
try:
backend_map = {
"math": torch.nn.attention.SDPBackend.MATH,
"flash_attention": torch.nn.attention.SDPBackend.FLASH_ATTENTION,
"efficient_attention": torch.nn.attention.SDPBackend.EFFICIENT_ATTENTION,
"cudnn_attention": torch.nn.attention.SDPBackend.CUDNN_ATTENTION,
}
return backend_map.get(backend_name.lower())
except AttributeError:
# torch.nn.attention.SDPBackend not available in older torch versions
return None
def flush_memory():
"""Flush GPU memory and run garbage collection."""
gc.collect()
# Dynamo resets
torch._dynamo.reset()
torch._dynamo.reset_code_caches()
if hasattr(torch._inductor, "codecache"):
# Clear FX graph cache
if hasattr(torch._inductor.codecache, "FxGraphCache"):
torch._inductor.codecache.FxGraphCache.clear()
# Clear PyCodeCache
if hasattr(torch._inductor.codecache, "PyCodeCache"):
torch._inductor.codecache.PyCodeCache.cache_clear()
# Clear TritonFuture cache (for async compilation)
if hasattr(torch._inductor.codecache, "TritonFuture"):
if hasattr(torch._inductor.codecache.TritonFuture, "_compile_cache"):
torch._inductor.codecache.TritonFuture._compile_cache.clear()
# Clear CUDA cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
gc.collect()
class BenchmarkStreamer(BaseStreamer):
def __init__(self, **kwargs) -> None:
self.timestamps = []
self.text_queue = Queue()
def put(self, value):
"""Receives tokens and logs the timestamp of the generation."""
self.timestamps.append(time.perf_counter())
def end(self):
self.timestamps.append(time.perf_counter())
def __iter__(self):
return self
def __next__(self):
value = self.text_queue.get(timeout=self.timeout)
if value == self.stop_signal:
raise StopIteration()
else:
return value
class BenchmarkRunner:
"""Main benchmark runner that coordinates benchmark execution."""
def __init__(
self, logger: logging.Logger, output_dir: str = "benchmark_results", commit_id: Optional[str] = None
) -> None:
# Those stay constant for the whole run
self.logger = logger
self.output_dir = output_dir
self.commit_id = get_git_revision() if commit_id is None else commit_id
os.makedirs(self.output_dir, exist_ok=True)
self.profile_dir = None
# Attributes that are reset for each model
self._setup_for = ""
# Attributes that are reset for each run
self.model: Optional[GenerationMixin] = None
def cleanup(self) -> None:
del self.model
self.model = None
flush_memory()
def setup_one_run(self, model_id: str, config: BenchmarkConfig) -> None:
# Some attributes only need to be set once per model
if self._setup_for != model_id:
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
# We set the EOS token to the padding token for open-ended generation
self.tokenizer.eos_token = self.tokenizer.pad_token
self._setup_for = model_id
# Prepare inputs
self.inputs = self.tokenizer(
[DEFAULT_PROMPT for _ in range(config.batch_size)],
return_tensors="pt",
max_length=config.sequence_length,
truncation=True,
return_attention_mask=True,
).to(config.device)
self.inputs["use_cache"] = True
# Prepare generation config
gen_config = GenerationConfig(
do_sample=False, top_p=1.0, temperature=1.0, max_new_tokens=config.num_tokens_to_generate
)
# Prepare compile config
if config.compile_mode is not None:
gen_config.compile_config = CompileConfig(mode=config.compile_mode, options=config.compile_options)
gen_config.cache_implementation = "static"
# Load model
self.logger.debug(f"Loading model {model_id} on device {config.device}...")
dtype = getattr(torch, config.dtype.removeprefix("torch."))
self.model = AutoModelForCausalLM.from_pretrained(
model_id, dtype=dtype, attn_implementation=config.attn_implementation, generation_config=gen_config
)
self.model = self.model.eval().to(config.device)
# Kernelize the model if needed
if config.kernelize:
self.model = kernelize(self.model, mode=Mode.INFERENCE)
def run_one_benchmark(self, model_id: str, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> None:
sdpa_ctx = nullcontext()
if config.attn_implementation == "sdpa":
sdpa_backend = get_sdpa_backend(config.sdpa_backend)
sdpa_ctx = torch.nn.attention.sdpa_kernel(sdpa_backend)
with sdpa_ctx, torch.no_grad():
self.logger.info(f"Running benchmark scenario: {config.name}")
# Quick validation: try one measurement first to see if this scenario works
flush_memory()
e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=1, gpu_monitor=None
)
if e2e_latency < 0:
self.logger.warning(f"Skipping config {config.name}: {e2e_latency = } (no GPU monitoring)")
return None
# Warmup runs
self.logger.info(f"Warming up with {config.warmup_iterations} iterations...")
for _ in trange(config.warmup_iterations):
_ = self.time_generate(max_new_tokens=config.num_tokens_to_generate)
self.logger.info("Warmup over.")
# Measurement runs
result = BenchmarkResult()
self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
for _ in trange(config.measurement_iterations):
e2e_latency, token_generation_times, decoded_output, gpu_metrics = self.time_generate(
max_new_tokens=config.num_tokens_to_generate,
gpu_monitor=(GPUMonitor(logger=self.logger) if config.gpu_monitoring else None),
)
result.accumulate(e2e_latency, token_generation_times, decoded_output, gpu_metrics)
self.logger.info("Benchmarking done. Cleaning up.")
# Profile if needed
if num_tokens_to_profile > 0:
self.profile_generate(num_tokens_to_profile, config.name)
return {
"metadata": BenchmarkMetadata(model_id=model_id, commit_id=self.commit_id),
"measurements": result,
"config": config,
}
def time_generate(
self,
max_new_tokens: int,
gpu_monitor: Optional[GPUMonitor] = None,
) -> tuple[float, list[float], str, Optional[GPURawMetrics]]:
"""Time the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
# Prepare gpu monitoring if needed
if gpu_monitor is not None:
gpu_monitor.start()
# Prepare streamer
streamer = BenchmarkStreamer()
# Generate and time
wall_time_0 = time.perf_counter()
outputs = self.model.generate(
**self.inputs,
max_new_tokens=max_new_tokens,
streamer=streamer,
)
wall_time_1 = time.perf_counter()
# Stop gpu monitoring if needed
gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None
# Check if generation had the right number of tokens
input_tokens = self.inputs["input_ids"].size(-1)
batch_size, output_tokens = outputs.shape
new_tokens = output_tokens - input_tokens
if new_tokens != max_new_tokens:
raise RuntimeError(f"Generated {new_tokens} tokens, expected {max_new_tokens}")
# Decode outputs
decoded_output = self.tokenizer.decode(outputs[0, input_tokens:], skip_special_tokens=True)
# Compute intermediate quantities
e2e_latency = wall_time_1 - wall_time_0
token_generation_times = [t - wall_time_0 for t in streamer.timestamps[1:]]
return e2e_latency, token_generation_times, decoded_output, gpu_metrics
def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
"""Profile the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
profiler = torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
record_shapes=True,
)
with profiler as prof:
_ = self.model.generate(
**self.inputs,
max_new_tokens=num_tokens_to_profile,
)
if self.profile_dir is None:
self.profile_dir = self.output_dir + "_profiles"
os.makedirs(self.profile_dir, exist_ok=True)
prof.export_chrome_trace(f"{self.profile_dir}/{config_name}.json")
def run_benchmarks(
self,
model_id: str,
benchmark_configs: list[BenchmarkConfig],
num_tokens_to_profile: int = 0,
pretty_print_summary: bool = True,
) -> dict[str, Any]:
all_results = {}
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
start_time = time.perf_counter()
n_configs = len(benchmark_configs)
for i, config in enumerate(benchmark_configs):
# Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
if config.attn_implementation == "sdpa" and config.sdpa_backend is None:
default_backend = "flash_attention" # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
self.logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
config.sdpa_backend = default_backend
# Skip if already run
if config.hash in all_results:
self.logger.info(f"Skipping duplicate config {config.name} for model {model_id} ({i + 1}/{n_configs})")
continue
# Otherwise, run the benchmark
self.setup_one_run(model_id, config)
self.logger.info(
f"Running benchmark of model {model_id} with scenario: {config.name} ({i + 1}/{n_configs})"
)
# Launch benchmark in a try/except block to avoid stopping the whole run if one benchmark fails
try:
results = self.run_one_benchmark(model_id, config, num_tokens_to_profile)
if results is not None:
all_results[config.hash] = results
except Exception as e:
self.logger.error(f"Error running with scenario: {config.name}:\n{repr(e)}")
# Cleanup model and save results
self.cleanup()
self.save_results(model_id, all_results, timestamp=timestamp)
if pretty_print_summary:
print()
print("=" * 100)
print(f"Finished benchmarks in {time.perf_counter() - start_time:.2f} seconds")
print(f"Total number of benchmarks: {len(all_results)}")
if len(all_results) > 0:
print("First run metadata:")
first_key = list(all_results.keys())[0]
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
for value in all_results.values():
print("=" * 100)
print(f"Config: {value['config'].infer_name(compact=False)}\n")
value["measurements"].pprint(tabs=1)
print("=" * 100)
return all_results
def save_results(self, model_name: str, results: dict, timestamp: str = "") -> str:
"""Save benchmark results to JSON file."""
# Create model-specific subdirectory
model_name = model_name.replace("/", "_")
model_dir = os.path.join(self.output_dir, model_name)
os.makedirs(model_dir, exist_ok=True)
# Create filename with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") if not timestamp else timestamp
filename = f"{model_name}_benchmark_{timestamp}.json"
filepath = os.path.join(model_dir, filename)
# Convert results to dict
converted_results = {}
for cfg_hash in results.keys():
converted_results[cfg_hash] = {
"metadata": results[cfg_hash]["metadata"].to_dict(),
"measurements": results[cfg_hash]["measurements"].to_dict(),
"config": results[cfg_hash]["config"].to_dict(),
}
# Save to JSON file
with open(filepath, "w") as f:
f.write(compact_json_numeric_arrays(converted_results))
self.logger.info(f"Results saved to {filepath}")
return filepath
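An end-to-end sketch based on the API above (the model id is illustrative; a CUDA GPU is required by the config defaults):

```python
import logging
from framework.benchmark_config import generate_default_configs
from framework.benchmark_runner import BenchmarkRunner

logger = logging.getLogger("benchmark")
runner = BenchmarkRunner(logger, output_dir="benchmark_results")
results = runner.run_benchmarks(
    model_id="meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
    benchmark_configs=generate_default_configs(measurement_iterations=5),
)
```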

@@ -0,0 +1,152 @@
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional, Union
import numpy as np
from .hardware_metrics import GPURawMetrics, HardwareInfo
def compute_basic_statistics(measurements: list[float]) -> dict[str, float]:
return {
"avg": np.mean(measurements),
"std": np.std(measurements),
"min": np.min(measurements),
"med": np.median(measurements),
"max": np.max(measurements),
"p95": np.percentile(measurements, 95),
}
def add_unit_to_duration(stats: dict[str, float]) -> dict[str, str]:
for key in list(stats.keys()):
value = stats[key]
if value > 3600:
stats[key] = f"{(value / 3600):.2f}hr"
elif value > 60:
stats[key] = f"{(value / 60):.2f}min"
elif value > 1:
stats[key] = f"{value:.2f}s"
elif value > 1e-3:
stats[key] = f"{(value * 1e3):.2f}ms"
elif value > 1e-6:
stats[key] = f"{(value * 1e6):.2f}us"
else:
stats[key] = f"{(value * 1e9):.2f}ns"
return stats
def equalize_lengths_and_collate(stats: list[dict[str, str]]) -> list[str]:
keys = ["avg", "std", "min", "med", "max", "p95"]
for key in keys:
max_length = max(len(stat[key]) for stat in stats)
for stat in stats:
stat[key] = stat[key].ljust(max_length, " ")
return [" ".join([f"{key}={stat[key]}" for key in keys]) for stat in stats]
def pretty_print_dict(data: dict[str, Any], tabs: int = 0) -> None:
max_key_length = max([len(key) for key in data.keys()])
for key, value in data.items():
tabs_str = " " * tabs
padded_key = key.ljust(max_key_length + 1, ".")
print(f"{tabs_str}{padded_key}: {value}")
@dataclass
class BenchmarkMetadata:
"""Metadata collected for each benchmark run."""
model_id: str
timestamp: str
commit_id: str
hardware_info: HardwareInfo
def __init__(self, model_id: str, commit_id: str):
self.model_id = model_id
self.timestamp = datetime.utcnow().isoformat()
self.commit_id = commit_id
self.hardware_info = HardwareInfo()
def to_dict(self) -> dict[str, Any]:
return {
"timestamp": self.timestamp,
"commit_id": self.commit_id,
"hardware_info": self.hardware_info.to_dict(),
}
class BenchmarkResult:
"""Result from a series of benchmark runs."""
def __init__(self) -> None:
self.e2e_latency = []
self.token_generation_times = [] # time at which each token was generated (relative to start of the generation)
self.decoded_outputs = []
self.gpu_metrics = []
def accumulate(
self,
e2e_latency: float,
token_generation_times: list[float],
decoded_output: str,
gpu_metrics: Optional[GPURawMetrics],
) -> None:
self.e2e_latency.append(e2e_latency)
self.token_generation_times.append(token_generation_times)
self.decoded_outputs.append(decoded_output)
self.gpu_metrics.append(gpu_metrics)
def to_dict(self) -> dict[str, Union[None, int, float]]:
# Save GPU metrics as None if it contains only None values
if all(gm is None for gm in self.gpu_metrics):
gpu_metrics = None
else:
gpu_metrics = [gm.to_dict() for gm in self.gpu_metrics]
return {
"e2e_latency": self.e2e_latency,
"token_generation_times": self.token_generation_times,
"decoded_outputs": self.decoded_outputs,
"gpu_metrics": gpu_metrics,
}
@classmethod
def from_dict(cls, data: dict[str, Union[None, int, float]]) -> "BenchmarkResult":
# Handle GPU metrics, which is saved as None if it contains only None values
if data["gpu_metrics"] is None:
gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]
else:
gpu_metrics = [GPURawMetrics.from_dict(gm) for gm in data["gpu_metrics"]]
# Create a new instance and accumulate the data
new_instance = cls()
for i in range(len(data["e2e_latency"])):
new_instance.accumulate(
e2e_latency=data["e2e_latency"][i],
token_generation_times=data["token_generation_times"][i],
decoded_output=data["decoded_outputs"][i],
gpu_metrics=gpu_metrics[i],
)
return new_instance
def get_measured_ttft(self) -> list[float]:
return [dt[0] for dt in self.token_generation_times if len(dt) > 0]
def get_measured_itl(self) -> list[float]:
return [(dt[-1] - dt[0]) / (len(dt) - 1) for dt in self.token_generation_times if len(dt) > 1]
def pprint(self, tabs: int = 0) -> None:
collated_stats = equalize_lengths_and_collate(
[
add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
add_unit_to_duration(compute_basic_statistics(self.get_measured_ttft())),
add_unit_to_duration(compute_basic_statistics(self.get_measured_itl())),
]
)
pretty_print_dict(
{
"E2E Latency": collated_stats[0],
"Time to First Token": collated_stats[1],
"Inter-Token Latency": collated_stats[2],
},
tabs=tabs,
)
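A quick check of the statistics helpers above on synthetic latencies, in seconds:

```python
from framework.data_classes import add_unit_to_duration, compute_basic_statistics

stats = add_unit_to_duration(compute_basic_statistics([0.51, 0.49, 0.52, 0.50]))
print(stats)  # e.g. {'avg': '505.00ms', 'std': ..., 'p95': ...}
```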

@@ -0,0 +1,172 @@
import json
import logging
import subprocess
import sys
import threading
import time
from dataclasses import dataclass
from enum import Enum
from logging import Logger
from typing import Optional, Union
import gpustat
import psutil
import torch
# Data class to hold the hardware information
def get_device_name_and_memory_total() -> tuple[str, float]:
"""Returns the name and memory total of GPU 0."""
device_name = torch.cuda.get_device_properties(0).name
device_memory_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
return device_name, device_memory_total
class HardwareInfo:
"""A class to hold information about the hardware."""
def __init__(self) -> None:
# Retrieve GPU stats
try:
self.gpu_name, self.gpu_memory_total_gb = get_device_name_and_memory_total()
except Exception:
self.gpu_name, self.gpu_memory_total_gb = None, None
# Retrieve python, torch and CUDA version
self.python_version = f"{sys.version.split()[0]}"
self.torch_version = torch.__version__
if hasattr(torch, "cuda") and torch.cuda.is_available():
self.cuda_version = torch.version.cuda
else:
self.cuda_version = None
# Retrieve general hardware information
self.cpu_count = psutil.cpu_count()
self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))
def to_dict(self) -> dict[str, Union[None, int, float, str]]:
return {
"gpu_name": self.gpu_name,
"gpu_memory_total_gb": self.gpu_memory_total_gb,
"python_version": self.python_version,
"torch_version": self.torch_version,
}
# Functions to get information about the GPU
def get_amd_gpu_stats() -> tuple[int, float]:
"""Returns the utilization and memory used of an AMD GPU, both in percent"""
rocm_smi_output = subprocess.check_output(["rocm-smi", "--json", "--showuse", "--showmeminfo", "VRAM"])
gpu_stats = json.loads(rocm_smi_output.decode("utf-8"))
gpu_stats = [
(card_id, stats["GPU use (%)"], stats["VRAM Total Used Memory (B)"]) for card_id, stats in gpu_stats.items()
]
gpu_stats.sort(key=lambda x: x[1], reverse=True)
return int(gpu_stats[0][1]), float(gpu_stats[0][2]) / 1024**3
def get_nvidia_gpu_stats() -> tuple[int, float]:
"""Returns the utilization and memory used of an NVIDIA GPU, both in percent"""
gpu_stats = gpustat.GPUStatCollection.new_query()
gpu_stats = gpu_stats[0]
return int(gpu_stats["utilization.gpu"]), float(gpu_stats["memory.used"]) / 1024**3
class GPUStatsCollector:
"""A class to get statistics about the GPU. It serves as a wrapper that holds the GPU total memory and its name,
which is used to call the right function to get the utilization and memory used."""
def __init__(self) -> None:
self.device_name, self.device_memory_total = get_device_name_and_memory_total()
# Monkey patch the get_utilization_and_memory_used method based on the GPU type
if "amd" in self.device_name.lower():
self.get_utilization_and_memory_used = get_amd_gpu_stats
elif "nvidia" in self.device_name.lower():
self.get_utilization_and_memory_used = get_nvidia_gpu_stats
else:
raise RuntimeError(f"Unsupported GPU: {self.device_name}")
    def get_utilization_and_memory_used(self) -> tuple[int, float]:
        """Get the utilization (in percent) and memory used (in GB) of the GPU."""
        raise NotImplementedError("This method is meant to be monkey patched during __init__")
# Simple data classes to hold the raw GPU metrics
class GPUMonitoringStatus(Enum):
"""Status of GPU monitoring."""
SUCCESS = "success"
FAILED = "failed"
NO_GPUS_AVAILABLE = "no_gpus_available"
NO_SAMPLES_COLLECTED = "no_samples_collected"
@dataclass
class GPURawMetrics:
"""Raw values for GPU utilization and memory used."""
utilization: list[float] # in percent
memory_used: list[float] # in GB
timestamps: list[float] # in seconds
timestamp_0: float # in seconds
monitoring_status: GPUMonitoringStatus
def to_dict(self) -> dict[str, Union[None, int, float, str]]:
return {
"utilization": self.utilization,
"memory_used": self.memory_used,
"timestamps": self.timestamps,
"timestamp_0": self.timestamp_0,
"monitoring_status": self.monitoring_status.value,
}
# Main class, used to monitor the GPU utilization during benchmark execution
class GPUMonitor:
"""Monitor GPU utilization during benchmark execution."""
def __init__(self, sample_interval_sec: float = 0.1, logger: Optional[Logger] = None):
self.sample_interval_sec = sample_interval_sec
self.logger = logger if logger is not None else logging.getLogger(__name__)
self.num_available_gpus = torch.cuda.device_count()
if self.num_available_gpus == 0:
raise RuntimeError("No GPUs detected by torch.cuda.device_count().")
self.gpu_stats_getter = GPUStatsCollector()
def start(self):
"""Start monitoring GPU metrics."""
# Clear the stop event to enable monitoring
self.stop_event = threading.Event()
self.gpu_utilization = []
self.gpu_memory_used = []
self.timestamps = []
self.thread = threading.Thread(target=self._monitor_loop)
self.thread.start()
self.logger.debug("GPU monitoring started")
def stop_and_collect(self) -> GPURawMetrics:
"""Stop monitoring and return collected metrics."""
self.stop_event.set()
self.thread.join()
if self.gpu_utilization:
timestamp_0 = self.timestamps[0]
metrics = GPURawMetrics(
utilization=self.gpu_utilization,
memory_used=self.gpu_memory_used,
timestamps=[t - timestamp_0 for t in self.timestamps],
timestamp_0=timestamp_0,
monitoring_status=GPUMonitoringStatus.SUCCESS,
)
self.logger.debug(f"GPU monitoring completed: {len(self.gpu_utilization)} samples collected")
else:
            # All fields of the dataclass are required, so fill the empty run explicitly
            metrics = GPURawMetrics(
                utilization=[],
                memory_used=[],
                timestamps=[],
                timestamp_0=0.0,
                monitoring_status=GPUMonitoringStatus.NO_SAMPLES_COLLECTED,
            )
return metrics
def _monitor_loop(self):
"""Background monitoring loop using threading.Event for communication."""
while not self.stop_event.is_set():
utilization, memory_used = self.gpu_stats_getter.get_utilization_and_memory_used()
self.gpu_utilization.append(utilization)
self.gpu_memory_used.append(memory_used)
self.timestamps.append(time.time())
if self.stop_event.wait(timeout=self.sample_interval_sec):
break
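# Illustrative usage sketch (assumed, not part of the file above):
#
#     monitor = GPUMonitor(sample_interval_sec=0.1)
#     monitor.start()
#     run_generation()  # hypothetical workload being measured
#     metrics = monitor.stop_and_collect()
#     print(metrics.to_dict())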

View File

@ -19,477 +19,93 @@ in the ./benches directory, organizing outputs into model-specific subfolders.
"""
import argparse
import importlib.util
import json
import logging
import os
import random
import sys
import uuid
from datetime import datetime
from pathlib import Path
from typing import Any, Optional
from framework.benchmark_config import BenchmarkConfig, generate_all_configs
from framework.benchmark_runner import BenchmarkRunner
if __name__ == "__main__":
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", type=str, default="benchmark_results", help="Output dir for benchmark results")
    parser.add_argument("--log-level", type=str, choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO")
    parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
    parser.add_argument("--warmup", type=int, default=5, help="Number of warmup iterations")
    parser.add_argument("--iterations", type=int, default=20, help="Number of measurement iterations")
    parser.add_argument("--batch-size", "-b", type=int, nargs="+", help="Batch size")
    parser.add_argument("--sequence-length", "-s", type=int, nargs="+", help="Sequence length")
    parser.add_argument("--num-tokens-to-generate", "-n", type=int, nargs="+", help="Number of tokens to generate")
    parser.add_argument("--num-tokens-to-profile", "-p", type=int, default=0, help="Number of tokens to profile")
    parser.add_argument("--commit-id", type=str, help="Git commit ID (if not provided, will auto-detect from git)")
    args = parser.parse_args()
    # Setup logging
    benchmark_run_uuid = str(uuid.uuid4())[:8]
    numeric_level = getattr(logging, args.log_level.upper())
    handlers = [logging.StreamHandler(sys.stdout)]
    logging.basicConfig(
        level=numeric_level, format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s", handlers=handlers
    )

def setup_logging(log_level: str = "INFO", enable_file_logging: bool = False) -> logging.Logger:
    """Setup logging configuration."""
    numeric_level = getattr(logging, log_level.upper(), None)
    if not isinstance(numeric_level, int):
        raise ValueError(f"Invalid log level: {log_level}")
    handlers = [logging.StreamHandler(sys.stdout)]
    if enable_file_logging:
        handlers.append(logging.FileHandler(f"benchmark_run_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"))
    logging.basicConfig(
        level=numeric_level, format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s", handlers=handlers
    )
    return logging.getLogger(__name__)
def discover_benchmarks(benches_dir: str) -> list[dict[str, Any]]:
"""
Discover all benchmark modules in the benches directory.
Returns:
List of dictionaries containing benchmark module info
"""
benchmarks = []
benches_path = Path(benches_dir)
if not benches_path.exists():
raise FileNotFoundError(f"Benches directory not found: {benches_dir}")
for py_file in benches_path.glob("*.py"):
if py_file.name.startswith("__"):
continue
module_name = py_file.stem
try:
# Import the module
spec = importlib.util.spec_from_file_location(module_name, py_file)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
# Check if it has a benchmark runner function
if hasattr(module, f"run_{module_name}"):
benchmarks.append(
{
"name": module_name,
"path": str(py_file),
"module": module,
"runner_function": getattr(module, f"run_{module_name}"),
}
)
elif hasattr(module, "run_benchmark"):
benchmarks.append(
{
"name": module_name,
"path": str(py_file),
"module": module,
"runner_function": getattr(module, "run_benchmark"),
}
)
else:
logging.warning(f"No runner function found in {py_file}")
except Exception as e:
logging.error(f"Failed to import {py_file}: {e}")
return benchmarks
def run_single_benchmark(
benchmark_info: dict[str, Any], output_dir: str, logger: logging.Logger, **kwargs
) -> Optional[str]:
"""
Run a single benchmark and return the output file path.
Args:
benchmark_info: Dictionary containing benchmark module info
output_dir: Base output directory
logger: Logger instance
**kwargs: Additional arguments to pass to the benchmark
Returns:
Path to the output file if successful, None otherwise
"""
benchmark_name = benchmark_info["name"]
runner_func = benchmark_info["runner_function"]
logger.info(f"Running benchmark: {benchmark_name}")
try:
# Check function signature to determine what arguments to pass
import inspect
sig = inspect.signature(runner_func)
# Prepare arguments based on function signature
func_kwargs = {"logger": logger, "output_dir": output_dir}
# Add other kwargs if the function accepts them
for param_name in sig.parameters:
if param_name in kwargs:
func_kwargs[param_name] = kwargs[param_name]
# Filter kwargs to only include parameters the function accepts
# If function has **kwargs, include all provided kwargs
has_var_kwargs = any(param.kind == param.VAR_KEYWORD for param in sig.parameters.values())
if has_var_kwargs:
valid_kwargs = {**func_kwargs, **kwargs}
else:
valid_kwargs = {k: v for k, v in func_kwargs.items() if k in sig.parameters}
# Run the benchmark
result = runner_func(**valid_kwargs)
if isinstance(result, str):
# Function returned a file path
return result
else:
logger.info(f"Benchmark {benchmark_name} completed successfully")
return "completed"
except Exception as e:
logger.error(f"Benchmark {benchmark_name} failed: {e}")
import traceback
logger.debug(traceback.format_exc())
return None
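# Illustrative sketch (assumed, not part of the file above) of the
# signature-filtering trick used in run_single_benchmark: forward only the
# kwargs a callable accepts, unless it takes **kwargs.
#
#     import inspect
#
#     def filter_kwargs(func, kwargs):
#         sig = inspect.signature(func)
#         if any(p.kind == p.VAR_KEYWORD for p in sig.parameters.values()):
#             return dict(kwargs)
#         return {k: v for k, v in kwargs.items() if k in sig.parameters}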
def generate_summary_report(
output_dir: str,
benchmark_results: dict[str, Any],
logger: logging.Logger,
benchmark_run_uuid: Optional[str] = None,
) -> str:
"""Generate a summary report of all benchmark runs."""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
summary_file = os.path.join(output_dir, f"benchmark_summary_{timestamp}.json")
summary_data = {
"run_metadata": {
"timestamp": datetime.utcnow().isoformat(),
"benchmark_run_uuid": benchmark_run_uuid,
"total_benchmarks": len(benchmark_results),
"successful_benchmarks": len([r for r in benchmark_results.values() if r is not None]),
"failed_benchmarks": len([r for r in benchmark_results.values() if r is None]),
},
"benchmark_results": benchmark_results,
"output_directory": output_dir,
}
with open(summary_file, "w") as f:
json.dump(summary_data, f, indent=2, default=str)
logger.info(f"Summary report saved to: {summary_file}")
return summary_file
def upload_results_to_hf_dataset(
output_dir: str,
summary_file: str,
dataset_name: str,
run_id: Optional[str] = None,
token: Optional[str] = None,
logger: Optional[logging.Logger] = None,
) -> Optional[str]:
"""
Upload benchmark results to a HuggingFace Dataset.
Based on upload_collated_report() from utils/collated_reports.py
Args:
output_dir: Local output directory containing results
summary_file: Path to the summary file
dataset_name: Name of the HuggingFace dataset to upload to
run_id: Unique run identifier (if None, will generate one)
token: HuggingFace token for authentication (if None, will use environment variables)
logger: Logger instance
Returns:
The run_id used for the upload, None if upload failed
"""
if logger is None:
logger = logging.getLogger(__name__)
import os
from huggingface_hub import HfApi
api = HfApi()
if run_id is None:
github_run_number = os.getenv("GITHUB_RUN_NUMBER")
github_run_id = os.getenv("GITHUB_RUN_ID")
if github_run_number and github_run_id:
run_id = f"{github_run_number}-{github_run_id}"
date_folder = datetime.now().strftime("%Y-%m-%d")
github_event_name = os.getenv("GITHUB_EVENT_NAME")
if github_event_name != "schedule":
# Non-scheduled runs go under a runs subfolder
repo_path = f"{date_folder}/runs/{run_id}/benchmark_results"
else:
# Scheduled runs go directly under the date
repo_path = f"{date_folder}/{run_id}/benchmark_results"
logger.info(f"Uploading benchmark results to dataset '{dataset_name}' at path '{repo_path}'")
try:
# Upload all files in the output directory
from pathlib import Path
output_path = Path(output_dir)
for file_path in output_path.rglob("*"):
if file_path.is_file():
# Calculate relative path from output_dir
relative_path = file_path.relative_to(output_path)
path_in_repo = f"{repo_path}/{relative_path}"
logger.debug(f"Uploading {file_path} to {path_in_repo}")
api.upload_file(
path_or_fileobj=str(file_path),
path_in_repo=path_in_repo,
repo_id=dataset_name,
repo_type="dataset",
token=token,
commit_message=f"Upload benchmark results for run {run_id}",
)
logger.info(
f"Successfully uploaded results to: https://huggingface.co/datasets/{dataset_name}/tree/main/{repo_path}"
)
return run_id
except Exception as upload_error:
logger.error(f"Failed to upload results: {upload_error}")
import traceback
logger.debug(traceback.format_exc())
return None
def main():
"""Main entry point for the benchmarking script."""
# Generate a unique UUID for this benchmark run
benchmark_run_uuid = str(uuid.uuid4())[:8]
parser = argparse.ArgumentParser(
description="Run all benchmarks in the ./benches directory",
epilog="""
Examples:
# Run all available benchmarks
python3 run_benchmarks.py
# Run with specific model and upload to HuggingFace Dataset
python3 run_benchmarks.py --model-id meta-llama/Llama-2-7b-hf --upload-to-hf username/benchmark-results
# Run with custom run ID and upload to HuggingFace Dataset
python3 run_benchmarks.py --run-id experiment_v1 --upload-to-hf org/benchmarks
# Run only specific benchmarks with file logging
python3 run_benchmarks.py --include llama --enable-file-logging
""", # noqa: W293
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument(
"--output-dir",
type=str,
default="benchmark_results",
help="Base output directory for benchmark results (default: benchmark_results)",
)
parser.add_argument(
"--benches-dir",
type=str,
default="./benches",
help="Directory containing benchmark implementations (default: ./benches)",
)
parser.add_argument(
"--log-level",
type=str,
choices=["DEBUG", "INFO", "WARNING", "ERROR"],
default="INFO",
help="Logging level (default: INFO)",
)
parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
parser.add_argument("--warmup-iterations", type=int, default=3, help="Number of warmup iterations (default: 3)")
parser.add_argument(
"--measurement-iterations", type=int, default=5, help="Number of measurement iterations (default: 5)"
)
parser.add_argument(
"--num-tokens-to-generate",
type=int,
default=100,
help="Number of tokens to generate in benchmarks (default: 100)",
)
parser.add_argument("--include", type=str, nargs="*", help="Only run benchmarks matching these names")
parser.add_argument("--exclude", type=str, nargs="*", help="Exclude benchmarks matching these names")
parser.add_argument("--enable-file-logging", action="store_true", help="Enable file logging (disabled by default)")
parser.add_argument(
"--commit-id", type=str, help="Git commit ID for metadata (if not provided, will auto-detect from git)"
)
parser.add_argument(
"--push-to-hub",
type=str,
help="Upload results to HuggingFace Dataset (provide dataset name, e.g., 'username/benchmark-results')",
)
parser.add_argument(
"--run-id", type=str, help="Custom run ID for organizing results (if not provided, will generate a unique ID)"
)
parser.add_argument(
"--token",
type=str,
help="HuggingFace token for dataset uploads (if not provided, will use HF_TOKEN environment variable)",
)
args = parser.parse_args()
# Setup logging
logger = setup_logging(args.log_level, args.enable_file_logging)
logger = logging.getLogger("benchmark_v2")
logger.info("Starting benchmark discovery and execution")
logger.info(f"Benchmark run UUID: {benchmark_run_uuid}")
logger.info(f"Output directory: {args.output_dir}")
logger.info(f"Benches directory: {args.benches_dir}")
# Create output directory
os.makedirs(args.output_dir, exist_ok=True)
    # Error out if one of the arguments is not provided
    if len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 0:
        raise ValueError(
            "At least one of the arguments --batch-size, --sequence-length, or --num-tokens-to-generate is required"
        )
    # If there is only one (batch_size, sequence_length, num_tokens_to_generate), we benchmark across configs
    elif len(args.batch_size) * len(args.sequence_length) * len(args.num_tokens_to_generate) == 1:
        benchmark_configs = generate_all_configs(
            warmup_iterations=args.warmup,
            measurement_iterations=args.iterations,
            batch_size=args.batch_size[0],
            sequence_length=args.sequence_length[0],
            num_tokens_to_generate=args.num_tokens_to_generate[0],
        )
        random.shuffle(benchmark_configs)
    # Otherwise, we benchmark across all combinations of dimensions
    else:
        kwargs = {
            "warmup_iterations": args.warmup,
            "measurement_iterations": args.iterations,
            "gpu_monitoring": False,
            "batch_size": args.batch_size[0],
            "sequence_length": args.sequence_length[0],
            "num_tokens_to_generate": args.num_tokens_to_generate[0],
            "attn_implementation": "flex_attention",
            "sdpa_backend": None,
            "compile_mode": "default",
            "kernelize": False,
        }
        benchmark_configs = []
        for num_tokens_to_generate in args.num_tokens_to_generate:
            for sequence_length in args.sequence_length:
                for batch_size in args.batch_size:
                    kwargs["batch_size"] = batch_size
                    kwargs["sequence_length"] = sequence_length
                    kwargs["num_tokens_to_generate"] = num_tokens_to_generate
                    benchmark_configs.append(BenchmarkConfig(**kwargs))
    try:
        # Discover benchmarks
        benchmarks = discover_benchmarks(args.benches_dir)
        logger.info(f"Discovered {len(benchmarks)} benchmark(s): {[b['name'] for b in benchmarks]}")
        if not benchmarks:
            logger.warning("No benchmarks found!")
            return 1
        # Filter benchmarks based on include/exclude
        filtered_benchmarks = benchmarks
        if args.include:
            filtered_benchmarks = [
                b for b in filtered_benchmarks if any(pattern in b["name"] for pattern in args.include)
            ]
            logger.info(f"Filtered to include: {[b['name'] for b in filtered_benchmarks]}")
        if args.exclude:
            filtered_benchmarks = [
                b for b in filtered_benchmarks if not any(pattern in b["name"] for pattern in args.exclude)
            ]
            logger.info(f"After exclusion: {[b['name'] for b in filtered_benchmarks]}")
        if not filtered_benchmarks:
            logger.warning("No benchmarks remaining after filtering!")
            return 1
        # Prepare common kwargs for benchmarks
        benchmark_kwargs = {
            "warmup_iterations": args.warmup_iterations,
            "measurement_iterations": args.measurement_iterations,
            "num_tokens_to_generate": args.num_tokens_to_generate,
        }
        if args.model_id:
            benchmark_kwargs["model_id"] = args.model_id
        # Add commit_id if provided
        if args.commit_id:
            benchmark_kwargs["commit_id"] = args.commit_id
# Run benchmarks
benchmark_results = {}
successful_count = 0
for benchmark_info in filtered_benchmarks:
result = run_single_benchmark(benchmark_info, args.output_dir, logger, **benchmark_kwargs)
benchmark_results[benchmark_info["name"]] = result
if result is not None:
successful_count += 1
# Generate summary report
summary_file = generate_summary_report(args.output_dir, benchmark_results, logger, benchmark_run_uuid)
# Upload results to HuggingFace Dataset if requested
upload_run_id = None
if args.push_to_hub:
logger.info("=" * 60)
logger.info("UPLOADING TO HUGGINGFACE DATASET")
logger.info("=" * 60)
# Use provided run_id or fallback to benchmark run UUID
effective_run_id = args.run_id or benchmark_run_uuid
upload_run_id = upload_results_to_hf_dataset(
output_dir=args.output_dir,
summary_file=summary_file,
dataset_name=args.push_to_hub,
run_id=effective_run_id,
token=args.token,
logger=logger,
)
if upload_run_id:
logger.info(f"Upload completed with run ID: {upload_run_id}")
else:
logger.warning("Upload failed - continuing with local results")
# Final summary
total_benchmarks = len(filtered_benchmarks)
failed_count = total_benchmarks - successful_count
logger.info("=" * 60)
logger.info("BENCHMARK RUN SUMMARY")
logger.info("=" * 60)
logger.info(f"Total benchmarks: {total_benchmarks}")
logger.info(f"Successful: {successful_count}")
logger.info(f"Failed: {failed_count}")
logger.info(f"Output directory: {args.output_dir}")
logger.info(f"Summary report: {summary_file}")
if args.push_to_hub:
if upload_run_id:
logger.info(f"HuggingFace Dataset: {args.push_to_hub}")
logger.info(f"Run ID: {upload_run_id}")
logger.info(
f"View results: https://huggingface.co/datasets/{args.push_to_hub}/tree/main/{datetime.now().strftime('%Y-%m-%d')}/runs/{upload_run_id}"
)
else:
logger.warning("Upload to HuggingFace Dataset failed")
if failed_count > 0:
logger.warning(f"{failed_count} benchmark(s) failed. Check logs for details.")
return 1
else:
logger.info("All benchmarks completed successfully!")
return 0
except Exception as e:
logger.error(f"Benchmark run failed: {e}")
import traceback
logger.debug(traceback.format_exc())
return 1
if __name__ == "__main__":
sys.exit(main())
runner = BenchmarkRunner(logger, args.output_dir, args.commit_id)
results = runner.run_benchmarks(
args.model_id,
benchmark_configs[:3],
args.num_tokens_to_profile,
pretty_print_summary=True,
)
# runner.save_results(args.model_id, results)

View File

@ -5,7 +5,7 @@ ARG REF=main
RUN apt-get update && apt-get install -y time git g++ pkg-config make git-lfs
ENV UV_PYTHON=/usr/local/bin/python
RUN pip install uv && uv pip install --no-cache-dir -U pip setuptools GitPython
RUN uv pip install --no-cache-dir --upgrade 'torch' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir --upgrade 'torch<2.9' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir pypi-kenlm
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[quality,testing,torch-speech,vision]"
RUN git lfs install

View File

@ -17,7 +17,7 @@ RUN make install -j 10
WORKDIR /
RUN uv pip install --no-cache --upgrade 'torch' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache --upgrade 'torch<2.9' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir --no-deps accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[ja,testing,sentencepiece,spacy,ftfy,rjieba]" unidic unidic-lite
# spacy is not used so not tested. Causes failures. TODO fix later

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git-lfs ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]" seqeval albumentations jiwer

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git libgl1 g++ tesseract-ocr git-lfs curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir --no-deps timm accelerate
RUN uv pip install -U --no-cache-dir pytesseract python-Levenshtein opencv-python nltk
# RUN uv pip install --no-cache-dir natten==0.15.1+torch210cpu -f https://shi-labs.com/natten/wheels

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git pkg-config openssh-client git ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]"

View File

@ -5,7 +5,7 @@ USER root
RUN apt-get update && apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git-lfs ffmpeg curl
ENV UV_PYTHON=/usr/local/bin/python
RUN pip --no-cache-dir install uv && uv pip install --no-cache-dir -U pip setuptools
RUN uv pip install --no-cache-dir 'torch' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir 'torch<2.9' 'torchaudio' 'torchvision' 'torchcodec' --index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu
RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing,tiktoken,num2words,video]"

View File

@ -284,6 +284,8 @@
title: Knowledge Distillation for Computer Vision
- local: tasks/keypoint_matching
title: Keypoint matching
- local: tasks/training_vision_backbone
title: Training vision models using Backbone API
title: Computer vision
- sections:
- local: tasks/image_captioning
@ -544,8 +546,6 @@
title: Helium
- local: model_doc/herbert
title: HerBERT
- local: model_doc/hgnet_v2
title: HGNet-V2
- local: model_doc/hunyuan_v1_dense
title: HunYuanDenseV1
- local: model_doc/hunyuan_v1_moe
@ -1188,6 +1188,8 @@
title: TVP
- local: model_doc/udop
title: UDOP
- local: model_doc/video_llama_3
title: VideoLlama3
- local: model_doc/video_llava
title: VideoLlava
- local: model_doc/vilt

View File

@ -41,13 +41,13 @@ $$
The query (`Q`), key (`K`), and value (`V`) matrices are projections from the input embeddings of shape `(b, h, T, d_head)`.
For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means \\( K_{\text{past}} \\) and \\( V_{\text{past}} \\) can be cached and reused to compute the last token's representation.
For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means $ K_{\text{past}} $ and $ V_{\text{past}} $ can be cached and reused to compute the last token's representation.
$$
\text{Attention}(q_t, [\underbrace{k_1, k_2, \dots, k_{t-1}}_{\text{cached}}, k_{t}], [\underbrace{v_1, v_2, \dots, v_{t-1}}_{\text{cached}}, v_{t}])
$$
At inference time, you only need the last token's query to compute the representation \\( x_t \\) that predicts the next token \\( t+1 \\). At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values.
At inference time, you only need the last token's query to compute the representation $ x_t $ that predicts the next token $ t+1 $. At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values.
$$
K_{\text{cache}} \leftarrow \text{concat}(K_{\text{past}}, k_t), \quad V_{\text{cache}} \leftarrow \text{concat}(V_{\text{past}}, v_t)
@ -59,7 +59,7 @@ Refer to the table below to compare how caching improves efficiency.
| without caching | with caching |
|---|---|
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V`
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` |
| attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |
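A minimal sketch of the concatenation step described above, using plain tensors rather than the actual `Cache` classes:

```python
import torch

# Keys/values have shape (batch, num_heads, seq_len, head_dim)
k_past = torch.randn(1, 8, 10, 64)
v_past = torch.randn(1, 8, 10, 64)
k_t = torch.randn(1, 8, 1, 64)  # key of the newly generated token
v_t = torch.randn(1, 8, 1, 64)  # value of the newly generated token

# K_cache <- concat(K_past, k_t), V_cache <- concat(V_past, v_t)
k_cache = torch.cat([k_past, k_t], dim=2)
v_cache = torch.cat([v_past, v_t], dim=2)
```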
## Cache class

View File

@ -16,44 +16,17 @@ rendered properly in your Markdown viewer.
# ExecuTorch
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) is a platform that enables PyTorch training and inference programs to be run on mobile and edge devices. It is powered by [torch.compile](https://pytorch.org/docs/stable/torch.compiler.html) and [torch.export](https://pytorch.org/docs/main/export.html) for performance and deployment.
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) runs PyTorch models on mobile and edge devices. Export your Transformers models to the ExecuTorch format with [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) with the command below.
You can use ExecuTorch with Transformers with [torch.export](https://pytorch.org/docs/main/export.html). The [`~transformers.convert_and_export_with_cache`] method converts a [`PreTrainedModel`] into an exportable module. Under the hood, it uses [torch.export](https://pytorch.org/docs/main/export.html) to export the model, ensuring compatibility with ExecuTorch.
```py
import torch
from transformers import LlamaForCausalLM, AutoTokenizer, GenerationConfig
from transformers.integrations.executorch import(
TorchExportableModuleWithStaticCache,
convert_and_export_with_cache
)
generation_config = GenerationConfig(
use_cache=True,
cache_implementation="static",
cache_config={
"batch_size": 1,
"max_cache_len": 20,
}
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B", pad_token="</s>", padding_side="right")
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto", dtype=torch.bfloat16, attn_implementation="sdpa", generation_config=generation_config)
exported_program = convert_and_export_with_cache(model)
```
The exported PyTorch model is now ready to be used with ExecuTorch. Wrap the model with [`~transformers.TorchExportableModuleWithStaticCache`] to generate text.
```py
prompts = ["Simply put, the theory of relativity states that "]
prompt_tokens = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
prompt_token_ids = prompt_tokens["input_ids"]
generated_ids = TorchExportableModuleWithStaticCache.generate(
exported_program=exported_program, prompt_token_ids=prompt_token_ids, max_new_tokens=20,
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
['Simply put, the theory of relativity states that 1) the speed of light is the']
```

```bash
optimum-cli export executorch \
--model "HuggingFaceTB/SmolLM2-135M-Instruct" \
--task "text-generation" \
--recipe "xnnpack" \
--use_custom_sdpa \
--use_custom_kv_cache \
--qlinear 8da4w \
--qembedding 8w \
--output_dir="hf_smollm2"
```
Run `optimum-cli export executorch --help` to see all export options. For detailed export instructions, check the [README](optimum/exporters/executorch/README.md).

View File

@ -36,8 +36,6 @@ Explore the [Hub](https://huggingface.com/) today to find a model and use Transf
Explore the [Models Timeline](./models_timeline) to discover the latest text, vision, audio and multimodal model architectures in Transformers.
## Features
Transformers provides everything you need for inference or training with state-of-the-art pretrained models. Some of the main features include:

View File

@ -364,6 +364,7 @@ This utility analyzes code similarities between model implementations to identif
When adding a new model to transformers, many components (attention layers, MLPs, outputs, etc.) may already exist in similar form in other models. Instead of implementing everything from scratch, model adders can identify which existing classes are similar and potentially reusable through modularization.
The tool computes two similarity scores:
- **Embedding score**: Uses semantic code embeddings (via `Qwen/Qwen3-Embedding-4B`) to detect functionally similar code even with different naming
- **Jaccard score**: Measures token set overlap to identify structurally similar code patterns
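As an illustrative sketch (not the tool's actual implementation), a Jaccard score over two code snippets' token sets can be computed like this:

```python
import re

def jaccard_score(code_a: str, code_b: str) -> float:
    """Token-set overlap |A ∩ B| / |A ∪ B| between two code snippets."""
    tokens_a = set(re.findall(r"\w+", code_a))
    tokens_b = set(re.findall(r"\w+", code_b))
    if not (tokens_a or tokens_b):
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```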

View File

@ -16,18 +16,18 @@ rendered properly in your Markdown viewer.
Large Language Models (LLMs) such as GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), and [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf) are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries.
Deploying these models in real-world tasks remains challenging, however:
- To exhibit near-human text understanding and generation capabilities, LLMs currently require to be composed of billions of parameters (see [Kaplan et al](https://huggingface.co/papers/2001.08361), [Wei et. al](https://huggingface.co/papers/2206.07682)). This consequently amplifies the memory demands for inference.
- In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.
- To exhibit near-human text understanding and generation capabilities, LLMs currently require to be composed of billions of parameters (see [Kaplan et al](https://huggingface.co/papers/2001.08361), [Wei et. al](https://huggingface.co/papers/2206.07682)). This consequently amplifies the memory demands for inference.
- In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.
The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.
In this guide, we will go over the effective techniques for efficient LLM deployment:
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance.
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance.
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query-Attention (GQA)](https://huggingface.co/papers/2305.13245).
3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query-Attention (GQA)](https://huggingface.co/papers/2305.13245).
Throughout this guide, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements.
@ -37,22 +37,22 @@ Memory requirements of LLMs can be best understood by seeing the LLM as a set of
At the time of writing this guide, LLMs consist of at least a couple billion parameters. Each parameter thereby is made of a decimal number, e.g. `4.5689` which is usually stored in either [float32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), or [float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) format. This allows us to easily compute the memory requirement to load the LLM into memory:
> *Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision*
> *Loading the weights of a model having X billion parameters requires roughly 4 \* X GB of VRAM in float32 precision*
Nowadays, models are however rarely trained in full float32 precision, but usually in bfloat16 precision or less frequently in float16 precision. Therefore the rule of thumb becomes:
> *Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision*
> *Loading the weights of a model having X billion parameters requires roughly 2 \* X GB of VRAM in bfloat16/float16 precision*
For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM.
To give some examples of how much VRAM it roughly takes to load a model in bfloat16:
- **GPT3** requires 2 \* 175 GB = **350 GB** VRAM
- [**Bloom**](https://huggingface.co/bigscience/bloom) requires 2 \* 176 GB = **352 GB** VRAM
- [**Llama-2-70b**](https://huggingface.co/meta-llama/Llama-2-70b-hf) requires 2 \* 70 GB = **140 GB** VRAM
- [**Falcon-40b**](https://huggingface.co/tiiuae/falcon-40b) requires 2 \* 40 GB = **80 GB** VRAM
- [**MPT-30b**](https://huggingface.co/mosaicml/mpt-30b) requires 2 \* 30 GB = **60 GB** VRAM
- [**bigcode/starcoder**](https://huggingface.co/bigcode/starcoder) requires 2 \* 15.5 = **31 GB** VRAM
- **GPT3** requires 2 \* 175 GB = **350 GB** VRAM
- [**Bloom**](https://huggingface.co/bigscience/bloom) requires 2 \* 176 GB = **352 GB** VRAM
- [**Llama-2-70b**](https://huggingface.co/meta-llama/Llama-2-70b-hf) requires 2 \* 70 GB = **140 GB** VRAM
- [**Falcon-40b**](https://huggingface.co/tiiuae/falcon-40b) requires 2 \* 40 GB = **80 GB** VRAM
- [**MPT-30b**](https://huggingface.co/mosaicml/mpt-30b) requires 2 \* 30 GB = **60 GB** VRAM
- [**bigcode/starcoder**](https://huggingface.co/bigcode/starcoder) requires 2 \* 15.5 = **31 GB** VRAM
As of writing this document, the largest GPU chip on the market is the A100 & H100 offering 80GB of VRAM. Most of the models listed before require more than 80GB just to be loaded and therefore necessarily require [tensor parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#tensor-parallelism) and/or [pipeline parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism).
@ -169,11 +169,11 @@ All that matters is that the next token *logit* distribution stays roughly the s
There are various quantization techniques, which we won't discuss in detail here, but in general, all quantization techniques work as follows:
- 1. Quantize all weights to the target precision
- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
- 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
- 1. Quantize all weights to the target precision
- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
- 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
In a nutshell, this means that *inputs-weight matrix* multiplications, with \\( X \\) being the *inputs*, \\( W \\) being a weight matrix and \\( Y \\) being the output:
In a nutshell, this means that *inputs-weight matrix* multiplications, with $X$ being the *inputs*, $W$ being a weight matrix and $Y$ being the output:
$$ Y = X * W $$
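As a toy sketch of this quantize/dequantize pattern (per-tensor int8 absmax quantization, not any particular library's kernel):

```python
import torch

# 1. Quantize the bfloat16 weights to int8 with an absmax scale
w = torch.randn(256, 256, dtype=torch.bfloat16)
scale = w.float().abs().max() / 127.0
w_int8 = torch.round(w.float() / scale).to(torch.int8)

# 2. + 3. At inference, dequantize back to bfloat16 and compute Y = X * W
x = torch.randn(4, 256, dtype=torch.bfloat16)
w_dequant = (w_int8.float() * scale).to(torch.bfloat16)
y = x @ w_dequant
```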
@ -271,7 +271,7 @@ Just 9.5GB! That's really not a lot for a >15 billion parameter model.
While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out.
Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to \\( \text{quantize} \\) and \\( \text{dequantize} \\) taking longer during inference.
Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to $\text{quantize}$ and $\text{dequantize}$ taking longer during inference.
```python
del model
@ -300,41 +300,41 @@ Next, let's look into how we can improve computational and memory efficiency by
Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers.
Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens.
However, the peak GPU memory consumption for self-attention layers grows *quadratically* both in compute and memory complexity with number of input tokens (also called *sequence length*) that we denote in the following by \\( N \\) .
However, the peak GPU memory consumption for self-attention layers grows *quadratically* both in compute and memory complexity with number of input tokens (also called *sequence length*) that we denote in the following by $N$ .
While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens).
Let's take a closer look. The formula to compute the output \\( \mathbf{O} \\) of a self-attention layer for an input \\( \mathbf{X} \\) of length \\( N \\) is:
Let's take a closer look. The formula to compute the output $\mathbf{O}$ of a self-attention layer for an input $\mathbf{X}$ of length $N$ is:
$$ \textbf{O} = \text{Attn}(\mathbf{X}) = \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \text{ with } \mathbf{Q} = \mathbf{W}_q \mathbf{X}, \mathbf{V} = \mathbf{W}_v \mathbf{X}, \mathbf{K} = \mathbf{W}_k \mathbf{X} $$
\\( \mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N}) \\) is thereby the input sequence to the attention layer. The projections \\( \mathbf{Q} \\) and \\( \mathbf{K} \\) will each consist of \\( N \\) vectors resulting in the \\( \mathbf{QK}^T \\) being of size \\( N^2 \\) .
$\mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N})$ is thereby the input sequence to the attention layer. The projections $\mathbf{Q}$ and $\mathbf{K}$ will each consist of $N$ vectors resulting in the $\mathbf{QK}^T$ being of size $N^2$ .
LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel.
Assuming, the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the \\( \mathbf{QK^T} \\) matrices to be \\( 40 * 2 * N^2 \\) bytes. For \\( N=1000 \\) only around 50 MB of VRAM are needed, however, for \\( N=16000 \\) we would need 19 GB of VRAM, and for \\( N=100,000 \\) we would need almost 1TB just to store the \\( \mathbf{QK}^T \\) matrices.
Assuming, the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the $\mathbf{QK^T}$ matrices to be $40 * 2 * N^2$ bytes. For $N=1000$ only around 50 MB of VRAM are needed, however, for $N=16000$ we would need 19 GB of VRAM, and for $N=100,000$ we would need almost 1TB just to store the $\mathbf{QK}^T$ matrices.
Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts.
As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths.
How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the \\( QK^T \\) matrix. [Tri Dao et al.](https://huggingface.co/papers/2205.14135) developed exactly such a new algorithm and called it **Flash Attention**.
How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the $\mathbf{QK}^T$ matrix. [Tri Dao et al.](https://huggingface.co/papers/2205.14135) developed exactly such a new algorithm and called it **Flash Attention**.
In a nutshell, Flash Attention breaks the \\(\mathbf{V} \times \text{Softmax}(\mathbf{QK}^T\\)) computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:
In a nutshell, Flash Attention breaks the $\mathbf{V} \times \text{Softmax}(\mathbf{QK}^T)$ computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:
$$ \textbf{O}_i \leftarrow s^a_{ij} * \textbf{O}_i + s^b_{ij} * \mathbf{V}_{j} \times \text{Softmax}(\mathbf{QK}^T_{i,j}) \text{ for multiple } i, j \text{ iterations} $$
with \\( s^a_{ij} \\) and \\( s^b_{ij} \\) being some softmax normalization statistics that need to be recomputed for every \\( i \\) and \\( j \\) .
with $s^a_{ij}$ and $s^b_{ij}$ being some softmax normalization statistics that need to be recomputed for every $i$ and $j$ .
Please note that the whole Flash Attention is a bit more complex and is greatly simplified here as going in too much depth is out of scope for this guide. The reader is invited to take a look at the well-written [Flash Attention paper](https://huggingface.co/papers/2205.14135) for more details.
The main takeaway here is:
> By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives **numerical identical** outputs compared to the default self-attention layer at a memory cost that only increases linearly with \\( N \\) .
> By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives **numerical identical** outputs compared to the default self-attention layer at a memory cost that only increases linearly with $N$ .
Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. Indeed Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see [paper](https://huggingface.co/papers/2205.14135) for more details if interested)
> However, Flash Attention is much faster in inference compared to default attention which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM).
Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast *on-chip* SRAM memory instead of having to access the slower VRAM memory to compute the output vector \\( \mathbf{O} \\) .
Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast *on-chip* SRAM memory instead of having to access the slower VRAM memory to compute the output vector $\mathbf{O}$ .
In practice, there is currently absolutely no reason to **not** use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.
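To make the chunked-softmax idea concrete, here is a heavily simplified, single-query sketch of online softmax attention (no batching, no scaling, and none of the real kernel's memory management):

```python
import torch

def chunked_attention(q, K, V, chunk_size=128):
    """Single-query attention computed chunk by chunk with running softmax stats."""
    running_max = torch.tensor(float("-inf"))
    normalizer = torch.tensor(0.0)
    out = torch.zeros(V.shape[1])
    for j in range(0, K.shape[0], chunk_size):
        s = K[j : j + chunk_size] @ q  # logits of this key chunk
        new_max = torch.maximum(running_max, s.max())
        alpha = torch.exp(running_max - new_max)  # rescales the stats so far
        p = torch.exp(s - new_max)
        normalizer = alpha * normalizer + p.sum()
        out = alpha * out + p @ V[j : j + chunk_size]
        running_max = new_max
    return out / normalizer

q, K, V = torch.randn(64), torch.randn(1000, 64), torch.randn(1000, 64)
reference = torch.softmax(K @ q, dim=0) @ V
assert torch.allclose(chunked_attention(q, K, V), reference, atol=1e-4)
```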
@ -342,74 +342,75 @@ In practice, there is currently absolutely no reason to **not** use Flash Attent
So far we have looked into improving computational and memory efficiency by:
- Casting the weights to a lower precision format
- Replacing the self-attention algorithm with a more memory- and compute efficient version
- Casting the weights to a lower precision format
- Replacing the self-attention algorithm with a more memory- and compute efficient version
Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for task that require long text inputs, *e.g.*:
- Retrieval augmented Questions Answering,
- Summarization,
- Chat
Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for tasks that require long text inputs, *e.g.*:
- Retrieval augmented Questions Answering,
- Summarization,
- Chat
Note that *chat* not only requires the LLM to handle long text inputs, but it also necessitates that the LLM is able to efficiently handle the back-and-forth dialogue between user and assistant (such as ChatGPT).
Once trained, the fundamental LLM architecture is difficult to change, so it is important to make considerations about the LLM's tasks beforehand and accordingly optimize the model's architecture.
There are two important components of the model architecture that quickly become memory and/or performance bottlenecks for large input sequences.
- The positional embeddings
- The key-value cache
- The positional embeddings
- The key-value cache
Let's go over each component in more detail
### 3.1 Improving positional embeddings of LLMs
Self-attention puts each token in relation to each other's tokens.
As an example, the \\( \text{Softmax}(\mathbf{QK}^T) \\) matrix of the text input sequence *"Hello", "I", "love", "you"* could look as follows:
As an example, the $\text{Softmax}(\mathbf{QK}^T)$ matrix of the text input sequence *"Hello", "I", "love", "you"* could look as follows:
![](/blog/assets/163_optimize_llm/self_attn_tokens.png)
Each word token is given a probability mass with which it attends to all other word tokens and is therefore put into relation with all of them. E.g. the word *"love"* attends to the word *"Hello"* with 5%, to *"I"* with 30%, and to itself with 65%.
An LLM based on self-attention, but without position embeddings, would have great difficulty understanding the positions of the text inputs relative to each other.
This is because the probability score computed by \\( \mathbf{QK}^T \\) relates each word token to each other word token in \\( O(1) \\) computations regardless of their relative positional distance to each other.
This is because the probability score computed by $\mathbf{QK}^T$ relates each word token to each other word token in $O(1)$ computations regardless of their relative positional distance to each other.
Therefore, for the LLM without position embeddings each token appears to have the same distance to all other tokens, *e.g.* differentiating between *"Hello I love you"* and *"You love I hello"* would be very challenging.
For the LLM to understand sentence order, an additional *cue* is needed and is usually applied in the form of *positional encodings* (or also called *positional embeddings*).
Positional encodings encode the position of each token into a numerical representation that the LLM can leverage to better understand sentence order.
The authors of the [*Attention Is All You Need*](https://huggingface.co/papers/1706.03762) paper introduced sinusoidal positional embeddings \\( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \\) .
where each vector \\( \mathbf{p}_i \\) is computed as a sinusoidal function of its position \\( i \\) .
The positional encodings are then simply added to the input sequence vectors \\( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N \\) = \\( \mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N \\) thereby cueing the model to better learn sentence order.
The authors of the [*Attention Is All You Need*](https://huggingface.co/papers/1706.03762) paper introduced sinusoidal positional embeddings $\mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N$ .
where each vector $\mathbf{p}_i$ is computed as a sinusoidal function of its position $i$ .
The positional encodings are then simply added to the input sequence vectors $\mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N$ = $\mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N$ thereby cueing the model to better learn sentence order.
Instead of using fixed position embeddings, others (such as [Devlin et al.](https://huggingface.co/papers/1810.04805)) used learned positional encodings for which the positional embeddings
\\( \mathbf{P} \\) are learned during training.
$\mathbf{P}$ are learned during training.
Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found:
1. Sinusoidal and learned position embeddings are both absolute positional embeddings, *i.e.* encoding a unique embedding for each position id: \\( 0, \ldots, N \\) . As shown by [Huang et al.](https://huggingface.co/papers/2009.13658) and [Su et al.](https://huggingface.co/papers/2104.09864), absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.
2. When using learned position embeddings, the LLM has to be trained on a fixed input length \\( N \\), which makes it difficult to extrapolate to an input length longer than what it was trained on.
1. Sinusoidal and learned position embeddings are both absolute positional embeddings, *i.e.* encoding a unique embedding for each position id: $0, \ldots, N$ . As shown by [Huang et al.](https://huggingface.co/papers/2009.13658) and [Su et al.](https://huggingface.co/papers/2104.09864), absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.
2. When using learned position embeddings, the LLM has to be trained on a fixed input length $N$, which makes it difficult to extrapolate to an input length longer than what it was trained on.
Recently, relative positional embeddings that can tackle the above mentioned problems have become more popular, most notably:
- [Rotary Position Embedding (RoPE)](https://huggingface.co/papers/2104.09864)
- [ALiBi](https://huggingface.co/papers/2108.12409)
- [Rotary Position Embedding (RoPE)](https://huggingface.co/papers/2104.09864)
- [ALiBi](https://huggingface.co/papers/2108.12409)
Both *RoPE* and *ALiBi* argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the \\( \mathbf{QK}^T \\) computation.
Both *RoPE* and *ALiBi* argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the $\mathbf{QK}^T$ computation.
Without going into too many details, *RoPE* notes that positional information can be encoded into query-key pairs, *e.g.* \\( \mathbf{q}_i \\) and \\( \mathbf{x}_j \\) by rotating each vector by an angle \\( \theta * i \\) and \\( \theta * j \\) respectively with \\( i, j \\) describing each vectors sentence position:
Without going into too many details, *RoPE* notes that positional information can be encoded into query-key pairs, *e.g.* $\mathbf{q}_i$ and $\mathbf{x}_j$ by rotating each vector by an angle $\theta * i$ and $\theta * j$ respectively with $i, j$ describing each vector's sentence position:
$$ \mathbf{\hat{q}}_i^T \mathbf{\hat{x}}_j = \mathbf{{q}}_i^T \mathbf{R}_{\theta, i -j} \mathbf{{x}}_j. $$
\\( \mathbf{R}_{\theta, i - j} \\) thereby represents a rotational matrix. \\( \theta \\) is *not* learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.
$\mathbf{R}_{\theta, i - j}$ thereby represents a rotational matrix. $\theta$ is *not* learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.
> By doing so, the probability score between \\( \mathbf{q}_i \\) and \\( \mathbf{q}_j \\) is only affected if \\( i \ne j \\) and solely depends on the relative distance \\( i - j \\) regardless of each vector's specific positions \\( i \\) and \\( j \\) .
> By doing so, the probability score between $\mathbf{q}_i$ and $\mathbf{q}_j$ is only affected if $i \ne j$ and solely depends on the relative distance $i - j$ regardless of each vector's specific positions $i$ and $j$ .
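A tiny two-dimensional sketch of this relative-distance property (a single rotation frequency, not a full RoPE implementation):

```python
import torch

def rotate(v, angle):
    """Rotate a 2D vector by `angle` radians."""
    c, s = torch.cos(angle), torch.sin(angle)
    rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return rot @ v

theta = torch.tensor(0.1)
q, k = torch.randn(2), torch.randn(2)

# The query-key score only depends on the relative distance i - j
score_a = rotate(q, theta * 3) @ rotate(k, theta * 1)   # i=3, j=1
score_b = rotate(q, theta * 10) @ rotate(k, theta * 8)  # i=10, j=8
assert torch.allclose(score_a, score_b, atol=1e-5)
```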
*RoPE* is used in multiple of today's most important LLMs, such as:
- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
- [**Llama**](https://huggingface.co/papers/2302.13971)
- [**PaLM**](https://huggingface.co/papers/2204.02311)
As an alternative, *ALiBi* proposes a much simpler relative position encoding scheme. The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value `m` to each query-key entry of the $\mathbf{QK}^T$ matrix right before the softmax computation.
![](/blog/assets/163_optimize_llm/alibi.png)
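A minimal sketch of that bias matrix follows; the slope `m` is illustrative (real models use a geometric sequence of slopes, one per head) and the causal mask is applied separately:

```python
import torch

def alibi_bias(seq_len, m=0.5):
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]  # j - i, negative for past tokens
    return m * distance.clamp(max=0)  # 0 on the diagonal, increasingly negative further back

scores = torch.randn(4, 4)  # toy QK^T / sqrt(d)
probs = torch.softmax(scores + alibi_bias(4), dim=-1)  # bias is added right before the softmax
```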
@ -417,19 +418,20 @@ As shown in the [ALiBi](https://huggingface.co/papers/2108.12409) paper, this si
*ALiBi* is used in several of today's most important LLMs, such as:
- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
- [**BLOOM**](https://huggingface.co/bigscience/bloom)
Both *RoPE* and *ALiBi* position encodings can extrapolate to input lengths not seen during training, though extrapolation has been shown to work much better out-of-the-box for *ALiBi* than for *RoPE*.
For ALiBi, one simply increases the values of the lower triangular position matrix to match the length of the input sequence.
For *RoPE*, keeping the same $\theta$ that was used during training leads to poor results when passing text inputs much longer than those seen during training, *cf.* [Press et al.](https://huggingface.co/papers/2108.12409). However, the community has found a couple of effective tricks that adapt $\theta$, thereby allowing *RoPE* position embeddings to work well for extrapolated text input sequences (see [here](https://github.com/huggingface/transformers/pull/24653)).
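One such trick is exposed directly in Transformers: a `rope_scaling` dictionary can be passed at load time to stretch the position embeddings beyond the training length. The checkpoint and factor below are illustrative:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    rope_scaling={"type": "dynamic", "factor": 2.0},  # roughly doubles the usable context
)
```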
> Both RoPE and ALiBi are relative positional embeddings that are *not* learned during training, but instead are based on the following intuitions:
- Positional cues about the text inputs should be given directly to the $\mathbf{QK}^T$ matrix of the self-attention layer.
- The LLM should be incentivized to learn a constant *relative* distance positional encoding.
- The further text input tokens are from each other, the lower their query-key probability. Both RoPE and ALiBi lower the query-key probability of tokens far away from each other: RoPE by increasing the angle between the query-key vectors, which decreases their vector product; ALiBi by adding large negative numbers to the vector product.
In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. Also note that even if an LLM with RoPE or ALiBi has been trained only on a fixed length of, say, $N_1 = 2048$, it can still be used in practice with text inputs much larger than $N_1$, like $N_2 = 8192 > N_1$, by extrapolating the positional embeddings.
### 3.2 The key-value cache
@ -468,7 +470,7 @@ As we can see every time we increase the text input tokens by the just sampled t
With very few exceptions, LLMs are trained using the [causal language modeling objective](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling) and therefore mask the upper triangle matrix of the attention score - this is why in the two diagrams above the attention scores are left blank (*a.k.a* have 0 probability). For a quick recap on causal language modeling you can refer to the [*Illustrated Self Attention blog*](https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention).
As a consequence, tokens *never* depend on later tokens; more specifically, the $\mathbf{q}_i$ vector is never put in relation with any key-value vectors $\mathbf{k}_j, \mathbf{v}_j$ if $j > i$. Instead, $\mathbf{q}_i$ only attends to previous key-value vectors $\mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots, i - 1\}$. In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.
In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass.
In Transformers, we can retrieve the key-value cache by passing the `use_cache` flag to the `forward` call and can then pass it with the current token.
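As a rough sketch of what happens under the hood (the checkpoint is illustrative, and `generate` performs this loop for you):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The key-value cache", return_tensors="pt").input_ids
past_key_values = None  # empty cache

for _ in range(5):
    outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values  # grows by one step per iteration
    input_ids = outputs.logits[:, -1:].argmax(dim=-1)  # only the new token is fed next
```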
@ -509,11 +511,12 @@ length of key-value cache 24
As one can see, when using the key-value cache the text input tokens are *not* increased in length, but remain a single input vector. The length of the key-value cache on the other hand is increased by one at every decoding step.
> Making use of the key-value cache means that the $\mathbf{QK}^T$ computation is essentially reduced to $\mathbf{q}_c\mathbf{K}^T$, with $\mathbf{q}_c$ being the query projection of the currently passed input token, which is *always* just a single vector.
Using the key-value cache has two advantages:
- Significant increase in computational efficiency, as fewer computations are performed compared to computing the full $\mathbf{QK}^T$ matrix. This leads to an increase in inference speed.
- The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly.
> One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation). We have an entire guide dedicated to caches [here](./kv_cache).
@ -535,10 +538,12 @@ Assistant: Germany has ca. 81 million inhabitants
```
In this chat, the LLM runs auto-regressive decoding twice:
1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many are in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"` (a sketch of this cache reuse follows the list below).
Two things should be noted here:
1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
2. The key-value cache is extremely useful for chat as it allows us to continuously grow the encoded chat history instead of having to re-encode the chat history again from scratch (as e.g. would be the case when using an encoder-decoder architecture).
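A hedged sketch of this cache-reuse pattern with the `generate` API; the checkpoint is illustrative and the exact kwargs may vary between versions, so refer to the dedicated cache guide for the current interface:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

turn_1 = tokenizer("User: How many people live in France?\nAssistant:", return_tensors="pt")
out_1 = model.generate(**turn_1, max_new_tokens=16, return_dict_in_generate=True)

# Second turn: pass the full conversation so far plus the cache, so only the
# new user message is re-encoded instead of the whole history.
turn_2 = tokenizer("\nUser: And how many are in Germany?\nAssistant:", return_tensors="pt")
input_ids = torch.cat([out_1.sequences, turn_2.input_ids], dim=-1)
out_2 = model.generate(input_ids, past_key_values=out_1.past_key_values,
                       max_new_tokens=16, return_dict_in_generate=True)
```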
@ -574,7 +579,7 @@ def bytes_to_megabytes(bytes):
Answer: The function takes a number of bytes as input and returns the number of
```
Great, no additional time is spent recomputing the same key and values for the attention layer! There is however one catch. While the required peak memory for the $\mathbf{QK}^T$ matrix is significantly reduced, holding the key-value cache in memory can become very memory expensive for long input sequences or multi-turn chat. Remember that the key-value cache needs to store the key-value vectors for all previous input vectors $\mathbf{x}_i \text{, for } i \in \{1, \ldots, c - 1\}$ for all self-attention layers and for all attention heads.
Let's compute the number of float values that need to be stored in the key-value cache for the LLM `bigcode/octocoder` that we used before.
The number of float values amounts to two times the sequence length times the number of attention heads times the attention head dimension and times the number of layers.
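A quick back-of-the-envelope computation of that product; the config values below are assumptions for a StarCoder-sized model rather than values read from the checkpoint:

```python
seq_len, n_layers, n_heads, head_dim = 16_000, 40, 48, 128  # assumed model dimensions

floats = 2 * seq_len * n_layers * n_heads * head_dim  # factor 2: keys and values
gigabytes = floats * 2 / 1024**3  # 2 bytes per value in float16
print(f"{floats:,} values ~ {gigabytes:.1f} GB")  # ~7.9B values ~ 14.6 GB
```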
@ -598,21 +603,21 @@ Researchers have proposed two methods that allow to significantly reduce the mem
[Multi-Query-Attention](https://huggingface.co/papers/1911.02150) was proposed in Noam Shazeer's *Fast Transformer Decoding: One Write-Head is All You Need* paper. As the title says, Noam found out that instead of using `n_head` key-value projection weights, one can use a single key-value projection weight pair that is shared across all attention heads without the model's performance significantly degrading.
> By using a single key-value projection weight pair, the key-value vectors $\mathbf{k}_i, \mathbf{v}_i$ have to be identical across all attention heads, which in turn means that we only need to store 1 key-value projection pair in the cache instead of `n_head` ones.
As most LLMs use between 20 and 100 attention heads, MQA significantly reduces the memory consumption of the key-value cache. For the LLM used in this notebook we could therefore reduce the required memory consumption from 15 GB to less than 400 MB at an input sequence length of 16000.
In addition to memory savings, MQA also leads to improved computational efficiency as explained in the following.
In auto-regressive decoding, large key-value vectors need to be reloaded and concatenated with the current key-value vector pair before being fed into the $\mathbf{q}_c\mathbf{K}^T$ computation at every step. For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors, less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look at [Noam's paper](https://huggingface.co/papers/1911.02150).
The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. The peak memory consumption of the model for a single forward pass without key-value cache stays unchanged as every attention head still has a unique query vector so that each attention head still has a different $\mathbf{QK}^T$ matrix.
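At the tensor level, the change is only in the number of key-value heads; a shape-level sketch with illustrative dimensions:

```python
import torch

batch, n_head, seq, head_dim = 1, 48, 1024, 128

q = torch.randn(batch, n_head, seq, head_dim)
k = torch.randn(batch, 1, seq, head_dim)  # a single shared key head
v = torch.randn(batch, 1, seq, head_dim)  # a single shared value head

scores = q @ k.transpose(-1, -2) / head_dim**0.5  # broadcasts over the head axis
out = torch.softmax(scores, dim=-1) @ v  # the cache stores k and v once, not 48 times
print(out.shape)  # torch.Size([1, 48, 1024, 128])
```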
MQA has seen wide adoption by the community and is now used by many of the most popular LLMs:
- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
- [**PaLM**](https://huggingface.co/papers/2204.02311)
- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
- [**BLOOM**](https://huggingface.co/bigscience/bloom)
Also, the checkpoint used in this notebook - `bigcode/octocoder` - makes use of MQA.

View File

@ -42,7 +42,3 @@ set this to `False`.
## Pushing to the Hub
[[autodoc]] utils.PushToHubMixin
## Sharded checkpoints
[[autodoc]] modeling_utils.load_sharded_checkpoint

View File

@ -100,22 +100,29 @@ for label, prob in zip(labels, probs[0]):
- [`AltCLIPProcessor`] combines [`CLIPImageProcessor`] and [`XLMRobertaTokenizer`] into a single instance to encode text and prepare images.
## AltCLIPConfig
[[autodoc]] AltCLIPConfig
## AltCLIPTextConfig
[[autodoc]] AltCLIPTextConfig
## AltCLIPVisionConfig
[[autodoc]] AltCLIPVisionConfig
## AltCLIPModel
[[autodoc]] AltCLIPModel
## AltCLIPTextModel
[[autodoc]] AltCLIPTextModel
## AltCLIPVisionModel
[[autodoc]] AltCLIPVisionModel
## AltCLIPProcessor
[[autodoc]] AltCLIPProcessor

View File

@ -12,11 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-10-09.*
# Code World Model (CWM)
@ -53,7 +52,8 @@ CWM requires a dedicated system prompt to function optimally during inference. W
configuration, CWM's output quality may be significantly degraded. The following serves as the default
system prompt for reasoning tasks. For agentic workflows, append the relevant tool specifications
after this base prompt. Check out the original code repository for more details.
```text
You are a helpful AI assistant. You always reason before responding, using the following format:
<think>
@ -110,6 +110,7 @@ generated_ids = model.generate(
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
print(tokenizer.decode(output_ids))
```
<details>
<summary>Produces the following output:</summary>

View File

@ -105,7 +105,7 @@ DETR can be naturally extended to perform panoptic segmentation (which unifies s
- The decoder of DETR updates the query embeddings in parallel. This is different from language models like GPT-2, which use autoregressive decoding instead of parallel. Hence, no causal attention mask is used.
- DETR adds position embeddings to the hidden states at each self-attention and cross-attention layer before projecting to queries and keys. For the position embeddings of the image, one can choose between fixed sinusoidal or learned absolute position embeddings. By default, the parameter `position_embedding_type` of [`~transformers.DetrConfig`] is set to `"sine"`.
- During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `auxiliary_loss` of [`~transformers.DetrConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
- If you want to train the model in a distributed environment across multiple nodes, then one should update the *num_boxes* variable in the *DetrLoss* class of *modeling_detr.py*. When training on multiple nodes, this should be set to the average number of target boxes across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232).
- [`~transformers.DetrForObjectDetection`] and [`~transformers.DetrForSegmentation`] can be initialized with any convolutional backbone available in the [timm library](https://github.com/rwightman/pytorch-image-models). Initializing with a MobileNet backbone for example can be done by setting the `backbone` attribute of [`~transformers.DetrConfig`] to `"tf_mobilenetv3_small_075"`, and then initializing the model with that config (see the sketch after this list).
- DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use [`~transformers.DetrImageProcessor`] to prepare images (and optional annotations in COCO format) for the model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding. Alternatively, one can also define a custom `collate_fn` in order to batch images together, using [`~transformers.DetrImageProcessor.pad_and_create_pixel_mask`].
- The size of the images will determine the amount of memory being used, and will thus determine the `batch_size`. It is advised to use a batch size of 2 per GPU. See [this Github thread](https://github.com/facebookresearch/detr/issues/150) for more info.
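A sketch of the backbone swap described above, assuming `use_timm_backbone` stays enabled; note the model here is randomly initialized from the config, not pretrained:

```python
from transformers import DetrConfig, DetrForObjectDetection

config = DetrConfig(backbone="tf_mobilenetv3_small_075", use_timm_backbone=True)
model = DetrForObjectDetection(config)
```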

View File

@ -178,3 +178,8 @@ print("Pooled output shape:", pooled_output.shape)
[[autodoc]] DINOv3ViTImageProcessorFast
- preprocess
## DINOv3ConvNextBackbone
[[autodoc]] DINOv3ConvNextBackbone
- forward

View File

@ -61,10 +61,10 @@ pipeline(image=image, question="What time is the coffee break?")
# pip install datasets
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForImageTextToText.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]
@ -92,11 +92,11 @@ The example below uses [torchao](../quantization/torchao) to only quantize the w
# pip install datasets torchao
import torch
from datasets import load_dataset
from transformers import TorchAoConfig, AutoProcessor, AutoModelForImageTextToText
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForImageTextToText.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", quantization_config=quantization_config)
dataset = load_dataset("hf-internal-testing/example-documents", split="test")
image = dataset[0]["image"]
@ -120,7 +120,7 @@ print(answer)
```py
>>> import re
>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
>>> from accelerate import Accelerator
>>> from datasets import load_dataset
>>> import torch
@ -162,9 +162,9 @@ from accelerate import Accelerator
```py
>>> import re
>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
>>> from accelerate import Accelerator
>>> from datasets import load_dataset
>>> import torch
>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

View File

@ -305,7 +305,6 @@ EdgeTAM can use masks from previous predictions as input to refine segmentation:
... )
```
## EdgeTamConfig
[[autodoc]] EdgeTamConfig

View File

@ -12,13 +12,11 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
*This model was released on 2025-01-13 and added to Hugging Face Transformers on 2025-09-29.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">

View File

@ -37,7 +37,6 @@ We evaluated GLM-4.6 across eight public benchmarks covering agents, reasoning,
For more eval results, show cases, and technical details, please visit our [technical blog](https://z.ai/blog/glm-4.6).
### GLM-4.5
The [**GLM-4.5**](https://huggingface.co/papers/2508.06471) series models are foundation models designed for intelligent agents; the MoE variants are documented here as Glm4Moe.

View File

@ -35,8 +35,8 @@ The abstract from the paper is the following:
*<INSERT PAPER ABSTRACT HERE>*
Tips:
- **Attention Sinks with Flex Attention**: When using flex attention, attention sinks require special handling. Unlike standard attention implementations, where sinks can be added directly to the attention scores, flex attention's `score_mod` function operates on individual score elements rather than the full attention matrix. Therefore, attention sink renormalization has to be applied after the flex attention computation by renormalizing the outputs using the log-sum-exp (LSE) values returned by flex attention.
<INSERT TIPS ABOUT MODEL HERE>

View File

@ -12,11 +12,10 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2025-10-07.*
# Lfm2Moe
@ -24,7 +23,7 @@ limitations under the License.
LFM2-MoE is a Mixture-of-Experts (MoE) variant of [LFM2](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38). The LFM2 family is optimized for on-device inference by combining short-range, input-aware gated convolutions with grouped-query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.
LFM2-MoE keeps this fast backbone and introduces sparse MoE feed-forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
## Example

View File

@ -67,7 +67,7 @@ processor = AutoProcessor.from_pretrained(model_id)
messages = [
[
{
"role": "user",
"role": "user",
"content": [
{"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
{"type": "text", "text": "What does the image show?"}
@ -113,6 +113,10 @@ print(processor.decode(output[0], skip_special_tokens=True))
[[autodoc]] MllamaImageProcessor
## MllamaImageProcessorFast
[[autodoc]] MllamaImageProcessorFast
## MllamaForConditionalGeneration
[[autodoc]] MllamaForConditionalGeneration

View File

@ -99,7 +99,6 @@ print(f"The predicted class label is:{predicted_class_label}")
[[autodoc]] MobileViTConfig
## MobileViTImageProcessor
[[autodoc]] MobileViTImageProcessor

View File

@ -39,12 +39,12 @@ import torch
from torchvision import io
from typing import Dict
from transformers.image_utils import load_images, load_video
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor
from transformers import AutoModelForImageTextToText, AutoTokenizer, AutoProcessor
from accelerate import Accelerator
device = Accelerator().device
model = AutoModelForVision2Seq.from_pretrained(
model = AutoModelForImageTextToText.from_pretrained(
"thisisiron/Ovis2-2B-hf",
dtype=torch.bfloat16,
).eval().to(device)

View File

@ -271,6 +271,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min
```
#### Prompt for audio output
If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
```python
@ -307,6 +308,7 @@ text_ids = model.generate(**inputs, return_audio=False)
```
#### Change voice type of output audio
Qwen2.5-Omni supports changing the voice of the output audio. Users can use the `spk` parameter of the `generate` function to specify the voice type. The `"Qwen/Qwen2.5-Omni-7B"` checkpoint supports two voice types, `Chelsie` and `Ethan`, where `Chelsie` is a female voice and `Ethan` is a male voice. By default, if `spk` is not specified, the voice type is `Chelsie`.
```python

View File

@ -31,6 +31,7 @@ Despite its ultra-efficiency, it outperforms Qwen3-32B on downstream tasks — w
Moreover, it delivers over **10x higher inference throughput** than Qwen3-32B when handling contexts longer than 32K tokens.
For more details, please visit our blog [Qwen3-Next](qwen3_next) ([blog post](https://qwenlm.github.io/blog/qwen3_next/)).
## Usage examples
```python

View File

@ -271,6 +271,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen3-Omni-30B-A3B-Instruct", mi
```
#### Prompt for audio output
If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
```json
@ -307,6 +308,7 @@ text_ids = model.generate(**inputs, return_audio=False)
```
#### Change voice type of output audio
Qwen3-Omni-MOE supports changing the voice of the output audio. Users can use the `spk` parameter of the `generate` function to specify the voice type. The `"Qwen/Qwen3-Omni-30B-A3B-Instruct"` checkpoint supports two voice types, `Chelsie` and `Ethan`, where `Chelsie` is a female voice and `Ethan` is a male voice. By default, if `spk` is not specified, the voice type is `Chelsie`.
```python

View File

@ -50,14 +50,14 @@ found [here](https://github.com/google/trax/tree/master/trax/models/reformer).
Axial Positional Encodings were first implemented in Google's [trax library](https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29)
and developed by the authors of this model's paper. In models that are treating very long input sequences, the
conventional position id encodings store an embedding vector of size $d$ (the `config.hidden_size`) for
every position $i, \ldots, n_s$, with $n_s$ being `config.max_embedding_size`. This means that having
a sequence length of $n_s = 2^{19} \approx 0.5M$ and a `config.hidden_size` of $d = 2^{10} \approx 1000$
would result in a position encoding matrix:
$$X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right]$$
which alone has over 500M parameters to store. Axial positional encodings factorize $X_{i,j}$ into two matrices:
$$X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]$$
@ -76,16 +76,16 @@ X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\
X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
\end{cases}$$
Intuitively, this means that a position embedding vector $x_j \in \mathbb{R}^{d}$ is now the composition of two
factorized embedding vectors: $x^1_{k, l} + x^2_{l, k}$, where the `config.max_embedding_size` dimension
$j$ is factorized into $k \text{ and } l$. This design ensures that each position embedding vector
$x_j$ is unique.
Using the above example again, axial position encoding with $d^1 = 2^9, d^2 = 2^9, n_s^1 = 2^9, n_s^2 = 2^{10}$
can drastically reduce the number of parameters from 500 000 000 to $2^{18} + 2^{19} \approx 780 000$ parameters, which means over 99% less memory usage.
In practice, the parameter `config.axial_pos_embds_dim` is set to a tuple $(d^1, d^2)$ whose sum has to be
equal to `config.hidden_size` and `config.axial_pos_shape` is set to a tuple $(n_s^1, n_s^2)$ whose
product has to be equal to `config.max_embedding_size`, which during training has to be equal to the *sequence
length* of the `input_ids`.
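A quick sanity check of the parameter counts from the example above:

```python
d, n_s = 2**10, 2**19          # hidden size and max embedding size
d1, d2 = 2**9, 2**9            # config.axial_pos_embds_dim, sums to d
n_s1, n_s2 = 2**9, 2**10       # config.axial_pos_shape, multiplies to n_s

full = d * n_s                 # conventional position embedding matrix
axial = d1 * n_s1 + d2 * n_s2  # the two factorized matrices
print(f"{full:,} -> {axial:,}")  # 536,870,912 -> 786,432
```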
@ -107,10 +107,10 @@ neighboring chunks and `config.lsh_num_chunks_after` following neighboring chunk
For more information, see the [original Paper](https://huggingface.co/papers/2001.04451) or this great [blog post](https://www.pragmatic.ml/reformer-deep-dive/).
Note that `config.num_buckets` can also be factorized into a list $(n_{\text{buckets}}^1,
n_{\text{buckets}}^2)$. This way, instead of assigning the query-key embedding vectors to one of $(1,\ldots,
n_{\text{buckets}})$, they are assigned to one of $(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots,
1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)$. This is crucial for very long sequences to
save memory.
When training a model from scratch, it is recommended to leave `config.num_buckets=None`, so that depending on the
@ -118,8 +118,8 @@ sequence length a good value for `num_buckets` is calculated on the fly. This va
saved in the config and should be reused for inference.
Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from
$\mathcal{O}(n_s \times n_s)$ to $\mathcal{O}(n_s \times \log(n_s))$, which usually represents the memory
and time bottleneck in a transformer model, with $n_s$ being the sequence length.
### Local Self Attention
@ -129,8 +129,8 @@ the key embedding vectors in its chunk and to the key embedding vectors of `conf
previous neighboring chunks and `config.local_num_chunks_after` following neighboring chunks.
Using Local self attention, the memory and time complexity of the query-key matmul operation can be reduced from
$\mathcal{O}(n_s \times n_s)$ to $\mathcal{O}(n_s \times \log(n_s))$, which usually represents the memory
and time bottleneck in a transformer model, with $n_s$ being the sequence length.
### Training

View File

@ -35,7 +35,7 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai
## Usage tips
- The implementation is the same as [Roberta](roberta) except instead of using *Add and Norm* it does *Norm and Add*. *Add* and *Norm* refer to the Addition and LayerNormalization described in [Attention Is All You Need](https://huggingface.co/papers/1706.03762).
- This is identical to using the `--encoder-normalize-before` flag in [fairseq](https://fairseq.readthedocs.io/).
## Resources

View File

@ -40,7 +40,8 @@ The model version was contributed by [rafaelpadilla](https://huggingface.co/rafa
## Usage tips
Initially, an image is processed using a pre-trained convolutional neural network, specifically a ResNet-D variant as referenced in the original code. This network extracts features from the final three layers of the architecture. Following this, a hybrid encoder is employed to convert the multi-scale features into a sequential array of image features. Then, a decoder equipped with auxiliary prediction heads is used to refine the object queries. This process facilitates the direct generation of bounding boxes, eliminating the need for any additional post-processing to acquire the logits and coordinates for the bounding boxes. The model is meant to be used on images resized to 640x640 with the corresponding ImageProcessor. Reshaping to other sizes will generally degrade performance.
```py
>>> import torch
>>> import requests

View File

@ -43,6 +43,7 @@ This second version of RT-DETR improves how the decoder finds objects in an imag
- **optimized processing** improves how attention weights mix information
The model is meant to be used on images resized to a size 640x640 with the corresponding ImageProcessor. Reshaping to other sizes will generally degrade performance.
```py
>>> import torch
>>> import requests

View File

@ -94,27 +94,27 @@ In a traditional auto-regressive Transformer, attention is written as
$$O = \hbox{softmax}(QK^{T} / \sqrt{d}) V$$
where $Q$, $K$ and $V$ are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension, but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product $QK^{T}$ then has shape `seq_len x seq_len` and we can take the matrix product with $V$ to get the output $O$ of the same shape as the others.
Replacing the softmax by its value gives:
$$O_{i} = \frac{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}} V_{j}}{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}}}$$
Note that the entries in $QK^{T}$ corresponding to $j > i$ are masked (the sum stops at j) because the attention is not allowed to look at future tokens (only past ones).
In comparison, the RWKV attention is given by
$$O_{i} = \sigma(R_{i}) \frac{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}} V_{j}}{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}}}$$
where $R$ is a new matrix called receptance by the author, $K$ and $V$ are still the key and value ($\sigma$ here is the sigmoid function). $W$ is a new vector that represents the position of the token and is given by
$$W_{0} = u \hbox{ and } W_{k} = (k-1)w \hbox{ for } k \geq 1$$
with $u$ and $w$ learnable parameters called in the code `time_first` and `time_decay` respectively. The numerator and denominator can both be expressed recursively. Naming them $N_{i}$ and $D_{i}$ we have:
$$N_{i} = e^{u + K_{i}} V_{i} + \hat{N}_{i} \hbox{ where } \hat{N}_{i} = e^{K_{i-1}} V_{i-1} + e^{w + K_{i-2}} V_{i-2} \cdots + e^{(i-2)w + K_{1}} V_{1}$$
so $\hat{N}_{i}$ (called `numerator_state` in the code) satisfies
$$\hat{N}_{0} = 0 \hbox{ and } \hat{N}_{j+1} = e^{K_{j}} V_{j} + e^{w} \hat{N}_{j}$$
@ -122,7 +122,7 @@ and
$$D_{i} = e^{u + K_{i}} + \hat{D}_{i} \hbox{ where } \hat{D}_{i} = e^{K_{i-1}} + e^{w + K_{i-2}} \cdots + e^{(i-2)w + K_{1}}$$
so $\hat{D}_{i}$ (called `denominator_state` in the code) satisfies
$$\hat{D}_{0} = 0 \hbox{ and } \hat{D}_{j+1} = e^{K_{j}} + e^{w} \hat{D}_{j}$$
@ -130,7 +130,7 @@ The actual recurrent formula used are a tiny bit more complex, as for numerical
$$\frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}} = \frac{e^{x_{i} - M}}{\sum_{j=1}^{n} e^{x_{j} - M}}$$
with $M$ the maximum of all $x_{j}$. So here on top of saving the numerator state ($\hat{N}$) and the denominator state ($\hat{D}$) we also keep track of the maximum of all terms encountered in the exponentials. So we actually use
$$\tilde{N}_{i} = e^{-M_{i}} \hat{N}_{i} \hbox{ and } \tilde{D}_{i} = e^{-M_{i}} \hat{D}_{i}$$
@ -142,7 +142,7 @@ and
$$\tilde{D}_{0} = 0 \hbox{ and } \tilde{D}_{j+1} = e^{K_{j} - q} + e^{w + M_{j} - q} \tilde{D}_{j} \hbox{ where } q = \max(K_{j}, w + M_{j})$$
and $M_{j+1} = q$. With those, we can then compute
$$N_{i} = e^{u + K_{i} - q} V_{i} + e^{M_{i} - q} \tilde{N}_{i} \hbox{ where } q = \max(u + K_{i}, M_{i})$$
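A sketch of one step of this numerically stable recurrence for a single channel; variable names mirror the formulas above, `u` and `w` correspond to `time_first` and `time_decay`, and the final attention output would additionally be multiplied by $\sigma(R_i)$:

```python
import torch

def rwkv_step(k_t, v_t, num_state, den_state, max_state, u, w):
    # Output for the current token, shifted by q for numerical stability
    q = torch.maximum(u + k_t, max_state)
    numerator = torch.exp(u + k_t - q) * v_t + torch.exp(max_state - q) * num_state
    denominator = torch.exp(u + k_t - q) + torch.exp(max_state - q) * den_state
    output = numerator / denominator

    # Update the carried states (numerator_state, denominator_state, running max)
    q = torch.maximum(k_t, w + max_state)
    num_state = torch.exp(k_t - q) * v_t + torch.exp(w + max_state - q) * num_state
    den_state = torch.exp(k_t - q) + torch.exp(w + max_state - q) * den_state
    return output, num_state, den_state, q
```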

View File

@ -55,7 +55,6 @@ found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT).
decoded using [`Wav2Vec2CTCTokenizer`].
- UniSpeechSat performs especially well on speaker verification, speaker identification, and speaker diarization tasks.
## Resources
- [Audio classification task guide](../tasks/audio_classification)

View File

@ -50,7 +50,6 @@ found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech).
- UniSpeech model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be
decoded using [`Wav2Vec2CTCTokenizer`].
## Resources
- [Audio classification task guide](../tasks/audio_classification)

View File

@ -0,0 +1,229 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
*This model was released on 2025-01-22 and added to Hugging Face Transformers on 2025-10-13.*
# VideoLLaMA3
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
</div>
## Overview
The [VideoLLaMA3](https://huggingface.co/papers/2501.13106) model is a major update to [VideoLLaMA2](https://huggingface.co/papers/2406.07476) from Alibaba DAMO Academy.
The abstract from the paper is the following:
*In this paper, we propose VideoLLaMA 3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of “vision-centric” is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale, high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) Vision Encoder Adaptation, which enables the vision encoder to accept images of variable resolutions
as input; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data. 3) Multi-task Fine-tuning, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding. 4) Video-centric Fine-tuning, which further improves the models capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into vision tokens with corresponding numbers, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos will be more precise and compact. Benefiting from vision-centric designs, VideoLLaMA3 achieves compelling performances in both image and video understanding benchmarks.*
<img src="https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/pipeline.jpg"
alt="drawing" width="600"/>
<small> VideoLLaMA3 architecture. Taken from the <a href="https://huggingface.co/papers/2501.13106">technical report.</a> </small>
This model was contributed by [lkhl](https://huggingface.co/lkhl).
## Usage example
### Single Media inference
The model can accept both images and videos as input. Here's an example code for inference.
```python
import torch
from transformers import VideoLlama3ForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in half-precision on the available device(s)
model = VideoLlama3ForConditionalGeneration.from_pretrained("lkhl/VideoLLaMA3-2B-Image-HF", device_map="auto")
processor = AutoProcessor.from_pretrained("lkhl/VideoLLaMA3-2B-Image-HF")
conversation = [
{
"role":"user",
"content":[
{"type": "image", "image": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/sora.png"},
{"type": "text", "text": "Describe this image."}
]
}
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
# Video
conversation = [
{
"role": "user",
"content": [
{"type": "video", "video": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/cat_and_chicken.mp4"},
{"type": "text", "text": "What happened in the video?"},
],
}
]
inputs = processor.apply_chat_template(
conversation,
fps=1,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```
### Batch Mixed Media Inference
The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.
```python
# Image
conversation1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/sora.png"},
{"type": "text", "text": "Describe this image."}
]
}
]
# Video
conversation2 = [
{
"role": "user",
"content": [
{"type": "video", "video": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/raw/refs/heads/main/assets/cat_and_chicken.mp4"},
{"type": "text", "text": "What happened in the video?"},
],
}
]
# Text
conversation3 = [
{
"role": "user",
"content": "What color is a banana?"
}
]
conversations = [conversation1, conversation2, conversation3]
# Preparation for batch inference
inputs = processor.apply_chat_template(
conversations,
fps=1,
add_generation_prompt=True,
tokenize=True,
padding=True,
padding_side="left",
return_dict=True,
return_tensors="pt"
).to(model.device)
# Batch Inference
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```
#### Flash-Attention 2 to speed up generation
First, make sure to install the latest version of Flash Attention 2:
```bash
pip install -U flash-attn --no-build-isolation
```
Also, you should have hardware that is compatible with FlashAttention-2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
```python
import torch
from transformers import VideoLlama3ForConditionalGeneration
model = VideoLlama3ForConditionalGeneration.from_pretrained(
"lkhl/VideoLLaMA3-2B-Image-HF",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
```
## VideoLlama3Config
[[autodoc]] VideoLlama3Config
## VideoLlama3VisionConfig
[[autodoc]] VideoLlama3VisionConfig
## VideoLlama3ImageProcessor
[[autodoc]] VideoLlama3ImageProcessor
- preprocess
## VideoLlama3VideoProcessor
[[autodoc]] VideoLlama3VideoProcessor
- preprocess
## VideoLlama3ImageProcessorFast
[[autodoc]] VideoLlama3ImageProcessorFast
- preprocess
## VideoLlama3Processor
[[autodoc]] VideoLlama3Processor
## VideoLlama3Model
[[autodoc]] VideoLlama3Model
- forward
## VideoLlama3VisionModel
[[autodoc]] VideoLlama3VisionModel
- forward
## VideoLlama3ForConditionalGeneration
[[autodoc]] VideoLlama3ForConditionalGeneration
- forward

View File

@ -48,7 +48,7 @@ a unified visual representation, outperforming models designed specifically for
work to provide modest insights into the multi-modal inputs
for the LLM*
## Usage tips:
## Usage tips
- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating.

View File

@ -90,6 +90,11 @@ to fine-tune a VideoMAE model on a custom dataset.
[[autodoc]] VideoMAEImageProcessor
- preprocess
## VideoMAEVideoProcessor
[[autodoc]] VideoMAEVideoProcessor
- preprocess
## VideoMAEModel
[[autodoc]] VideoMAEModel

View File

@ -37,7 +37,7 @@ The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)
## Usage tips:
## Usage tips
- The architecture is similar to the llava architecture except that the multi-modal projector takes a set of concatenated vision hidden states and has an additional layernorm layer on that module.

View File

@ -39,7 +39,7 @@ for more information).
An example application is image captioning, in which the encoder is used to encode the image, after which an autoregressive language model generates
the caption. Another example is optical character recognition. Refer to [TrOCR](trocr), which is an instance of [`VisionEncoderDecoderModel`].
## Randomly initializing `VisionEncoderDecoderModel` from model configurations
[`VisionEncoderDecoderModel`] can be randomly initialized from an encoder and a decoder config. In the following example, we show how to do this using the default [`ViTModel`] configuration for the encoder
and the default [`BertForCausalLM`] configuration for the decoder.
@ -54,7 +54,7 @@ and the default [`BertForCausalLM`] configuration for the decoder.
>>> model = VisionEncoderDecoderModel(config=config)
```
## Initialising `VisionEncoderDecoderModel` from a pretrained encoder and a pretrained decoder
[`VisionEncoderDecoderModel`] can be initialized from a pretrained encoder checkpoint and a pretrained decoder checkpoint. Note that any pretrained Transformer-based vision model, *e.g.* [Swin](swin), can serve as the encoder and both pretrained auto-encoding models, *e.g.* BERT, pretrained causal language models, *e.g.* GPT2, as well as the pretrained decoder part of sequence-to-sequence models, *e.g.* decoder of BART, can be used as the decoder.
Depending on which architecture you choose as the decoder, the cross-attention layers might be randomly initialized.
@ -69,7 +69,7 @@ To do so, the `VisionEncoderDecoderModel` class provides a [`VisionEncoderDecode
... )
```
## Loading an existing `VisionEncoderDecoderModel` checkpoint and perform inference
To load fine-tuned checkpoints of the `VisionEncoderDecoderModel` class, [`VisionEncoderDecoderModel`] provides the `from_pretrained(...)` method just like any other model architecture in Transformers.
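For example, with a community image-captioning checkpoint (one of many possible fine-tuned checkpoints):
```py
>>> from transformers import VisionEncoderDecoderModel

>>> model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
```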

View File

@ -164,7 +164,7 @@ image_pose_result = pose_results[0]
```py
from transformers import AutoProcessor, VitPoseForPoseEstimation
from accelerate import Accelerator
device = Accelerator().device

View File

@ -52,11 +52,13 @@ For the best speedups, we recommend loading the model in half-precision (e.g. `t
On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` and the `google/vivit-b-16x2-kinetics400` model, we saw the following speedups during training and inference.
### Training
| num_training_steps | batch_size | is cuda | Speedup (%) | Eager peak mem (MB) | sdpa peak mem (MB) | Mem saving (%) |
|---------------------:|-------------:|----------:|--------------:|----------------------:|---------------------:|-----------------:|
| 100 | 1 | True | 7.122 | 2575.28 | 5932.54 | 130.364 |
### Inference
| num_batches | batch_size | is cuda | is half | Speedup (%) | Mem eager (MB) | Mem BT (MB) | Mem saved (%) |
|---------------|--------------|-----------|-----------|---------------|------------------|---------------|-----------------|
| 20 | 1 | True | False | 15.422 | 715.807 | 317.079 | 125.75 |

View File

@ -48,7 +48,6 @@ Note: Meta (FAIR) released a new version of [Wav2Vec2-BERT 2.0](https://huggingf
- Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be decoded
using [`Wav2Vec2CTCTokenizer`].
## Using Flash Attention 2
Flash Attention 2 is a faster, optimized attention implementation.

View File

@ -29,7 +29,6 @@ rendered properly in your Markdown viewer.
You can find all the original Whisper checkpoints under the [Whisper](https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013) collection.
> [!TIP]
> Click on the Whisper models in the right sidebar for more examples of how to apply Whisper to different audio tasks.

View File

@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.
## Overview
The X-MOD model was proposed in [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](https://huggingface.co/papers/2205.06266) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe.
X-MOD extends multilingual masked language models like [XLM-R](xlm-roberta) to include language-specific modular components (_language adapters_) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.
X-MOD extends multilingual masked language models like [XLM-R](xlm-roberta) to include language-specific modular components (*language adapters*) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.
The abstract from the paper is the following:

View File

@ -139,10 +139,10 @@ with tempfile.TemporaryDirectory() as tmp_dir:
new_model = AutoModel.from_pretrained(tmp_dir)
```
Sharded checkpoints can also be directly loaded with [`~transformers.modeling_utils.load_sharded_checkpoint`].
Sharded checkpoints can also be directly loaded with [`~transformers.trainer_utils.load_sharded_checkpoint`].
```py
from transformers.modeling_utils import load_sharded_checkpoint
from transformers.trainer_utils import load_sharded_checkpoint
with tempfile.TemporaryDirectory() as tmp_dir:
model.save_pretrained(tmp_dir, max_shard_size="5GB")

View File

@ -1,4 +1,4 @@
# Audio transcriptions with WebUI and `transformers serve`
This guide shows how to do audio transcription for chat purposes, using `transformers serve` and [Open WebUI](https://openwebui.com/). It assumes you have Open WebUI installed on your machine and ready to run. Please refer to the examples above to use the text functionalities of `transformers serve` with Open WebUI -- the instructions are the same.

View File

@ -23,11 +23,11 @@ that the metric applies specifically to classical language models (sometimes cal
models) and is not well defined for masked language models like BERT (see [summary of the models](model_summary)).
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence \\(X = (x_0, x_1, \dots, x_t)\\), then the perplexity of \\(X\\) is,
sequence $X = (x_0, x_1, \dots, x_t)$, then the perplexity of $X$ is,
$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}$$
$$ \text{PPL}(X) = \exp\left\{ -\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) \right\} $$
where \\(\log p_\theta (x_i|x_{<i})\\) is the log-likelihood of the ith token conditioned on the preceding tokens \\(x_{<i}\\) according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing different models.
where $\log p_\theta (x_i|x_{<i})$ is the log-likelihood of the ith token conditioned on the preceding tokens $x_{<i}$ according to our model. Intuitively, it can be thought of as an evaluation of the model's ability to predict uniformly among the set of specified tokens in a corpus. Importantly, this means that the tokenization procedure has a direct impact on a model's perplexity which should always be taken into consideration when comparing different models.
This is also equivalent to the exponentiation of the cross-entropy between the data and model predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
@ -42,11 +42,11 @@ factorizing a sequence and conditioning on the entire preceding subsequence at e
When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of [GPT-2](model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
cannot calculate \\(p_\theta(x_t|x_{<t})\\) directly when \\(t\\) is greater than 1024.
cannot calculate $p_\theta(x_t|x_{<t})$ directly when $t$ is greater than 1024.
Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is \\(k\\), we then approximate the likelihood of a token \\(x_t\\) by conditioning only on the
\\(k-1\\) tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
input size is $k$, we then approximate the likelihood of a token $x_t$ by conditioning only on the
$k-1$ tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.
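As a minimal sketch, that disjoint-chunk approximation can be written as follows with GPT-2 (shown precisely because it is the suboptimal baseline; a strided sliding window reuses context between windows instead):
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

encodings = tokenizer("some long text " * 2000, return_tensors="pt")
max_length = model.config.n_positions  # 1024 for GPT-2

nlls = []
for begin in range(0, encodings.input_ids.size(1), max_length):
    input_ids = encodings.input_ids[:, begin : begin + max_length]
    with torch.no_grad():
        # labels == input_ids: the model shifts them internally for next-token prediction
        nlls.append(model(input_ids, labels=input_ids.clone()).loss)

# Note: a plain mean slightly mis-weights a shorter final chunk
ppl = torch.exp(torch.stack(nlls).mean())
```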

View File

@ -41,19 +41,18 @@ A longer, in-depth article with examples, visualizations and timelines is availa
- On top of those three base classes, the library provides two APIs: [`pipeline`] for quickly
using a model for inference on a given task and [`Trainer`] to quickly train or fine-tune a PyTorch model.
## Core tenets
The following tenets solidified over time, and they're detailed in our new philosophy [blog post](https://huggingface.co/spaces/transformers-community/Transformers-tenets). They guide maintainer decisions when reviewing PRs and contributions.
> - **Source of Truth.** Implementations must be faithful to official results and intended behavior.
>- **One Model, One File.** Core inference/training logic is visible top-to-bottom in the model file users read.
>- **Code is the Product.** Optimize for reading and diff-ing. Prefer explicit names over clever indirection.
>- **Standardize, Don't Abstract.** Keep model-specific behavior in the model. Use shared interfaces only for generic infra.
>- **DRY\*** (Repeat when it helps users). End-user modeling files remain self-contained. Infra is factored out.
>- **Minimal User API.** Few codepaths, predictable kwargs, stable methods.
>- **Backwards Compatibility.** Public surfaces should not break. Old Hub artifacts have to keep working.
>- **Consistent Public Surface.** Naming, outputs, and optional diagnostics are aligned and tested.
> - **One Model, One File.** Core inference/training logic is visible top-to-bottom in the model file users read.
> - **Code is the Product.** Optimize for reading and diff-ing. Prefer explicit names over clever indirection.
> - **Standardize, Don't Abstract.** Keep model-specific behavior in the model. Use shared interfaces only for generic infra.
> - **DRY\*** (Repeat when it helps users). End-user modeling files remain self-contained. Infra is factored out.
> - **Minimal User API.** Few codepaths, predictable kwargs, stable methods.
> - **Backwards Compatibility.** Public surfaces should not break. Old Hub artifacts have to keep working.
> - **Consistent Public Surface.** Naming, outputs, and optional diagnostics are aligned and tested.
## Main classes
@ -64,7 +63,6 @@ The following tenets solidified over time, and they're detailed in our new philo
- **Preprocessing classes** convert the raw data into a format accepted by the model. A [tokenizer](main_classes/tokenizer) stores the vocabulary for each model and provides methods for encoding and decoding strings in a list of token embedding indices. [Image processors](main_classes/image_processor) preprocess vision inputs, [video processors](https://huggingface.co/docs/transformers/en/main_classes/video_processor) preprocess video inputs, [feature extractors](main_classes/feature_extractor) preprocess audio inputs, and [processors](main_classes/processors) preprocess multimodal inputs.
All these classes can be instantiated from pretrained instances, saved locally, and shared on the Hub with three methods:
- `from_pretrained()` lets you instantiate a model, configuration, and preprocessing class from a pretrained version either

View File

@ -66,6 +66,7 @@ For 2 bits, we recommend using `auto-round-best` or `auto-round`.
<hfoption id="quantization auto-round api">
### AutoRound API Usage
This setting offers a better trade-off between accuracy and tuning cost, and is recommended in all scenarios.
```python
@ -98,6 +99,7 @@ autoround.quantize_and_save(output_dir, format='auto_round')
<hfoption id="quantization auto-round-best">
### AutoRoundBest recipe
This setting provides the best accuracy in most scenarios but is 4-5X slower than the standard AutoRound recipe. It is especially recommended for 2-bit quantization and is a good choice if sufficient resources are available.
```python
@ -128,6 +130,7 @@ autoround.quantize_and_save(output_dir, format='auto_round')
<hfoption id="quantization auto-round-light">
### AutoRoundLight recipe
This setting offers the best speed (2-3X faster than AutoRound), but it may cause a significant accuracy drop for small models and 2-bit quantization. It is recommended for 4-bit settings and models larger than 3B.
```python
@ -279,8 +282,10 @@ If you encounter any issues with auto-round, please open an issue on
the [AutoRound](https://github.com/intel/auto-round/issues) repository.
## Acknowledgement
Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
## Contribution
Contributions to [AutoRound](https://github.com/intel/auto-round/pulls) are welcome and greatly appreciated!
Whether it's fixing bugs, improving documentation, adding new features, or suggesting improvements, your help is always valued.

View File

@ -41,6 +41,7 @@ pip install --upgrade transformers accelerate bitsandbytes
To compile from source, follow the instructions in the [bitsandbytes installation guide](https://huggingface.co/docs/bitsandbytes/main/en/installation).
## Hardware Compatibility
bitsandbytes is supported on NVIDIA GPUs for CUDA versions 11.8 - 13.0, Intel XPU, Intel Gaudi (HPU), and CPU. There is an ongoing effort to support additional platforms. If you're interested in providing feedback or testing, check out the [bitsandbytes repository](https://github.com/bitsandbytes-foundation/bitsandbytes) for more information.
### NVIDIA GPUs (CUDA)

View File

@ -30,7 +30,7 @@ The sections below cover quantization schemes, granularity, and techniques.
## Quantization schemes
The core idea is to map the range of values found in the original float32 weights and activations to the much smaller range represented by int8 (typically \\([-128, 127]\\)).
The core idea is to map the range of values found in the original float32 weights and activations to the much smaller range represented by int8 (typically $[-128, 127]$).
This section covers how some quantization techniques work.
@ -40,20 +40,20 @@ This section covers how some quantization techniques work.
### Affine quantization
The most common method is *affine quantization*. For a given float32 tensor (like a layer's weights), it finds the minimum \\(val_{min}\\) and maximum \\(val_{max}\\) values. This range \\([val_{min}, val_{max}]\\) is mapped to the int8 range \\([q_{min}, q_{max}]\\), which is typically \\([-128, 127]\\).
The most common method is *affine quantization*. For a given float32 tensor (like a layer's weights), it finds the minimum $val_{min}$ and maximum $val_{max}$ values. This range $[val_{min}, val_{max}]$ is mapped to the int8 range $[q_{min}, q_{max}]$, which is typically $[-128, 127]$.
There are two main ways to perform this mapping, *symmetric* and *asymmetric*. The choice between symmetric and asymmetric quantization determines how the float32 range is mapped to the int8 range.
- Symmetric: This method assumes the original float32 range is symmetric around zero ( \\([ -a, a ]\\) ). This range is mapped symmetrically to the int8 range, for example, \\([-127, 127]\\). A key characteristic is that the float32 value \\(0.0\\) maps directly to the int8 value \\(0\\). This only requires one parameter, the **scale ( \\(S\\) )**, to define the mapping. It can simplify computations, but it might be less accurate if the original data distribution isn't naturally centered around zero.
- Asymmetric (Affine): This method does not assume the data is centered around zero. It maps the exact range \\([val_{min}, val_{max}]\\) from float32 to the full int8 range, like \\([-128, 127]\\). This requires two parameters, a **scale ( \\(S\\) )** and a **zero-point ( \\(Z\\) )**.
- Symmetric: This method assumes the original float32 range is symmetric around zero ( $[ -a, a ]$ ). This range is mapped symmetrically to the int8 range, for example, $[-127, 127]$. A key characteristic is that the float32 value $0.0$ maps directly to the int8 value $0$. This only requires one parameter, the **scale ( $S$ )**, to define the mapping. It can simplify computations, but it might be less accurate if the original data distribution isn't naturally centered around zero.
- Asymmetric (Affine): This method does not assume the data is centered around zero. It maps the exact range $[val_{min}, val_{max}]$ from float32 to the full int8 range, like $[-128, 127]$. This requires two parameters, a **scale ( $S$ )** and a **zero-point ( $Z$ )**.
scale ( \\(S\\) ): A positive float32 number representing the ratio between the float32 and the int8 range.
scale ( $S$ ): A positive float32 number representing the ratio between the float32 and the int8 range.
$$
S = \frac{val_{max} - val_{min}}{q_{max} - q_{min}}
$$
zero-Point ( \\(Z\\) ): An int8 value that corresponds to the float32 value \\(0.0\\).
zero-point ( $Z$ ): An int8 value that corresponds to the float32 value $0.0$.
$$
Z = q_{min} - round\left(\frac{val_{min}}{S}\right)
@ -62,13 +62,13 @@ $$
> [!TIP]
> In symmetric quantization, Z would typically be fixed at 0.
With these parameters, a float32 value, \\(x\\), can be quantized to int8 ( \\(q\\) ) with the formula below.
With these parameters, a float32 value, $x$, can be quantized to int8 ( $q$ ) with the formula below.
$$
q = round\left(\frac{x}{S} + Z\right)
$$
The int8 value, \\(q\\), can be dequantized back to approximate float32 with the formula below.
The int8 value, $q$, can be dequantized back to approximate float32 with the formula below.
$$
x \approx S \cdot (q - Z)
@ -78,7 +78,7 @@ $$
<img width="606" alt="dequant" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/dequant.png" />
</div>
During inference, computations like matrix multiplication are performed using the int8 values ( \\(q\\) ), and the result is dequantized back to float32 (often using a higher-precision accumulation type like int32 internally) before it is passed to the next layer.
During inference, computations like matrix multiplication are performed using the int8 values ( $q$ ), and the result is dequantized back to float32 (often using a higher-precision accumulation type like int32 internally) before it is passed to the next layer.
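These formulas are easy to express directly. Here is a minimal PyTorch sketch of asymmetric quantization and dequantization, not tied to any particular backend:
```py
import torch

def affine_quantize(x, q_min=-128, q_max=127):
    """Asymmetric (affine) int8 quantization of a float32 tensor."""
    val_min, val_max = x.min(), x.max()
    scale = (val_max - val_min) / (q_max - q_min)           # S
    zero_point = int(q_min - torch.round(val_min / scale))  # Z
    q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max)
    return q.to(torch.int8), scale.item(), zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.to(torch.float32) - zero_point)

w = torch.randn(4, 4)
q, S, Z = affine_quantize(w)
w_hat = dequantize(q, S, Z)  # close to w, up to rounding error
```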
### int4 and weight packing
@ -86,7 +86,7 @@ During inference, computations like matrix multiplication are performed using th
<img width="606" alt="weight packing" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/weight_packing.png" />
</div>
int4 quantization further reduces the model size and memory usage (halving it compared to int8). The same affine or symmetric quantization principles apply, mapping the float32 range to the 16 possible values representable by int4 ( \\([-8, 7]\\) for signed int4).
int4 quantization further reduces the model size and memory usage (halving it compared to int8). The same affine or symmetric quantization principles apply, mapping the float32 range to the 16 possible values representable by int4 ( $[-8, 7]$ for signed int4).
A key aspect of int4 quantization is **weight packing**. Since most hardware can't natively handle 4-bit data types in memory, two int4 values are typically packed together into a single int8 byte for storage and transfer. For example, the first value might occupy the lower 4 bits and the second value the upper 4 bits of the byte (`packed_byte = (val1 & 0x0F) | (val2 << 4)`).
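A sketch of that packing scheme, two signed int4 values per byte:
```py
import torch

def pack_int4(vals):
    """Pack an even-length tensor of int4 values (each in [-8, 7]) into bytes."""
    v = vals.to(torch.int32) & 0x0F        # keep the 4-bit two's-complement patterns
    packed = v[0::2] | (v[1::2] << 4)      # first value -> low bits, second -> high bits
    return packed.to(torch.uint8)

def unpack_int4(packed):
    p = packed.to(torch.int32)
    lo, hi = p & 0x0F, (p >> 4) & 0x0F
    out = torch.stack([lo, hi], dim=-1).flatten()
    return torch.where(out > 7, out - 16, out)  # sign-extend back to [-8, 7]

vals = torch.tensor([-8, 7, 3, -1], dtype=torch.int32)
assert torch.equal(unpack_int4(pack_int4(vals)), vals)
```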
@ -114,10 +114,10 @@ Transformers supports FP8 through specific backends like [FBGEMM](./fbgemm_fp8),
## Granularity
Quantization parameters ( \\(S\\) and \\(Z\\)) can be calculated in one of two ways.
Quantization parameters ( $S$ and $Z$) can be calculated in one of two ways.
- Per-Tensor: One set of \\(S\\) and \\(Z\\) for the entire tensor. Simpler, but less accurate if data values vary greatly within the tensor.
- Per-Channel (or Per-Group/Block): Separate \\(S\\) and \\(Z\\) for each channel or group. More accurate and better performance at the cost of slightly more complexity and memory.
- Per-Tensor: One set of $S$ and $Z$ for the entire tensor. Simpler, but less accurate if data values vary greatly within the tensor.
- Per-Channel (or Per-Group/Block): Separate $S$ and $Z$ for each channel or group. More accurate and better performance at the cost of slightly more complexity and memory.
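As a rough sketch of the difference, for symmetric int8 quantization of a weight matrix:
```py
import torch

w = torch.randn(8, 16)  # e.g. 8 output channels

# Per-tensor: a single scale shared by every element
s_tensor = w.abs().max() / 127

# Per-channel: one scale per output channel (row), tracking each row's own range
s_channel = w.abs().amax(dim=1, keepdim=True) / 127

q_per_tensor = torch.round(w / s_tensor).clamp(-127, 127).to(torch.int8)
q_per_channel = torch.round(w / s_channel).clamp(-127, 127).to(torch.int8)
```
Rows with small magnitudes keep far more resolution under per-channel scales, at the cost of storing one scale per row.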
<div class="flex justify-center">
<img width="625" alt="Granularities" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Granularities.png" />

View File

@ -87,6 +87,7 @@ Create a [`TorchAoConfig`] and specify the quantization type and `group_size` of
We'll show examples for recommended quantization methods based on hardware, e.g. A100 GPU, H100 GPU, CPU.
### H100 GPU
<hfoptions id="examples-H100-GPU">
<hfoption id="float8-dynamic-and-weight-only">
@ -182,6 +183,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
</hfoptions>
### A100 GPU
<hfoptions id="examples-A100-GPU">
<hfoption id="int8-dynamic-and-weight-only">
@ -284,6 +286,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
</hfoptions>
### Intel XPU
<hfoptions id="examples-Intel-XPU">
<hfoption id="int8-dynamic-and-weight-only">
@ -350,6 +353,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
</hfoptions>
### CPU
<hfoptions id="examples-CPU">
<hfoption id="int8-dynamic-and-weight-only">
@ -415,7 +419,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
</hfoptions>
### Per Module Quantization
#### 1. Skip quantization for certain layers
With `ModuleFqnToConfig` we can specify a default configuration for all layers while skipping quantization for certain layers.
```py
@ -485,6 +491,7 @@ print(output_text)
```
#### 3. Quantizing different layers with different quantization configs (with regex)
We can also use a regex to specify the config for all modules whose `module_fqn` matches the regex. All regexes should start with `re:`; for example, `re:layers\..*\.gate_proj` will
match all layers like `layers.0.gate_proj`. See [here](https://github.com/pytorch/ao/blob/2fe0ca0899c730c528efdbec8886feaa38879f39/torchao/quantization/quant_api.py#L2392) for docs.
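A rough sketch of what this can look like; the exact `ModuleFqnToConfig` call and config class below are assumptions, so check the linked torchao docs for the authoritative API:
```py
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig
from transformers import AutoModelForCausalLM, TorchAoConfig

# Assumed mapping: "_default" applies everywhere, "re:" keys override matching modules
quant_type = ModuleFqnToConfig({
    "_default": Int4WeightOnlyConfig(group_size=128),
    r"re:layers\..*\.gate_proj": None,  # e.g. leave every gate_proj unquantized
})

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # illustrative model id
    quantization_config=TorchAoConfig(quant_type=quant_type),
)
```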

View File

@ -61,7 +61,7 @@ The example below fine-tunes [T5-small](https://huggingface.co/google-t5/t5-smal
The example script downloads and preprocesses a dataset, and then fine-tunes a supported model architecture on it with [`Trainer`].
Resuming training from a checkpoint is very useful if training is interrupted because you don't have to start over again:
* `--resume_from_checkpoint path_to_specific_checkpoint` resumes training from a specific checkpoint folder.
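In your own training code, the equivalent is passing `resume_from_checkpoint` to [`~Trainer.train`] (the paths below are illustrative):
```py
# Given an already-constructed Trainer instance named `trainer`:
# resume from a specific checkpoint folder written by a previous run
trainer.train(resume_from_checkpoint="output_dir/checkpoint-1500")

# or let Trainer pick up the latest checkpoint found in output_dir
trainer.train(resume_from_checkpoint=True)
```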

View File

@ -24,6 +24,7 @@ Transformer models can be efficiently deployed using libraries such as vLLM, Tex
You can also serve transformer models with the `transformers serve` CLI. With Continuous Batching, `serve` now delivers solid throughput and latency well suited for evaluation, experimentation, and moderate-load local or self-hosted deployments. While vLLM, SGLang, or other inference engines remain our recommendations for large-scale production, `serve` avoids the extra runtime and operational overhead, and is on track to gain more production-oriented features.
In this document, we dive into the different supported endpoints and modalities; we also cover the setup of several user interfaces that can be used on top of `transformers serve` in the following guides:
- [Jan (text and MCP user interface)](./jan)
- [Cursor (IDE)](./cursor)
- [Open WebUI (text, image, speech user interface)](./open_webui)

View File

@ -110,6 +110,7 @@ on the fly while loading.
Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for.
## Image captioning
Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired
people navigate different situations, for instance, by exploring image content online.

View File

@ -0,0 +1,250 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Training Vision Models Using the Backbone API
Computer vision workflows follow a common pattern. Use a pre-trained backbone for feature extraction ([ViT](../model_doc/vit), [DINOv3](../model_doc/dinov3)). Add a "neck" for feature enhancement. Attach a task-specific head ([DETR](../model_doc/detr) for object detection, [MaskFormer](../model_doc/maskformer) for segmentation).
The Transformers library implements these models and the [backbone API](../backbones) lets you swap different backbones and heads with minimal code.
![Backbone Explanation](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Backbone.png)
This guide combines [DINOv3 with ConvNext architecture](https://huggingface.co/facebook/dinov3-convnext-large-pretrain-lvd1689m) and a [DETR head](https://huggingface.co/facebook/detr-resnet-50). You'll train on the [license plate detection dataset](https://huggingface.co/datasets/merve/license-plates). DINOv3 delivers the best performance as of this writing.
> [!NOTE]
> This model requires access approval. Visit [the model repository](https://huggingface.co/facebook/dinov3-convnext-large-pretrain-lvd1689m) to request access.
Install [trackio](https://github.com/gradio-app/trackio) for experiment tracking and [albumentations](https://albumentations.ai/) for data augmentation. Use the latest transformers version.
```bash
pip install -Uq albumentations trackio transformers datasets
```
Initialize [`DetrConfig`] with the pre-trained DINOv3 ConvNext backbone. Use `num_labels=1` to detect the license plate bounding boxes. Create [`DetrForObjectDetection`] with this configuration. Freeze the backbone to preserve DINOv3 features without updating weights. Load the [`DetrImageProcessor`].
```py
from transformers import DetrConfig, DetrForObjectDetection, AutoImageProcessor
config = DetrConfig(backbone="facebook/dinov3-convnext-large-pretrain-lvd1689m",
use_pretrained_backbone=True, use_timm_backbone=False,
num_labels=1, id2label={0: "license_plate"}, label2id={"license_plate": 0})
model = DetrForObjectDetection(config)
for param in model.model.backbone.parameters():
param.requires_grad = False
image_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
```
Load the dataset and split it for training.
```py
from datasets import load_dataset
ds = load_dataset("merve/license-plates")
ds = ds["train"]
ds = ds.train_test_split(test_size=0.05)
train_dataset = ds["train"]
val_dataset = ds["test"]
len(train_dataset)
# 5867
```
Augment the dataset. Rescale images to a maximum size, flip them, and apply affine transforms. Eliminate invalid bounding boxes and ensure annotations stay clean with `rebuild_objects`.
```py
import albumentations as A
import numpy as np
from PIL import Image
train_aug = A.Compose(
[
A.LongestMaxSize(max_size=1024, p=1.0),
A.HorizontalFlip(p=0.5),
A.Affine(rotate=(-5, 5), shear=(-5, 5), translate_percent=(0.05, 0.05), p=0.5),
],
bbox_params=A.BboxParams(format="coco", label_fields=["category_id"], min_visibility=0.0),
)
def train_transform(batch):
imgs_out, objs_out = [], []
original_imgs, original_objs = batch["image"], batch["objects"]
for i, (img_pil, objs) in enumerate(zip(original_imgs, original_objs)):
img = np.array(img_pil)
labels = [0] * len(objs["bbox"])
out = train_aug(image=img, bboxes=list(objs["bbox"]), category_id=labels)
if len(out["bboxes"]) == 0:
imgs_out.append(img_pil) # if no boxes left after augmentation, use original
objs_out.append(objs)
continue
H, W = out["image"].shape[:2]
clamped = []
for (x, y, w, h) in out["bboxes"]:
x = max(0.0, min(x, W - 1.0))
y = max(0.0, min(y, H - 1.0))
w = max(1.0, min(w, W - x))
h = max(1.0, min(h, H - y))
clamped.append([x, y, w, h])
imgs_out.append(Image.fromarray(out["image"]))
objs_out.append(rebuild_objects(clamped, out["category_id"]))
batch["image"] = imgs_out
batch["objects"] = objs_out
return batch
def rebuild_objects(bboxes, labels):
bboxes = [list(map(float, b)) for b in bboxes]
areas = [float(w*h) for (_, _, w, h) in bboxes]
ids = list(range(len(bboxes)))
return {
"id": ids,
"bbox": bboxes,
"category_id": list(map(int, labels)),
"area": areas,
"iscrowd": [0]*len(bboxes),
}
train_dataset = train_dataset.with_transform(train_transform)
```
Build COCO-style annotations for the image processor.
```py
import torch
def format_annotations(image, objects, image_id):
n = len(objects["id"])
anns = []
iscrowd_list = objects.get("iscrowd", [0] * n)
area_list = objects.get("area", None)
for i in range(n):
x, y, w, h = objects["bbox"][i]
area = area_list[i] if area_list is not None else float(w * h)
anns.append({
"id": int(objects["id"][i]),
"iscrowd": int(iscrowd_list[i]),
"bbox": [float(x), float(y), float(w), float(h)],
"category_id": int(objects.get("category_id", objects.get("category"))[i]),
"area": float(area),
})
return {"image_id": int(image_id), "annotations": anns}
```
Create batches in the data collator. Format annotations and pass them with transformed images to the image processor.
```py
def collate_fn(examples):
images = [example["image"] for example in examples]
ann_batch = [format_annotations(example["image"], example["objects"], example["image_id"]) for example in examples]
inputs = image_processor(images=images, annotations=ann_batch, return_tensors="pt")
return inputs
```
Set up [`TrainingArguments`] with hyperparameters suited for convergence, then initialize the [`Trainer`] with the model, arguments, datasets, and data collator to start training.
```py
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./license-plate-detr-dinov3",
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
num_train_epochs=8,
learning_rate=1e-5,
weight_decay=1e-4,
warmup_steps=500,
eval_strategy="steps",
eval_steps=500,
save_total_limit=2,
dataloader_pin_memory=False,
fp16=True,
report_to="trackio",
load_best_model_at_end=True,
remove_unused_columns=False,
push_to_hub=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
data_collator=collate_fn,
)
trainer.train()
```
Push the trained model and image processor to the Hub.
```py
trainer.push_to_hub()
image_processor.push_to_hub("merve/license-plate-detr-dinov3")
```
Test the model with an object detection pipeline.
```py
from transformers import pipeline
obj_detector = pipeline(
"object-detection", model="merve/license-plate-detr-dinov3"
)
results = obj_detector("https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/license-plates.jpg", threshold=0.05)
print(results)
```
Visualize the results.
```py
from PIL import Image, ImageDraw
import numpy as np
import requests
def plot_results(image, results, threshold):
image = Image.fromarray(np.uint8(image))
draw = ImageDraw.Draw(image)
width, height = image.size
for result in results:
score = result["score"]
label = result["label"]
box = list(result["box"].values())
if score > threshold:
x1, y1, x2, y2 = tuple(box)
draw.rectangle((x1, y1, x2, y2), outline="red")
draw.text((x1 + 5, y1 + 10), f"{score:.2f}", fill="green" if score > 0.7 else "red")
return image
image = Image.open(requests.get("https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/license-plates.jpg", stream=True).raw)
plot_results(image, results, threshold=0.05)
```
![Results](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/backbone_training_results.png)

View File

@ -166,9 +166,9 @@ base vocabulary, we obtain:
```
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In
the example above `"h"` followed by `"u"` is present _10 + 5 = 15_ times (10 times in the 10 occurrences of
the example above `"h"` followed by `"u"` is present *10 + 5 = 15* times (10 times in the 10 occurrences of
`"hug"`, 5 times in the 5 occurrences of `"hugs"`). However, the most frequent symbol pair is `"u"` followed by
`"g"`, occurring _10 + 5 + 5 = 20_ times in total. Thus, the first merge rule the tokenizer learns is to group all
`"g"`, occurring *10 + 5 + 5 = 20* times in total. Thus, the first merge rule the tokenizer learns is to group all
`"u"` symbols followed by a `"g"` symbol together. Next, `"ug"` is added to the vocabulary. The set of words then
becomes
@ -222,8 +222,8 @@ So what does this mean exactly? Referring to the previous example, maximizing th
equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by
its second symbol is the greatest among all symbol pairs. *E.g.* `"u"`, followed by `"g"` would have only been
merged if the probability of `"ug"` divided by `"u"`, `"g"` would have been greater than for any other symbol
pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it _loses_ by merging two symbols
to ensure it's _worth it_.
pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it *loses* by merging two symbols
to ensure it's *worth it*.
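A small sketch of the BPE pair-counting step, using the word frequencies from the running example:
```py
from collections import Counter

# word (split into base-vocabulary symbols) -> frequency in the corpus
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

pair_freqs = Counter()
for symbols, freq in corpus.items():
    for pair in zip(symbols, symbols[1:]):
        pair_freqs[pair] += freq

print(pair_freqs[("h", "u")])  # 15
print(pair_freqs[("u", "g")])  # 20 -> first merge rule: "u" + "g" -> "ug"
```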
<a id='unigram'></a>
@ -257,8 +257,8 @@ likely tokenization in practice, but also offers the possibility to sample a pos
probabilities.
Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of
the words \\(x_{1}, \dots, x_{N}\\) and that the set of all possible tokenizations for a word \\(x_{i}\\) is
defined as \\(S(x_{i})\\), then the overall loss is defined as
the words $x_{1}, \dots, x_{N}$ and that the set of all possible tokenizations for a word $x_{i}$ is
defined as $S(x_{i})$, then the overall loss is defined as
$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$
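As a toy illustration of this loss (the tokenizations and probabilities below are made up):
```py
import math

# S(x_i): the possible tokenizations of each word, with probabilities p(x)
tokenizations_per_word = [
    {"hug": 0.07, "h+ug": 0.05, "hu+g": 0.01},
    {"pun": 0.06, "p+un": 0.04},
]

loss = -sum(math.log(sum(probs.values())) for probs in tokenizations_per_word)
```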

View File

@ -111,6 +111,8 @@ training_args = TrainingArguments(
Create a [`Trainer`] instance and pass it the model, training arguments, training and test datasets, and evaluation function. Call [`~Trainer.train`] to start training.
```py
from transformers import Trainer
trainer = Trainer(
model=model,
args=training_args,

View File

@ -106,10 +106,10 @@ The weight map is the main part of this index, which maps each name of
...
```
If you want to load such a sharded checkpoint directly into a model without using [`~PreTrainedModel.from_pretrained`] (as you would do with `model.load_state_dict()` for a full checkpoint), you should use [`~modeling_utils.load_sharded_checkpoint`]:
If you want to load such a sharded checkpoint directly into a model without using [`~PreTrainedModel.from_pretrained`] (as you would do with `model.load_state_dict()` for a full checkpoint), you should use [`~trainer_utils.load_sharded_checkpoint`]:
```py
>>> from transformers.modeling_utils import load_sharded_checkpoint
>>> from transformers.trainer_utils import load_sharded_checkpoint
>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir, max_shard_size="200MB")

View File

@ -110,11 +110,11 @@ dict_keys(['metadata', 'weight_map'])
```
If you want to load a sharded checkpoint directly into a model without using [`~PreTrainedModel.from_pretrained`]
(the way you would call `model.load_state_dict()` for a full checkpoint), you need to use [`~modeling_utils.load_sharded_checkpoint`].
(the way you would call `model.load_state_dict()` for a full checkpoint), you need to use [`~trainer_utils.load_sharded_checkpoint`].
```py
>>> from transformers.modeling_utils import load_sharded_checkpoint
>>> from transformers.trainer_utils import load_sharded_checkpoint
>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir, max_shard_size="200MB")

View File

@ -127,7 +127,3 @@ By PyTorch design, this feature is only available for floating point dtypes
## Pushing to the Hub
[[autodoc]] utils.PushToHubMixin
## Sharded checkpoints
[[autodoc]] modeling_utils.load_sharded_checkpoint

View File

@ -46,7 +46,3 @@ rendered properly in your Markdown viewer.
## Pushing to the Hub
[[autodoc]] utils.PushToHubMixin
## Sharded checkpoints
[[autodoc]] modeling_utils.load_sharded_checkpoint

View File

@ -178,10 +178,10 @@ with tempfile.TemporaryDirectory() as tmp_dir:
new_model = AutoModel.from_pretrained(tmp_dir)
```
Sharded checkpoints can also be loaded directly with [`~transformers.modeling_utils.load_sharded_checkpoint`].
Sharded checkpoints can also be loaded directly with [`~transformers.trainer_utils.load_sharded_checkpoint`].
```py
from transformers.modeling_utils import load_sharded_checkpoint
from transformers.trainer_utils import load_sharded_checkpoint
with tempfile.TemporaryDirectory() as tmp_dir:
model.save_pretrained(tmp_dir, max_shard_size="5GB")

View File

@ -105,11 +105,11 @@ dict_keys(['metadata', 'weight_map'])
...
```
If you want to load such a sharded checkpoint directly into a model without using [`PreTrainedModel.from_pretrained`] (the way you would call `model.load_state_dict()` for a full checkpoint), you should use [`modeling_utils.load_sharded_checkpoint`]:
If you want to load such a sharded checkpoint directly into a model without using [`PreTrainedModel.from_pretrained`] (the way you would call `model.load_state_dict()` for a full checkpoint), you should use [`trainer_utils.load_sharded_checkpoint`]:
```py
>>> from transformers.modeling_utils import load_sharded_checkpoint
>>> from transformers.trainer_utils import load_sharded_checkpoint
>>> with tempfile.TemporaryDirectory() as tmp_dir:
... model.save_pretrained(tmp_dir, max_shard_size="200MB")

View File

@ -109,6 +109,3 @@ model = AutoModel.from_config(config)
## Pushing to the Hub
[[autodoc]] utils.PushToHubMixin
## Sharded checkpoints
[[autodoc]] modeling_utils.load_sharded_checkpoint

View File

@ -39,7 +39,10 @@ from transformers import (
DataCollatorForPermutationLanguageModeling,
DataCollatorForWholeWordMask,
HfArgumentParser,
LineByLineTextDataset,
LineByLineWithRefDataset,
PreTrainedTokenizer,
TextDataset,
Trainer,
TrainingArguments,
set_seed,

View File

@ -16,8 +16,8 @@ export TPU_NUM_CORES=8
# the proper usage is documented in the README, you need to specify data_dir, output_dir and model_name_or_path
# run ./finetune_tpu.sh --help to see all the possible options
python xla_spawn.py --num_cores $TPU_NUM_CORES \
finetune_trainer.py \
# To specify the number of cores to use, use the TPU_NUM_DEVICES environment variable
python xla_spawn.py finetune_trainer.py \
--learning_rate=3e-5 \
--do_train --do_eval \
--eval_strategy steps \

View File

@ -295,9 +295,7 @@ def main():
data_args=data_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=Seq2SeqDataCollator(
tokenizer, data_args, model.config.decoder_start_token_id, training_args.tpu_num_cores
),
data_collator=Seq2SeqDataCollator(tokenizer, data_args, model.config.decoder_start_token_id),
compute_metrics=compute_metrics_fn,
processing_class=tokenizer,
)

View File

@ -16,10 +16,8 @@ export WANDB_PROJECT=distil-marian
export BS=64
export m=sshleifer/student_marian_en_ro_6_3
export MAX_LEN=128
export TPU_NUM_CORES=8
python xla_spawn.py --num_cores $TPU_NUM_CORES \
finetune_trainer.py \
python xla_spawn.py finetune_trainer.py \
--tokenizer_name $m --model_name_or_path $m \
--data_dir $ENRO_DIR \
--output_dir marian_en_ro_6_3 \

View File

@ -279,7 +279,7 @@ class Seq2SeqDataset(AbstractSeq2SeqDataset):
class Seq2SeqDataCollator:
def __init__(self, tokenizer, data_args, decoder_start_token_id, tpu_num_cores=None):
def __init__(self, tokenizer, data_args, decoder_start_token_id):
self.tokenizer = tokenizer
self.pad_token_id = tokenizer.pad_token_id
self.decoder_start_token_id = decoder_start_token_id
@ -287,7 +287,6 @@ class Seq2SeqDataCollator:
f"pad_token_id is not defined for ({self.tokenizer.__class__.__name__}), it must be defined."
)
self.data_args = data_args
self.tpu_num_cores = tpu_num_cores
self.dataset_kwargs = {"add_prefix_space": True} if isinstance(tokenizer, BartTokenizer) else {}
if data_args.src_lang is not None:
self.dataset_kwargs["src_lang"] = data_args.src_lang
@ -336,7 +335,7 @@ class Seq2SeqDataCollator:
tgt_texts=[x["tgt_texts"] for x in batch],
max_length=self.data_args.max_source_length,
max_target_length=self.data_args.max_target_length,
padding="max_length" if self.tpu_num_cores is not None else "longest", # TPU hack
padding="longest",
return_tensors="pt",
**self.dataset_kwargs,
)

View File

@ -73,9 +73,9 @@ def main():
mod = importlib.import_module(mod_name)
# Patch sys.argv
sys.argv = [args.training_script] + args.training_script_args + ["--tpu_num_cores", str(args.num_cores)]
sys.argv = [args.training_script] + args.training_script_args
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
xmp.spawn(mod._mp_fn, args=())
if __name__ == "__main__":

View File

@ -26,22 +26,25 @@ from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from transformers.generation.continuous_batching.requests import logger
# MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"
SLIDING_WINDOW = 0
MODEL_ID = "google/gemma-2-2b-it" if SLIDING_WINDOW > 0 else "Qwen/Qwen3-4B-Instruct-2507"
MODEL_ID = "google/gemma-2-2b-it" if SLIDING_WINDOW > 0 else "meta-llama/Meta-Llama-3-8B"
FORCE_MAX_LENGTH = False # should be False unless you are debugging sliding window features
SKIP_SPECIAL_TOKENS = False
def generate_simple(
attn_impl: str, simple_batch_inputs: list[int], generation_config: GenerationConfig
) -> dict[str, str]:
attn_impl = {
"sdpa_paged": "sdpa",
"eager_paged": "eager",
"sdpa": "sdpa",
"eager": "eager",
"paged_attention": "eager", # TODO: this does not work on AMD docker
"flash_paged": "flash_attention_2", # TODO: this does not work on AMD docker
"kernels-community/flash-attn": "eager",
}[attn_impl]
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16, attn_implementation=attn_impl)
@ -56,7 +59,7 @@ def generate_simple(
# attention_mask = torch.ones_like(input_ids)
outputs = model.generate(input_ids, generation_config=generation_config, use_model_defaults=False)
generated_tokens = outputs[0][input_ids.shape[1] :]
decoded_output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
decoded_output = tokenizer.decode(generated_tokens, skip_special_tokens=SKIP_SPECIAL_TOKENS)
decoded_outputs[key] = decoded_output
return decoded_outputs
@ -99,7 +102,6 @@ def batch_generate(
displayed_samples: int = 0, # -1: no display, 0: display stats, >0: display inputs and some outputs
output_file: Optional[str] = None,
expected_outputs: Optional[list[str]] = None,
slice_inputs: bool = True,
) -> tuple[float, float]:
# Actual batch generation
if displayed_samples >= 0:
@ -108,7 +110,6 @@ def batch_generate(
batch_outputs = model.generate_batch(
inputs=simple_batch_inputs,
generation_config=generation_config,
slice_inputs=slice_inputs, # TODO: move this to the generation config
)
end_time_simple = time.time()
if displayed_samples >= 0:
@ -118,19 +119,21 @@ def batch_generate(
token_count = 0
data = []
for i, request in enumerate(batch_outputs):
input_text = tokenizer.decode(batch_outputs[request].prompt_ids, skip_special_tokens=True)
input_text = tokenizer.decode(batch_outputs[request].prompt_ids, skip_special_tokens=SKIP_SPECIAL_TOKENS)
# The key is used to tie back to the output of unbatched generation
key = " ".join(map(str, batch_outputs[request].prompt_ids))
data.append({"input": input_text, "key": key})
# Try to decode the output
try:
output_text = tokenizer.decode(batch_outputs[request].generated_tokens, skip_special_tokens=True)
output_text = tokenizer.decode(
batch_outputs[request].generated_tokens, skip_special_tokens=SKIP_SPECIAL_TOKENS
)
token_count += len(batch_outputs[request].generated_tokens[1:])
data[-1]["output"] = output_text
data[-1]["cb_outputs"] = output_text
except Exception as e:
print(f"Decoding failed for request {request}: {e}")
data[-1]["output"] = "__ERROR__"
data[-1]["cb_outputs"] = "__ERROR__"
continue
# Display sample if asked
@ -148,7 +151,7 @@ def batch_generate(
if expected_outputs is not None:
expected_output = expected_outputs.pop(key)
matches = output_text == expected_output # TODO: rework this for a better distance metric
data[-1]["ref"] = expected_output
data[-1]["without_cb"] = expected_output
data[-1]["matches"] = matches
data[-1].pop("key")
print(f"Request {i} matches" if matches else f"Request {i} does NOT match!")
@ -186,19 +189,20 @@ if __name__ == "__main__":
parser.add_argument("--attn", type=str, default="kernels-community/flash-attn", help="Attention implementation")
parser.add_argument("--matmul-precision", "-mp", type=str, default="high") # set to "none" to disable
parser.add_argument("--no-slice-inputs", action="store_true") # slicing is enabled by default because much faster
parser.add_argument("--use-cuda-graph", "-cg", action="store_true")
parser.add_argument("--compile", action="store_true")
parser.add_argument("--cuda-graph", "-cg", help="Use cuda graphs", type=str, default=None)
parser.add_argument("--compile", action="store_true", help="Compile the model using torch.compile")
parser.add_argument("--samples", type=int, default=500)
parser.add_argument("--samples", type=int, default=500, help="Number of samples to generate")
parser.add_argument("--displayed", type=int, default=0, help="Number of samples to display")
parser.add_argument("--log-level", type=str, default="INFO")
parser.add_argument("--output-file", type=str, default=None)
parser.add_argument("--compare", action="store_true")
parser.add_argument("--metrics", action="store_true")
parser.add_argument("--profile", type=str, default=None)
args = parser.parse_args()
args.slice_inputs = not args.no_slice_inputs
# Set log level
logger.setLevel(args.log_level.upper())
# If turned on, we setup metrics
if args.metrics:
@ -207,6 +211,15 @@ if __name__ == "__main__":
# Set matmul precision if not none
if args.matmul_precision != "none":
torch.set_float32_matmul_precision(args.matmul_precision)
# Parse cuda graph argument
if args.cuda_graph is not None:
use_cuda_graph = {
"none": None,
"yes": True, "y": True, "true": True, "t": True, "1": True,
"no": False, "n": False, "false": False, "f": False, "0": False,
}[args.cuda_graph.lower()] # fmt: skip
else:
use_cuda_graph = None
# Prepare model
model = AutoModelForCausalLM.from_pretrained(
@ -222,9 +235,6 @@ if __name__ == "__main__":
# If turned on, we compile the model
if args.compile:
model.forward = torch.compile(model.forward, mode="max-autotune-no-cudagraphs")
if args.slice_inputs:
assert not args.compile, "Slicing inputs requires is not the model to be compiled"
assert not args.use_cuda_graph, "Slicing inputs is not compatible with cuda graphs"
# Prepare tokenizer and dataset
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
@ -237,10 +247,10 @@ if __name__ == "__main__":
# Prepare generation config
generation_config = GenerationConfig(
max_new_tokens=512,
use_cuda_graph=args.use_cuda_graph,
use_cuda_graph=use_cuda_graph,
eos_token_id=tokenizer.pad_token_id if FORCE_MAX_LENGTH else tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
do_sample=True,
do_sample=not args.compare,
temperature=0.8,
top_p=0.9,
num_blocks=args.num_blocks,
@ -265,7 +275,6 @@ if __name__ == "__main__":
generation_config,
tokenizer,
displayed_samples=-1,
slice_inputs=args.slice_inputs,
)
if args.profile is not None:
@ -282,12 +291,11 @@ if __name__ == "__main__":
displayed_samples=args.displayed,
output_file=args.output_file,
expected_outputs=expected_outputs,
slice_inputs=args.slice_inputs,
)
if args.profile is not None:
filename = args.profile if args.profile.endswith(".json") else args.profile + ".json"
prof.export_chrome_trace(filename)
# Example usage:
# python examples/pytorch/continuous_batching.py --attn sdpa_paged -mp none --slice-inputs --samples 3 --compare
# python examples/pytorch/continuous_batching.py --attn sdpa_paged -mp none --samples 3 --compare
# python examples/pytorch/continuous_batching.py --num-blocks 369 --max-batch-tokens 23 --attn sdpa_paged -mp none --samples 1 --displayed 0 --output-file sliced.json

View File

@ -68,7 +68,6 @@ if __name__ == "__main__":
_ = model.generate_batch(
inputs=simple_batch_inputs[: min(5, args.samples)],
generation_config=generation_config,
slice_inputs=True,
)
# Actual batch generation
@ -77,7 +76,6 @@ if __name__ == "__main__":
batch_outputs = model.generate_batch(
inputs=simple_batch_inputs,
generation_config=generation_config,
slice_inputs=True,
)
end_time = time.time()
print("Done with batch generation.")

View File

@ -24,6 +24,8 @@ objectives in our [model summary](https://huggingface.co/transformers/model_summ
There are two sets of scripts provided. The first set leverages the Trainer API. The second set with `no_trainer` in the suffix uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
**Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/main/examples/legacy/run_language_modeling.py).
The following examples will run on datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
text files for training and validation. We give examples of both below.

View File

@ -73,7 +73,7 @@ def main():
mod = importlib.import_module(mod_name)
# Patch sys.argv
sys.argv = [args.training_script] + args.training_script_args + ["--tpu_num_cores", str(args.num_cores)]
sys.argv = [args.training_script] + args.training_script_args
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)

View File

@ -227,7 +227,7 @@ The model itself is a PyTorch [`nn.Module`](https://pytorch.org/doc
1. Choose the right framework for each part of a model's life:
- Train state-of-the-art models in 3 lines of code.
- Move a single model between the TF2.0/PyTorch/JAX frameworks at will.
- Easily pick the right framework for training, evaluation, and production.
1. Easily customize a model or an example to your needs:
@ -237,7 +237,7 @@ The model itself is a PyTorch [`nn.Module`](https://pytorch.org/doc
## Why shouldn't I use transformers?
- This library is not a modular toolbox of building blocks for neural networks. The code in the model files is deliberately not refactored with additional abstractions, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
- The training API is not meant to work with any model; it is optimized to work with the models provided by the library. For generic machine learning loops, you should use another library (possibly [Accelerate](https://huggingface.co/docs/accelerate)).
- While we strive to cover as many use cases as possible, the scripts in our [examples folder](https://github.com/huggingface/transformers/tree/main/examples) are just that: examples. They are not expected to work out of the box on your specific problem, and you will probably need to change a few lines of code to adapt them to your needs.

View File

@ -28,25 +28,25 @@ Pull Request so it can be included under the Community notebooks.
You can open any page of the documentation as a notebook in Colab (there is a button directly on said pages) but they are also listed here if you need them:
| Notebook | Description | | |
|:----------|:-------------|:-------------|------:|
| Notebook | Description | | | |
|:----------|:-------------|:-------------|:------------|------:|
| [Quicktour of the library](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb) | A presentation of the various APIs in Transformers |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/en/transformers_doc/quicktour.ipynb)|
| [Summary of the tasks](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb) | How to run the models of the Transformers library task by task |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb)|
| [Preprocessing data](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb) | How to use a tokenizer to preprocess your data |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/preprocessing.ipynb)|
| [Fine-tuning a pretrained model](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb) | How to use the Trainer to fine-tune a pretrained model |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/training.ipynb)|
| [Summary of the tokenizers](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb) | The differences between the tokenizers algorithm |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)|
| [Multilingual models](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb) | How to use the multilingual models of the library |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)|
| [Summary of the tokenizers](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb) | The differences between the tokenizers algorithm |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)|[![Open in AMD Dev Cloud](https://oneclickamd.ai/static/amd.svg?v=2)](http://oneclickamd.ai/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)|
| [Multilingual models](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb) | How to use the multilingual models of the library |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)|[![Open in AMD Dev Cloud](https://oneclickamd.ai/static/amd.svg)](http://oneclickamd.ai/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)|
### PyTorch Examples
#### Natural Language Processing[[pytorch-nlp]]
- | Notebook | Description | | |
- |:----------|:-------------|:-------------|------:|
- | [Train your tokenizer](https://github.com/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb) | How to train and use your very own tokenizer |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)|
- | [Train your language model](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) | How to easily start using transformers |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb)|
+ | Notebook | Description | | | |
+ |:----------|:-------------|:-------------|:-------------|------:|
+ | [Train your tokenizer](https://github.com/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb) | How to train and use your very own tokenizer |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)|[![Open in AMD Dev Cloud](https://oneclickamd.ai/static/amd.svg?v=2)](http://oneclickamd.ai/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb)|
+ | [Train your language model](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) | How to easily start using transformers |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb)|[![Open in AMD Dev Cloud](https://oneclickamd.ai/static/amd.svg?v=2)](http://oneclickamd.ai/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb)|
| [How to fine-tune a model on text classification](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on any GLUE task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)|
- | [How to fine-tune a model on language modeling](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on a causal or masked LM task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)|
+ | [How to fine-tune a model on language modeling](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on a causal or masked LM task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)|[![Open in AMD Dev Cloud](https://oneclickamd.ai/static/amd.svg?v=2)](http://oneclickamd.ai/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)|
| [How to fine-tune a model on token classification](https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on a token classification task (NER, PoS). | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)|
| [How to fine-tune a model on question answering](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on SQUAD. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)|
| [How to fine-tune a model on multiple choice](https://github.com/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)| Show how to preprocess the data and fine-tune a pretrained model on SWAG. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)|

setup.py

@@ -170,7 +170,7 @@ _deps = [
"tiktoken",
"timm<=1.0.19,!=1.0.18",
"tokenizers>=0.22.0,<=0.23.0",
- "torch>=2.2",
+ "torch>=2.2,<2.9",
"torchaudio",
"torchvision",
"pyctcdecode>=0.4.0",

src/transformers/__init__.py

@@ -375,8 +375,13 @@ else:
_import_structure["data.datasets"] = [
"GlueDataset",
"GlueDataTrainingArguments",
+ "LineByLineTextDataset",
+ "LineByLineWithRefDataset",
+ "LineByLineWithSOPTextDataset",
"SquadDataset",
"SquadDataTrainingArguments",
+ "TextDataset",
+ "TextDatasetForNextSentencePrediction",
]
_import_structure["generation"].extend(
[
@@ -522,8 +527,13 @@ if TYPE_CHECKING:
from .data.data_collator import default_data_collator as default_data_collator
from .data.datasets import GlueDataset as GlueDataset
from .data.datasets import GlueDataTrainingArguments as GlueDataTrainingArguments
+ from .data.datasets import LineByLineTextDataset as LineByLineTextDataset
+ from .data.datasets import LineByLineWithRefDataset as LineByLineWithRefDataset
+ from .data.datasets import LineByLineWithSOPTextDataset as LineByLineWithSOPTextDataset
from .data.datasets import SquadDataset as SquadDataset
from .data.datasets import SquadDataTrainingArguments as SquadDataTrainingArguments
+ from .data.datasets import TextDataset as TextDataset
+ from .data.datasets import TextDatasetForNextSentencePrediction as TextDatasetForNextSentencePrediction
from .feature_extraction_sequence_utils import SequenceFeatureExtractor as SequenceFeatureExtractor
# Feature Extractor
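With these entries restored, the legacy dataset classes resolve from the top-level package again. A minimal sketch (the classes still emit a `FutureWarning` on instantiation, per the deprecation notice in their module):

```python
# Sketch: the restored re-exports make these top-level imports work again.
from transformers import (
    GlueDataset,
    LineByLineTextDataset,
    SquadDataset,
    TextDataset,
    TextDatasetForNextSentencePrediction,
)
```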

src/transformers/cache_utils.py

@@ -176,6 +176,11 @@ class DynamicSlidingWindowLayer(DynamicLayer):
super().__init__()
self.sliding_window = sliding_window
self.cumulative_length = 0
+ self._sliding_window_tensor = torch.tensor(self.sliding_window, dtype=torch.long)
+ def lazy_initialization(self, key_states: torch.Tensor) -> None:
+ super().lazy_initialization(key_states)
+ self._sliding_window_tensor = self._sliding_window_tensor.to(self.device)
def update(
self,
@@ -932,7 +937,7 @@ class DynamicCache(Cache):
def __init__(
self,
- ddp_cache_data: Optional[Iterable[tuple[torch.Tensor, torch.Tensor]]] = None,
+ ddp_cache_data: Optional[Iterable[tuple[Optional[torch.Tensor], torch.Tensor, torch.Tensor]]] = None,
config: Optional[PreTrainedConfig] = None,
offloading: bool = False,
offload_only_non_sliding: bool = False,
@@ -965,10 +970,15 @@ class DynamicCache(Cache):
# In this case, use the passed data to already fill in the Cache
if ddp_cache_data is not None:
# Init all the layers with the data
- for layer_idx, (key_states, value_states) in enumerate(ddp_cache_data):
- # If the config was not passed above, initialize a DynamicLayer for each entry of the ddp_data
+ for layer_idx, (sliding_window_tensor, key_states, value_states) in enumerate(ddp_cache_data):
+ # If the config was not passed above, initialize a new cache layer for each entry of the ddp_data
if config is None:
- layers.append(DynamicLayer())
+ if sliding_window_tensor is not None:
+ # Since the same layer is dispatched across replicas, sliding_window is the same for all
+ sliding_window = sliding_window_tensor[0].item()
+ layers.append(DynamicSlidingWindowLayer(sliding_window=sliding_window))
+ else:
+ layers.append(DynamicLayer())
# Update the layer with the data
_, _ = layers[layer_idx].update(key_states, value_states)
@@ -982,6 +992,10 @@ class DynamicCache(Cache):
else:
super().__init__(layers=layers, offloading=offloading, offload_only_non_sliding=offload_only_non_sliding)
+ def __iter__(self):
+ for layer in self.layers:
+ yield getattr(layer, "_sliding_window_tensor", None), layer.keys, layer.values
class StaticCache(Cache):
"""

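Taken together, the new `__iter__` and the widened `ddp_cache_data` type let a cache round-trip as 3-tuples per layer, with the first element flagging sliding-window layers. A minimal sketch, assuming `(batch, heads, seq, dim)` shapes and that the constructor path behaves as shown in the hunk above:

```python
# Sketch of the new round-trip: (sliding_window_tensor, keys, values) per layer.
import torch
from transformers.cache_utils import DynamicCache

# A full-attention layer serializes with None in the first slot ...
ddp_data = [(None, torch.randn(1, 2, 4, 8), torch.randn(1, 2, 4, 8))]
rebuilt = DynamicCache(ddp_cache_data=ddp_data)  # config=None -> one DynamicLayer

# ... and iterating the rebuilt cache yields the same 3-tuple layout back.
for sliding_window_tensor, keys, values in rebuilt:
    assert sliding_window_tensor is None and keys.shape == (1, 2, 4, 8)
```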
src/transformers/commands/serving.py

@@ -26,7 +26,7 @@ import threading
import time
import uuid
from argparse import ArgumentParser, Namespace
- from collections.abc import AsyncGenerator, Generator, Iterable
+ from collections.abc import Generator, Iterable
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from io import BytesIO
@@ -35,6 +35,7 @@ from typing import Optional, TypedDict, Union
from huggingface_hub import model_info
from huggingface_hub.constants import HF_HUB_OFFLINE
from openai.types.chat.chat_completion import Choice
+ from tokenizers.decoders import DecodeStream
import transformers
@@ -90,14 +91,16 @@ if serve_dependencies_available:
from fastapi.responses import JSONResponse, StreamingResponse
from openai.types.audio.transcription import Transcription
from openai.types.audio.transcription_create_params import TranscriptionCreateParamsBase
- from openai.types.chat import ChatCompletionMessageParam
+ from openai.types.chat import ChatCompletion, ChatCompletionMessage, ChatCompletionMessageParam
from openai.types.chat.chat_completion_chunk import (
ChatCompletionChunk,
- Choice,
ChoiceDelta,
ChoiceDeltaToolCall,
ChoiceDeltaToolCallFunction,
)
+ from openai.types.chat.chat_completion_chunk import (
+ Choice as ChoiceChunk,
+ )
from openai.types.chat.completion_create_params import CompletionCreateParamsStreaming
from openai.types.responses import (
Response,
@@ -345,8 +348,11 @@ class TimedModel:
self._timer.cancel()
def timeout_reached(self):
- self.delete_model()
- logger.info(f"{self._name_or_path} was removed from memory after {self.timeout_seconds} seconds of inactivity")
+ if self.timeout_seconds > 0:
+ self.delete_model()
+ logger.info(
+ f"{self._name_or_path} was removed from memory after {self.timeout_seconds} seconds of inactivity"
+ )
def is_deleted(self):
"""Check if the instances have been deleted."""
@@ -412,9 +418,13 @@ class ServeArguments:
# Serving settings
host: str = field(default="localhost", metadata={"help": "Interface the server will listen to."})
port: int = field(default=8000, metadata={"help": "Port the server will listen to."})
- model_timeout: int = field(
- default=300,
- metadata={"help": "Time in seconds after which a model will be removed from memory."},
+ model_timeout: Optional[int] = field(
+ default=None,
+ metadata={
+ "help": "Time in seconds after which a model will be removed from memory; defaults to 300 unless "
+ "`force_model` is set, in which case the model will not be removed from memory unless a value "
+ "is specified here."
+ },
)
# Other settings
@@ -512,6 +522,14 @@ class ServeCommand(BaseTransformersCLICommand):
self.last_kv_cache = None
self.last_model = None
+ if self.args.model_timeout is None:
+ self.args.model_timeout = -1 if self.args.force_model else 300
+ if self.args.force_model:
+ model_id_and_revision = self.process_model_name(self.args.force_model)
+ self.last_model = model_id_and_revision
+ self.load_model_and_processor(model_id_and_revision)
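In plain terms, the resolution added here is: an explicit `model_timeout` always wins; otherwise `force_model` pins the model forever (-1), and the default is 300 seconds. A standalone restatement of that logic (hypothetical helper, not in the diff):

```python
def resolve_model_timeout(model_timeout: int | None, force_model: str | None) -> int:
    # Mirrors the __init__ logic above: None -> never evict (-1) when a model
    # is pinned via force_model, else the 300 s default; explicit values win.
    if model_timeout is None:
        return -1 if force_model else 300
    return model_timeout

assert resolve_model_timeout(None, None) == 300
assert resolve_model_timeout(None, "some/model") == -1
assert resolve_model_timeout(60, "some/model") == 60
```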
def _validate_request(
self,
request: dict,
@@ -595,7 +613,7 @@ class ServeCommand(BaseTransformersCLICommand):
tool_calls: Optional[list["ChoiceDeltaToolCall"]] = None,
decode_stream: Optional[DecodeStream] = None,
tokenizer: Optional[PreTrainedTokenizerFast] = None,
- ) -> str:
+ ) -> ChatCompletionChunk:
"""
Builds a chunk of a streaming OpenAI Chat Completion response.
@@ -621,12 +639,13 @@ class ServeCommand(BaseTransformersCLICommand):
"""
if decode_stream is not None and content is not None and tokenizer is not None:
content = decode_stream.step(tokenizer._tokenizer, content)
chunk = ChatCompletionChunk(
id=request_id,
created=int(time.time()),
model=model,
choices=[
Choice(
ChoiceChunk(
delta=ChoiceDelta(
content=content,
role=role,
@@ -639,23 +658,25 @@ class ServeCommand(BaseTransformersCLICommand):
system_fingerprint="",
object="chat.completion.chunk",
)
- return f"data: {chunk.model_dump_json(exclude_none=True)}\n\n"
- def build_response_event(self, response: "BaseModel") -> str:
+ return chunk
+ @staticmethod
+ def chunk_to_sse_element(chunk: ChatCompletionChunk | BaseModel) -> str:
"""
- Builds a event of a streaming OpenAI Response response.
+ Builds an event of a streaming OpenAI Response model or a ChatCompletion chunk.
IMPORTANT: The serialized chunk won't contain empty fields (fields with `None`). Some downstream apps,
like Cursor, assume that when the field exists, it has data.
Args:
- response (`BaseModel`):
+ chunk (`BaseModel` or `ChatCompletionChunk`):
The response to build an event from. One of the multiple OpenAI Response output types
Returns:
`str`: The built chunk, a string containing a JSON string with the payload.
"""
return f"data: {response.model_dump_json(exclude_none=True)}\n\n"
return f"data: {chunk.model_dump_json(exclude_none=True)}\n\n"
def run(self):
"""
@@ -668,6 +689,7 @@ class ServeCommand(BaseTransformersCLICommand):
- POST /v1/responses: Generates responses.
- POST /v1/audio/transcriptions: Generates transcriptions from audio.
- GET /v1/models: Lists available models for 3rd party tools.
+ - GET /health: Health check.
Requires FastAPI and Uvicorn to be installed.
"""
@@ -703,10 +725,9 @@ class ServeCommand(BaseTransformersCLICommand):
self.validate_chat_completion_request(request=body)
if self.use_continuous_batching:
- output = self.continuous_batching_chat_completion(body, request.state.request_id)
+ return self.continuous_batching_chat_completion(body, request.state.request_id)
else:
- output = self.generate_chat_completion(body)
- return StreamingResponse(output, media_type="text/event-stream")
+ return self.generate_chat_completion(body)
@app.post("/v1/responses")
def responses(request: dict):
@@ -803,7 +824,7 @@ class ServeCommand(BaseTransformersCLICommand):
for model in model_infos
]
- def continuous_batching_chat_completion(self, req: dict, request_id: str) -> AsyncGenerator[str, None]:
+ def continuous_batching_chat_completion(self, req: dict, request_id: str) -> StreamingResponse | JSONResponse:
"""
Generates an OpenAI Chat Completion using continuous batching.
@@ -816,14 +837,16 @@ class ServeCommand(BaseTransformersCLICommand):
model_id_and_revision = self.process_model_name(req["model"])
must_discard_cache = model_id_and_revision != self.last_model
self.last_model = model_id_and_revision
- # When switching models, terminate a continuous batching manager if it is running.
if must_discard_cache:
+ # When switching models, terminate a continuous batching manager if it is running.
+ if self.running_continuous_batching_manager is not None:
self.running_continuous_batching_manager.stop(block=True, timeout=2)
+ self.running_continuous_batching_manager = None
- model, processor = self.load_model_and_processor(model_id_and_revision)
+ model, processor = self.load_model_and_processor(model_id_and_revision)
tokenizer = processor.tokenizer if hasattr(processor, "tokenizer") else processor
generation_config = create_generation_config_from_req(
@@ -838,18 +861,17 @@ class ServeCommand(BaseTransformersCLICommand):
if self.running_continuous_batching_manager is None:
self.running_continuous_batching_manager = model.init_continuous_batching(
- generation_config=generation_config, streaming=True
+ generation_config=generation_config
)
- # TODO (Joao, Lysandre): the logits processors should be fixed in continuous batching
- # and correctly applied in non-cb
+ # TODO (Joao, Lysandre): the logits processors should be fixed in continuous batching and correctly applied in non-cb
self.running_continuous_batching_manager.logit_processor = LogitsProcessorList()
self.running_continuous_batching_manager.start()
# TODO (Joao, Lysandre): this should also work with tool support
inputs = processor.apply_chat_template(req["messages"], return_tensors="pt", add_generation_prompt=True).to(
model.device
- )
+ )[0]
def stream_chat_completion(request_id, decode_stream):
try:
@@ -879,21 +901,61 @@ class ServeCommand(BaseTransformersCLICommand):
self.running_continuous_batching_manager.cancel_request(request_id)
yield f'data: {{"error": "{str(e)}"}}'
- async def cancellation_wrapper(_inputs, request_id):
- try:
- decode_stream = DecodeStream(_inputs.tolist(), False)
- # XXX: using returned request_id as safety in case it is None
- request_id = self.running_continuous_batching_manager.add_request(
- _inputs, request_id=request_id, max_new_tokens=generation_config.max_new_tokens
- )
- for chunk in stream_chat_completion(request_id, decode_stream):
- yield chunk
- await asyncio.sleep(0) # Yield control to the event loop to check for cancellations
- except asyncio.CancelledError:
- self.running_continuous_batching_manager.cancel_request(request_id)
- logger.warning(f"Request {request_id} was cancelled.")
+ def buffer_chat_completion(_request_id):
+ result = None
+ while self.running_continuous_batching_manager.is_running() and result is None:
+ result = self.running_continuous_batching_manager.get_result(request_id=_request_id, timeout=1)
- return cancellation_wrapper(inputs[0], request_id)
+ content = tokenizer.decode(result.generated_tokens)
+ chat_completion_result = ChatCompletion(
+ id=_request_id,
+ created=int(time.time()),
+ object="chat.completion",
+ model=model_id_and_revision,
+ choices=[
+ Choice(
+ # TODO check the index
+ index=0,
+ message=ChatCompletionMessage(content=content, role="assistant"),
+ finish_reason="stop",
+ )
+ ],
+ # TODO implement function calling
+ # TODO implement usage
+ )
+ return chat_completion_result
+ async def cancellation_wrapper_stream(_request_id):
+ # Enables cancellation in an async context
+ try:
+ decode_stream = DecodeStream(inputs.tolist(), False)
+ for _chunk in stream_chat_completion(_request_id, decode_stream):
+ yield self.chunk_to_sse_element(_chunk)
+ await asyncio.sleep(0)
+ except asyncio.CancelledError:
+ self.running_continuous_batching_manager.cancel_request(_request_id)
+ logger.warning(f"Request {_request_id} was cancelled.")
+ def cancellation_wrapper_buffer(_request_id):
+ # Enables cancellation in an async context
+ try:
+ return buffer_chat_completion(_request_id)
+ except asyncio.CancelledError:
+ self.running_continuous_batching_manager.cancel_request(_request_id)
+ logger.warning(f"Request {_request_id} was cancelled.")
+ request_id = self.running_continuous_batching_manager.add_request(
+ inputs, request_id=request_id, max_new_tokens=generation_config.max_new_tokens, streaming=req.get("stream")
+ )
+ if req.get("stream"):
+ return StreamingResponse(cancellation_wrapper_stream(request_id), media_type="text/event-stream")
+ else:
+ chunk = cancellation_wrapper_buffer(request_id)
+ json_chunk = chunk.model_dump_json(exclude_none=True)
+ return JSONResponse(json_chunk, media_type="application/json")
@staticmethod
def get_model_modality(model: "PreTrainedModel") -> Modality:
@@ -953,7 +1015,7 @@ class ServeCommand(BaseTransformersCLICommand):
processor_inputs.append(parsed_message)
return processor_inputs
- def generate_chat_completion(self, req: dict) -> Generator[str, None, None]:
+ def generate_chat_completion(self, req: dict) -> StreamingResponse | JSONResponse:
"""
Generates an OpenAI Chat Completion using `generate`.
@@ -1132,7 +1194,10 @@ class ServeCommand(BaseTransformersCLICommand):
)
yield self.build_chat_completion_chunk(
- request_id=_request_id, role=None, tool_calls=[tool], model=model_id_and_revision
+ request_id=_request_id,
+ role=None,
+ tool_calls=[tool],
+ model=model_id_and_revision,
)
continue
# ====== END OF TOOL CALL LOGIC ======
@@ -1152,7 +1217,47 @@ class ServeCommand(BaseTransformersCLICommand):
finally:
thread.join()
- return stream_chat_completion(generation_streamer, request_id)
+ if req.get("stream"):
+ return StreamingResponse(
+ map(self.chunk_to_sse_element, stream_chat_completion(generation_streamer, request_id)),
+ media_type="text/event-stream",
+ )
+ else:
+ content = []
+ finish_reason = "stop"
+ generator = stream_chat_completion(generation_streamer, request_id)
+ usage = None
+ for chunk in generator:
+ choice = chunk.choices[0]
+ if getattr(choice.delta, "content", None):
+ content.append(choice.delta.content)
+ if choice.finish_reason:
+ finish_reason = choice.finish_reason
+ if getattr(chunk, "usage", None):
+ usage = chunk.usage
+ chat_completion_result = ChatCompletion(
+ id=request_id,
+ created=int(time.time()),
+ object="chat.completion",
+ model=model_id_and_revision,
+ choices=[
+ Choice(
+ # TODO check the index
+ index=0,
+ message=ChatCompletionMessage(content="".join(content), role="assistant"),
+ finish_reason=finish_reason,
+ )
+ ],
+ # TODO implement function calling
+ usage=usage,
+ )
+ result = chat_completion_result.model_dump(exclude_none=True)
+ return JSONResponse(result, media_type="application/json")
def generate_response(self, req: dict) -> Generator[str, None, None]:
"""
@@ -1263,7 +1368,7 @@ class ServeCommand(BaseTransformersCLICommand):
),
)
sequence_number += 1
- yield self.build_response_event(response_created)
+ yield self.chunk_to_sse_element(response_created)
response_in_progress = ResponseInProgressEvent(
type="response.in_progress",
@@ -1284,7 +1389,7 @@ class ServeCommand(BaseTransformersCLICommand):
),
)
sequence_number += 1
- yield self.build_response_event(response_in_progress)
+ yield self.chunk_to_sse_element(response_in_progress)
# Start the output item. Emit the assistant role to start the stream. Other chunks won't have a role,
# as it is implicit
@@ -1297,7 +1402,7 @@ class ServeCommand(BaseTransformersCLICommand):
),
)
sequence_number += 1
- yield self.build_response_event(response_output_item_added)
+ yield self.chunk_to_sse_element(response_output_item_added)
# Start the content part of the event
response_content_part_added = ResponseContentPartAddedEvent(
@@ -1309,7 +1414,7 @@ class ServeCommand(BaseTransformersCLICommand):
part=ResponseOutputText(type="output_text", text="", annotations=[]),
)
sequence_number += 1
- yield self.build_response_event(response_content_part_added)
+ yield self.chunk_to_sse_element(response_content_part_added)
# Stream the actual generated text
results = ""
@@ -1336,7 +1441,7 @@ class ServeCommand(BaseTransformersCLICommand):
logprobs=[],
)
sequence_number += 1
- yield self.build_response_event(response_output_text_delta)
+ yield self.chunk_to_sse_element(response_output_text_delta)
else:
# Normal path: emit token deltas when not filtering CoT
if result:
@@ -1350,7 +1455,7 @@ class ServeCommand(BaseTransformersCLICommand):
logprobs=[],
)
sequence_number += 1
- yield self.build_response_event(response_output_text_delta)
+ yield self.chunk_to_sse_element(response_output_text_delta)
# Signal the end of the text generation
response_output_text_done = ResponseTextDoneEvent(
@@ -1363,7 +1468,7 @@ class ServeCommand(BaseTransformersCLICommand):
logprobs=[],
)
sequence_number += 1
- yield self.build_response_event(response_output_text_done)
+ yield self.chunk_to_sse_element(response_output_text_done)
# Complete the content part
response_content_part_done = ResponseContentPartDoneEvent(
@@ -1376,7 +1481,7 @@ class ServeCommand(BaseTransformersCLICommand):
)
sequence_number += 1
content_index += 1
- yield self.build_response_event(response_content_part_done)
+ yield self.chunk_to_sse_element(response_content_part_done)
# Complete the output item
response_output_item_done = ResponseOutputItemDoneEvent(
@@ -1394,7 +1499,7 @@ class ServeCommand(BaseTransformersCLICommand):
)
sequence_number += 1
output_index += 1
- yield self.build_response_event(response_output_item_done)
+ yield self.chunk_to_sse_element(response_output_item_done)
# Finally, Complete the event
response_completed = ResponseCompletedEvent(
@@ -1416,7 +1521,7 @@ class ServeCommand(BaseTransformersCLICommand):
),
)
sequence_number += 1
- yield self.build_response_event(response_completed)
+ yield self.chunk_to_sse_element(response_completed)
thread.join()
except Exception as e:
@@ -1427,7 +1532,7 @@ class ServeCommand(BaseTransformersCLICommand):
message=str(e),
)
sequence_number += 1
- yield self.build_response_event(error_event)
+ yield self.chunk_to_sse_element(error_event)
response_failed = ResponseFailedEvent(
type="response.failed",
@@ -1452,7 +1557,7 @@ class ServeCommand(BaseTransformersCLICommand):
),
)
sequence_number += 1
- yield self.build_response_event(response_failed)
+ yield self.chunk_to_sse_element(response_failed)
finally:
thread.join()
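Net effect of the serving changes above: both chat-completion paths now return an SSE `StreamingResponse` only when the request asks for streaming, and a buffered JSON `ChatCompletion` otherwise. A hedged client-side sketch against a locally running `transformers serve` (default host/port assumed; the model id is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# stream=False: the server now buffers and returns a single ChatCompletion JSON body.
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder; any model the server can load
    messages=[{"role": "user", "content": "Say hi"}],
    stream=False,
)
print(completion.choices[0].message.content)

# stream=True: the server returns SSE chunks built by build_chat_completion_chunk.
for chunk in client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```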

src/transformers/data/datasets/__init__.py

@@ -13,4 +13,11 @@
# limitations under the License.
from .glue import GlueDataset, GlueDataTrainingArguments
+ from .language_modeling import (
+ LineByLineTextDataset,
+ LineByLineWithRefDataset,
+ LineByLineWithSOPTextDataset,
+ TextDataset,
+ TextDatasetForNextSentencePrediction,
+ )
from .squad import SquadDataset, SquadDataTrainingArguments

src/transformers/data/datasets/language_modeling.py

@@ -0,0 +1,514 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import pickle
import random
import time
import warnings
from typing import Optional
import torch
from filelock import FileLock
from torch.utils.data import Dataset
from ...tokenization_utils import PreTrainedTokenizer
from ...utils import logging
logger = logging.get_logger(__name__)
DEPRECATION_WARNING = (
"This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets "
"library. You can have a look at this example script for pointers: {0}"
)
class TextDataset(Dataset):
def __init__(
self,
tokenizer: PreTrainedTokenizer,
file_path: str,
block_size: int,
overwrite_cache=False,
cache_dir: Optional[str] = None,
):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
if os.path.isfile(file_path) is False:
raise ValueError(f"Input file path {file_path} not found")
block_size = block_size - tokenizer.num_special_tokens_to_add(pair=False)
directory, filename = os.path.split(file_path)
cached_features_file = os.path.join(
cache_dir if cache_dir is not None else directory,
f"cached_lm_{tokenizer.__class__.__name__}_{block_size}_{filename}",
)
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
lock_path = cached_features_file + ".lock"
with FileLock(lock_path):
if os.path.exists(cached_features_file) and not overwrite_cache:
start = time.time()
with open(cached_features_file, "rb") as handle:
self.examples = pickle.load(handle)
logger.info(
f"Loading features from cached file {cached_features_file} [took {time.time() - start:.3f} s]"
)
else:
logger.info(f"Creating features from dataset file at {directory}")
self.examples = []
with open(file_path, encoding="utf-8") as f:
text = f.read()
tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
for i in range(0, len(tokenized_text) - block_size + 1, block_size): # Truncate in blocks of block_size
self.examples.append(
tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])
)
# Note that we are losing the last truncated example here for the sake of simplicity (no padding)
# If your dataset is small, first you should look for a bigger one :-) and second you
# can change this behavior by adding (model specific) padding.
start = time.time()
with open(cached_features_file, "wb") as handle:
pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
logger.info(
f"Saving features into cached file {cached_features_file} [took {time.time() - start:.3f} s]"
)
def __len__(self):
return len(self.examples)
def __getitem__(self, i) -> torch.Tensor:
return torch.tensor(self.examples[i], dtype=torch.long)
class LineByLineTextDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
if os.path.isfile(file_path) is False:
raise ValueError(f"Input file path {file_path} not found")
# Here, we do not cache the features, operating under the assumption
# that we will soon use fast multithreaded tokenizers from the
# `tokenizers` repo everywhere =)
logger.info(f"Creating features from dataset file at {file_path}")
with open(file_path, encoding="utf-8") as f:
lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
self.examples = batch_encoding["input_ids"]
self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]
def __len__(self):
return len(self.examples)
def __getitem__(self, i) -> dict[str, torch.Tensor]:
return self.examples[i]
class LineByLineWithRefDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, ref_path: str):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_wwm.py"
),
FutureWarning,
)
if os.path.isfile(file_path) is False:
raise ValueError(f"Input file path {file_path} not found")
if os.path.isfile(ref_path) is False:
raise ValueError(f"Ref file path {file_path} not found")
# Here, we do not cache the features, operating under the assumption
# that we will soon use fast multithreaded tokenizers from the
# `tokenizers` repo everywhere =)
logger.info(f"Creating features from dataset file at {file_path}")
logger.info(f"Use ref segment results at {ref_path}")
with open(file_path, encoding="utf-8") as f:
data = f.readlines() # use readlines() instead of splitlines() so the '\u2029' separator does not split a line
data = [line.strip() for line in data if len(line) > 0 and not line.isspace()]
# Get ref info from file
with open(ref_path, encoding="utf-8") as f:
ref = [json.loads(line) for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
if len(data) != len(ref):
raise ValueError(
f"Length of Input file should be equal to Ref file. But the length of {file_path} is {len(data)} "
f"while length of {ref_path} is {len(ref)}"
)
batch_encoding = tokenizer(data, add_special_tokens=True, truncation=True, max_length=block_size)
self.examples = batch_encoding["input_ids"]
self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]
n = len(self.examples)
for i in range(n):
self.examples[i]["chinese_ref"] = torch.tensor(ref[i], dtype=torch.long)
def __len__(self):
return len(self.examples)
def __getitem__(self, i) -> dict[str, torch.Tensor]:
return self.examples[i]
class LineByLineWithSOPTextDataset(Dataset):
"""
Dataset for sentence order prediction task, prepare sentence pairs for SOP task
"""
def __init__(self, tokenizer: PreTrainedTokenizer, file_dir: str, block_size: int):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
if os.path.isdir(file_dir) is False:
raise ValueError(f"{file_dir} is not a directory")
logger.info(f"Creating features from dataset file folder at {file_dir}")
self.examples = []
# TODO: randomness could apply a random seed, ex. rng = random.Random(random_seed)
# file path looks like ./dataset/wiki_1, ./dataset/wiki_2
for file_name in os.listdir(file_dir):
file_path = os.path.join(file_dir, file_name)
if os.path.isfile(file_path) is False:
raise ValueError(f"{file_path} is not a file")
article_open = False
with open(file_path, encoding="utf-8") as f:
original_lines = f.readlines()
article_lines = []
for line in original_lines:
if "<doc id=" in line:
article_open = True
elif "</doc>" in line:
article_open = False
document = [
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(line))
for line in article_lines[1:]
if (len(line) > 0 and not line.isspace())
]
examples = self.create_examples_from_document(document, block_size, tokenizer)
self.examples.extend(examples)
article_lines = []
else:
if article_open:
article_lines.append(line)
logger.info("Dataset parse finished.")
def create_examples_from_document(self, document, block_size, tokenizer, short_seq_prob=0.1):
"""Creates examples for a single document."""
# Account for special tokens
max_num_tokens = block_size - tokenizer.num_special_tokens_to_add(pair=True)
# We *usually* want to fill up the entire sequence since we are padding
# to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit.
target_seq_length = max_num_tokens
if random.random() < short_seq_prob:
target_seq_length = random.randint(2, max_num_tokens)
# We DON'T just concatenate all of the tokens from a document into a long
# sequence and choose an arbitrary split point because this would make the
# next sentence prediction task too easy. Instead, we split the input into
# segments "A" and "B" based on the actual "sentences" provided by the user
# input.
examples = []
current_chunk = [] # a buffer stored current working segments
current_length = 0
i = 0
while i < len(document):
segment = document[i] # get a segment
if not segment:
i += 1
continue
current_chunk.append(segment) # add a segment to current chunk
current_length += len(segment) # overall token length
# if the current length reaches the target length or the end of the document, start building tokens a and b
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A` (first) sentence.
a_end = 1
# if the current chunk has 2 or more segments, pick a random split point for the `A` (first) sentence
if len(current_chunk) >= 2:
a_end = random.randint(1, len(current_chunk) - 1)
# token a
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
# token b
tokens_b = []
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
if len(tokens_a) == 0 or len(tokens_b) == 0:
continue
# switch tokens_a and tokens_b randomly
if random.random() < 0.5:
is_next = False
tokens_a, tokens_b = tokens_b, tokens_a
else:
is_next = True
def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens):
"""Truncates a pair of sequences to a maximum sequence length."""
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_num_tokens:
break
trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
if not (len(trunc_tokens) >= 1):
raise ValueError("Sequence length to be truncated must be no less than one")
# We want to sometimes truncate from the front and sometimes from the
# back to add more randomness and avoid biases.
if random.random() < 0.5:
del trunc_tokens[0]
else:
trunc_tokens.pop()
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens)
if not (len(tokens_a) >= 1):
raise ValueError(f"Length of sequence a is {len(tokens_a)} which must be no less than 1")
if not (len(tokens_b) >= 1):
raise ValueError(f"Length of sequence b is {len(tokens_b)} which must be no less than 1")
# add special tokens
input_ids = tokenizer.build_inputs_with_special_tokens(tokens_a, tokens_b)
# add token type ids, 0 for sentence a, 1 for sentence b
token_type_ids = tokenizer.create_token_type_ids_from_sequences(tokens_a, tokens_b)
example = {
"input_ids": torch.tensor(input_ids, dtype=torch.long),
"token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
"sentence_order_label": torch.tensor(0 if is_next else 1, dtype=torch.long),
}
examples.append(example)
current_chunk = [] # clear current chunk
current_length = 0 # reset current text length
i += 1 # go to next line
return examples
def __len__(self):
return len(self.examples)
def __getitem__(self, i) -> dict[str, torch.Tensor]:
return self.examples[i]
class TextDatasetForNextSentencePrediction(Dataset):
def __init__(
self,
tokenizer: PreTrainedTokenizer,
file_path: str,
block_size: int,
overwrite_cache=False,
short_seq_probability=0.1,
nsp_probability=0.5,
):
warnings.warn(
DEPRECATION_WARNING.format(
"https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py"
),
FutureWarning,
)
if not os.path.isfile(file_path):
raise ValueError(f"Input file path {file_path} not found")
self.short_seq_probability = short_seq_probability
self.nsp_probability = nsp_probability
directory, filename = os.path.split(file_path)
cached_features_file = os.path.join(
directory,
f"cached_nsp_{tokenizer.__class__.__name__}_{block_size}_{filename}",
)
self.tokenizer = tokenizer
# Make sure only the first process in distributed training processes the dataset,
# and the others will use the cache.
lock_path = cached_features_file + ".lock"
# Input file format:
# (1) One sentence per line. These should ideally be actual sentences, not
# entire paragraphs or arbitrary spans of text. (Because we use the
# sentence boundaries for the "next sentence prediction" task).
# (2) Blank lines between documents. Document boundaries are needed so
# that the "next sentence prediction" task doesn't span between documents.
#
# Example:
# I am very happy.
# Here is the second sentence.
#
# A new document.
with FileLock(lock_path):
if os.path.exists(cached_features_file) and not overwrite_cache:
start = time.time()
with open(cached_features_file, "rb") as handle:
self.examples = pickle.load(handle)
logger.info(
f"Loading features from cached file {cached_features_file} [took {time.time() - start:.3f} s]"
)
else:
logger.info(f"Creating features from dataset file at {directory}")
self.documents = [[]]
with open(file_path, encoding="utf-8") as f:
while True:
line = f.readline()
if not line:
break
line = line.strip()
# Empty lines are used as document delimiters
if not line and len(self.documents[-1]) != 0:
self.documents.append([])
tokens = tokenizer.tokenize(line)
tokens = tokenizer.convert_tokens_to_ids(tokens)
if tokens:
self.documents[-1].append(tokens)
logger.info(f"Creating examples from {len(self.documents)} documents.")
self.examples = []
for doc_index, document in enumerate(self.documents):
self.create_examples_from_document(document, doc_index, block_size)
start = time.time()
with open(cached_features_file, "wb") as handle:
pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
logger.info(
f"Saving features into cached file {cached_features_file} [took {time.time() - start:.3f} s]"
)
def create_examples_from_document(self, document: list[list[int]], doc_index: int, block_size: int):
"""Creates examples for a single document."""
max_num_tokens = block_size - self.tokenizer.num_special_tokens_to_add(pair=True)
# We *usually* want to fill up the entire sequence since we are padding
# to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit.
target_seq_length = max_num_tokens
if random.random() < self.short_seq_probability:
target_seq_length = random.randint(2, max_num_tokens)
current_chunk = [] # a buffer stored current working segments
current_length = 0
i = 0
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = random.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
if len(current_chunk) == 1 or random.random() < self.nsp_probability:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = random.randint(0, len(self.documents) - 1)
if random_document_index != doc_index:
break
random_document = self.documents[random_document_index]
random_start = random.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
# Actual next
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
if not (len(tokens_a) >= 1):
raise ValueError(f"Length of sequence a is {len(tokens_a)} which must be no less than 1")
if not (len(tokens_b) >= 1):
raise ValueError(f"Length of sequence b is {len(tokens_b)} which must be no less than 1")
# add special tokens
input_ids = self.tokenizer.build_inputs_with_special_tokens(tokens_a, tokens_b)
# add token type ids, 0 for sentence a, 1 for sentence b
token_type_ids = self.tokenizer.create_token_type_ids_from_sequences(tokens_a, tokens_b)
example = {
"input_ids": torch.tensor(input_ids, dtype=torch.long),
"token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
"next_sentence_label": torch.tensor(1 if is_random_next else 0, dtype=torch.long),
}
self.examples.append(example)
current_chunk = []
current_length = 0
i += 1
def __len__(self):
return len(self.examples)
def __getitem__(self, i):
return self.examples[i]
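For completeness, a hedged usage sketch of the restored classes (the file path and model name are placeholders): `LineByLineTextDataset` tokenizes one example per line, and its dict examples plug directly into `DataCollatorForLanguageModeling` for masked-LM batching.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
from transformers.data.datasets import LineByLineTextDataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Collate a few examples into a masked-LM batch (emits the FutureWarning above).
batch = collator([dataset[i] for i in range(min(8, len(dataset)))])
print(batch["input_ids"].shape, batch["labels"].shape)
```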

Some files were not shown because too many files have changed in this diff.