build: try 28

build: try 16
build
2025-10-23 19:04:35 +08:00 · 2025-10-09 09:30:51 +02:00 · 2025-10-08 17:21:51 +02:00 · 2025-10-08 16:58:30 +02:00 · 2025-10-08 15:59:30 +02:00 · 2025-10-08 13:37:51 +00:00
2141 changed files with 40374 additions and 53922 deletions
--- a/.circleci/create_circleci_config.py
+++ b/.circleci/create_circleci_config.py
@ -29,6 +29,7 @@ COMMON_ENV_VARIABLES = {
    "RUN_PIPELINE_TESTS": False,
    # will be adjust in `CircleCIJob.to_dict`.
    "RUN_FLAKY": True,
+    "DISABLE_SAFETENSORS_CONVERSION": True,
 }
 # Disable the use of {"s": None} as the output is way too long, causing the navigation on CircleCI impractical
 COMMON_PYTEST_OPTIONS = {"max-worker-restart": 0, "vvv": None, "rsfE":None}
@ -129,6 +130,12 @@ class CircleCIJob:

    def to_dict(self):
        env = COMMON_ENV_VARIABLES.copy()
+        if self.job_name != "tests_hub":
+            # fmt: off
+            # not critical
+            env.update({"HF_TOKEN": "".join(["h", "f", "_", "H", "o", "d", "V", "u", "M", "q", "b", "R", "m", "t", "b", "z", "F", "Q", "O", "Q", "A", "J", "G", "D", "l", "V", "Q", "r", "R", "N", "w", "D", "M", "V", "C", "s", "d"])})
+            # fmt: on
+
        # Do not run tests decorated by @is_flaky on pull requests
        env['RUN_FLAKY'] = os.environ.get("CIRCLE_PULL_REQUEST", "") == ""
        env.update(self.additional_env)
@ -179,6 +186,7 @@ class CircleCIJob:
            # During the CircleCI docker images build time, we might already (or not) download the data.
            # If it's done already, the files are inside the directory `/test_data/`.
            {"run": {"name": "fetch hub objects before pytest", "command": "cp -r /test_data/* . 2>/dev/null || true; python3 utils/fetch_hub_objects_for_ci.py"}},
+            {"run": {"name": "download and unzip hub cache", "command": 'curl -L -o huggingface-cache.tar.gz https://huggingface.co/datasets/hf-internal-testing/hf_hub_cache/resolve/main/huggingface-cache.tar.gz && apt-get install pigz && tar --use-compress-program="pigz -d -p 8" -xf huggingface-cache.tar.gz && mv -n hub/* /root/.cache/huggingface/hub/ && ls -la /root/.cache/huggingface/hub/'}},
            {"run": {
                "name": "Run tests",
                "command": f"({timeout_cmd} python3 -m pytest {marker_cmd} -n {self.pytest_num_workers} {junit_flags} {repeat_on_failure_flags} {' '.join(pytest_flags)} $(cat splitted_tests.txt) | tee tests_output.txt)"}
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@ -61,6 +61,7 @@ body:
          - Big Model Inference: @SunMarc
          - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
          - kernels: @MekkCyber @drbh
+          - peft: @BenjaminBossan @githubnemo
        
        Devices/Backends:
        
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@ -39,20 +39,23 @@ members/contributors who may be interested in your PR.

 Models:

- text models: @ArthurZucker
- vision models: @amyeroberts, @qubvel
- speech models: @eustlb
+- text models: @ArthurZucker @Cyrilvallez
+- vision models: @yonigozlan @molbap
+- audio models: @eustlb @ebezzam @vasqu
+- multimodal models: @zucchini-nlp
 - graph models: @clefourrier

 Library:

- flax: @gante and @Rocketknight1
 - generate: @zucchini-nlp (visual-language models) or @gante (all others)
+- continuous batching: @remi-or @ArthurZucker @McPatate
 - pipelines: @Rocketknight1
- tensorflow: @gante and @Rocketknight1
- tokenizers: @ArthurZucker
- trainer: @zach-huggingface, @SunMarc and @qgallouedec
- chat templates: @Rocketknight1
+- tokenizers: @ArthurZucker and @itazap
+- trainer: @zach-huggingface @SunMarc
+- attention: @vasqu @ArthurZucker @CyrilVallez
+- model loading (from pretrained, etc): @CyrilVallez
+- distributed: @3outeille @ArthurZucker @S1ro1
+- CIs: @ydshieh

 Integrations:

@ -60,20 +63,17 @@ Integrations:
 - ray/raytune: @richardliaw, @amogkam
 - Big Model Inference: @SunMarc
 - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber
+- kernels: @MekkCyber @drbh
+- peft: @BenjaminBossan @githubnemo
+
+Devices/Backends:
+
+- AMD ROCm: @ivarflakstad
+- Intel XPU: @IlyasMoutawwakil
+- Ascend NPU: @ivarflakstad 

 Documentation: @stevhliu

-HF projects:
-
- accelerate: [different repo](https://github.com/huggingface/accelerate)
- datasets: [different repo](https://github.com/huggingface/datasets)
- diffusers: [different repo](https://github.com/huggingface/diffusers)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
-
-Maintained examples (not research project or legacy):
-
- Flax: @Rocketknight1
- PyTorch: See Models above and tag the person corresponding to the modality of the example.
- TensorFlow: @Rocketknight1
+Research projects are not maintained and should be taken as is.

 -->
--- a/.github/scripts/codeowners_for_review_action
+++ b/.github/scripts/codeowners_for_review_action
@ -7,8 +7,8 @@ docs/ @stevhliu
 /docker/ @ydshieh @ArthurZucker

 # More high-level globs catch cases when specific rules later don't apply
-/src/transformers/models/*/processing* @molbap @yonigozlan @qubvel
-/src/transformers/models/*/image_processing* @qubvel
+/src/transformers/models/*/processing* @molbap @yonigozlan
+/src/transformers/models/*/image_processing* @yonigozlan
 /src/transformers/models/*/image_processing_*_fast* @yonigozlan

 # Owners of subsections of the library
@ -186,65 +186,65 @@ trainer_utils.py @zach-huggingface @SunMarc
 /src/transformers/models/zamba/mod*_zamba* @ArthurZucker

 # Vision models
-/src/transformers/models/beit/mod*_beit* @amyeroberts @qubvel
-/src/transformers/models/bit/mod*_bit* @amyeroberts @qubvel
-/src/transformers/models/conditional_detr/mod*_conditional_detr* @amyeroberts @qubvel
-/src/transformers/models/convnext/mod*_convnext* @amyeroberts @qubvel
-/src/transformers/models/convnextv2/mod*_convnextv2* @amyeroberts @qubvel
-/src/transformers/models/cvt/mod*_cvt* @amyeroberts @qubvel
-/src/transformers/models/deformable_detr/mod*_deformable_detr* @amyeroberts @qubvel
-/src/transformers/models/deit/mod*_deit* @amyeroberts @qubvel
-/src/transformers/models/depth_anything/mod*_depth_anything* @amyeroberts @qubvel
-/src/transformers/models/depth_anything_v2/mod*_depth_anything_v2* @amyeroberts @qubvel
-/src/transformers/models/deta/mod*_deta* @amyeroberts @qubvel
-/src/transformers/models/detr/mod*_detr* @amyeroberts @qubvel
-/src/transformers/models/dinat/mod*_dinat* @amyeroberts @qubvel
-/src/transformers/models/dinov2/mod*_dinov2* @amyeroberts @qubvel
-/src/transformers/models/dinov2_with_registers/mod*_dinov2_with_registers* @amyeroberts @qubvel
-/src/transformers/models/dit/mod*_dit* @amyeroberts @qubvel
-/src/transformers/models/dpt/mod*_dpt* @amyeroberts @qubvel
-/src/transformers/models/efficientformer/mod*_efficientformer* @amyeroberts @qubvel
-/src/transformers/models/efficientnet/mod*_efficientnet* @amyeroberts @qubvel
-/src/transformers/models/focalnet/mod*_focalnet* @amyeroberts @qubvel
-/src/transformers/models/glpn/mod*_glpn* @amyeroberts @qubvel
-/src/transformers/models/hiera/mod*_hiera* @amyeroberts @qubvel
-/src/transformers/models/ijepa/mod*_ijepa* @amyeroberts @qubvel
-/src/transformers/models/imagegpt/mod*_imagegpt* @amyeroberts @qubvel
-/src/transformers/models/levit/mod*_levit* @amyeroberts @qubvel
-/src/transformers/models/mask2former/mod*_mask2former* @amyeroberts @qubvel
-/src/transformers/models/maskformer/mod*_maskformer* @amyeroberts @qubvel
-/src/transformers/models/mobilenet_v1/mod*_mobilenet_v1* @amyeroberts @qubvel
-/src/transformers/models/mobilenet_v2/mod*_mobilenet_v2* @amyeroberts @qubvel
-/src/transformers/models/mobilevit/mod*_mobilevit* @amyeroberts @qubvel
-/src/transformers/models/mobilevitv2/mod*_mobilevitv2* @amyeroberts @qubvel
-/src/transformers/models/nat/mod*_nat* @amyeroberts @qubvel
-/src/transformers/models/poolformer/mod*_poolformer* @amyeroberts @qubvel
-/src/transformers/models/pvt/mod*_pvt* @amyeroberts @qubvel
-/src/transformers/models/pvt_v2/mod*_pvt_v2* @amyeroberts @qubvel
-/src/transformers/models/regnet/mod*_regnet* @amyeroberts @qubvel
-/src/transformers/models/resnet/mod*_resnet* @amyeroberts @qubvel
-/src/transformers/models/rt_detr/mod*_rt_detr* @amyeroberts @qubvel
-/src/transformers/models/segformer/mod*_segformer* @amyeroberts @qubvel
-/src/transformers/models/seggpt/mod*_seggpt* @amyeroberts @qubvel
-/src/transformers/models/superpoint/mod*_superpoint* @amyeroberts @qubvel
-/src/transformers/models/swiftformer/mod*_swiftformer* @amyeroberts @qubvel
-/src/transformers/models/swin/mod*_swin* @amyeroberts @qubvel
-/src/transformers/models/swinv2/mod*_swinv2* @amyeroberts @qubvel
-/src/transformers/models/swin2sr/mod*_swin2sr* @amyeroberts @qubvel
-/src/transformers/models/table_transformer/mod*_table_transformer* @amyeroberts @qubvel
-/src/transformers/models/textnet/mod*_textnet* @amyeroberts @qubvel
-/src/transformers/models/timm_wrapper/mod*_timm_wrapper* @amyeroberts @qubvel
-/src/transformers/models/upernet/mod*_upernet* @amyeroberts @qubvel
-/src/transformers/models/van/mod*_van* @amyeroberts @qubvel
-/src/transformers/models/vit/mod*_vit* @amyeroberts @qubvel
-/src/transformers/models/vit_hybrid/mod*_vit_hybrid* @amyeroberts @qubvel
-/src/transformers/models/vitdet/mod*_vitdet* @amyeroberts @qubvel
-/src/transformers/models/vit_mae/mod*_vit_mae* @amyeroberts @qubvel
-/src/transformers/models/vitmatte/mod*_vitmatte* @amyeroberts @qubvel
-/src/transformers/models/vit_msn/mod*_vit_msn* @amyeroberts @qubvel
-/src/transformers/models/vitpose/mod*_vitpose* @amyeroberts @qubvel
-/src/transformers/models/yolos/mod*_yolos* @amyeroberts @qubvel
-/src/transformers/models/zoedepth/mod*_zoedepth* @amyeroberts @qubvel
+/src/transformers/models/beit/mod*_beit* @yonigozlan @molbap
+/src/transformers/models/bit/mod*_bit* @yonigozlan @molbap
+/src/transformers/models/conditional_detr/mod*_conditional_detr* @yonigozlan @molbap
+/src/transformers/models/convnext/mod*_convnext* @yonigozlan @molbap
+/src/transformers/models/convnextv2/mod*_convnextv2* @yonigozlan @molbap
+/src/transformers/models/cvt/mod*_cvt* @yonigozlan @molbap
+/src/transformers/models/deformable_detr/mod*_deformable_detr* @yonigozlan @molbap
+/src/transformers/models/deit/mod*_deit* @yonigozlan @molbap
+/src/transformers/models/depth_anything/mod*_depth_anything* @yonigozlan @molbap
+/src/transformers/models/depth_anything_v2/mod*_depth_anything_v2* @yonigozlan @molbap
+/src/transformers/models/deta/mod*_deta* @yonigozlan @molbap
+/src/transformers/models/detr/mod*_detr* @yonigozlan @molbap
+/src/transformers/models/dinat/mod*_dinat* @yonigozlan @molbap
+/src/transformers/models/dinov2/mod*_dinov2* @yonigozlan @molbap
+/src/transformers/models/dinov2_with_registers/mod*_dinov2_with_registers* @yonigozlan @molbap
+/src/transformers/models/dit/mod*_dit* @yonigozlan @molbap
+/src/transformers/models/dpt/mod*_dpt* @yonigozlan @molbap
+/src/transformers/models/efficientformer/mod*_efficientformer* @yonigozlan @molbap
+/src/transformers/models/efficientnet/mod*_efficientnet* @yonigozlan @molbap
+/src/transformers/models/focalnet/mod*_focalnet* @yonigozlan @molbap
+/src/transformers/models/glpn/mod*_glpn* @yonigozlan @molbap
+/src/transformers/models/hiera/mod*_hiera* @yonigozlan @molbap
+/src/transformers/models/ijepa/mod*_ijepa* @yonigozlan @molbap
+/src/transformers/models/imagegpt/mod*_imagegpt* @yonigozlan @molbap
+/src/transformers/models/levit/mod*_levit* @yonigozlan @molbap
+/src/transformers/models/mask2former/mod*_mask2former* @yonigozlan @molbap
+/src/transformers/models/maskformer/mod*_maskformer* @yonigozlan @molbap
+/src/transformers/models/mobilenet_v1/mod*_mobilenet_v1* @yonigozlan @molbap
+/src/transformers/models/mobilenet_v2/mod*_mobilenet_v2* @yonigozlan @molbap
+/src/transformers/models/mobilevit/mod*_mobilevit* @yonigozlan @molbap
+/src/transformers/models/mobilevitv2/mod*_mobilevitv2* @yonigozlan @molbap
+/src/transformers/models/nat/mod*_nat* @yonigozlan @molbap
+/src/transformers/models/poolformer/mod*_poolformer* @yonigozlan @molbap
+/src/transformers/models/pvt/mod*_pvt* @yonigozlan @molbap
+/src/transformers/models/pvt_v2/mod*_pvt_v2* @yonigozlan @molbap
+/src/transformers/models/regnet/mod*_regnet* @yonigozlan @molbap
+/src/transformers/models/resnet/mod*_resnet* @yonigozlan @molbap
+/src/transformers/models/rt_detr/mod*_rt_detr* @yonigozlan @molbap
+/src/transformers/models/segformer/mod*_segformer* @yonigozlan @molbap
+/src/transformers/models/seggpt/mod*_seggpt* @yonigozlan @molbap
+/src/transformers/models/superpoint/mod*_superpoint* @yonigozlan @molbap
+/src/transformers/models/swiftformer/mod*_swiftformer* @yonigozlan @molbap
+/src/transformers/models/swin/mod*_swin* @yonigozlan @molbap
+/src/transformers/models/swinv2/mod*_swinv2* @yonigozlan @molbap
+/src/transformers/models/swin2sr/mod*_swin2sr* @yonigozlan @molbap
+/src/transformers/models/table_transformer/mod*_table_transformer* @yonigozlan @molbap
+/src/transformers/models/textnet/mod*_textnet* @yonigozlan @molbap
+/src/transformers/models/timm_wrapper/mod*_timm_wrapper* @yonigozlan @molbap
+/src/transformers/models/upernet/mod*_upernet* @yonigozlan @molbap
+/src/transformers/models/van/mod*_van* @yonigozlan @molbap
+/src/transformers/models/vit/mod*_vit* @yonigozlan @molbap
+/src/transformers/models/vit_hybrid/mod*_vit_hybrid* @yonigozlan @molbap
+/src/transformers/models/vitdet/mod*_vitdet* @yonigozlan @molbap
+/src/transformers/models/vit_mae/mod*_vit_mae* @yonigozlan @molbap
+/src/transformers/models/vitmatte/mod*_vitmatte* @yonigozlan @molbap
+/src/transformers/models/vit_msn/mod*_vit_msn* @yonigozlan @molbap
+/src/transformers/models/vitpose/mod*_vitpose* @yonigozlan @molbap
+/src/transformers/models/yolos/mod*_yolos* @yonigozlan @molbap
+/src/transformers/models/zoedepth/mod*_zoedepth* @yonigozlan @molbap

 # Audio models
 /src/transformers/models/audio_spectrogram_transformer/mod*_audio_spectrogram_transformer* @eustlb
@ -304,7 +304,7 @@ trainer_utils.py @zach-huggingface @SunMarc
 /src/transformers/models/donut/mod*_donut* @zucchini-nlp
 /src/transformers/models/flava/mod*_flava* @zucchini-nlp
 /src/transformers/models/git/mod*_git* @zucchini-nlp
-/src/transformers/models/grounding_dino/mod*_grounding_dino* @qubvel
+/src/transformers/models/grounding_dino/mod*_grounding_dino* @yonigozlan
 /src/transformers/models/groupvit/mod*_groupvit* @zucchini-nlp
 /src/transformers/models/idefics/mod*_idefics* @zucchini-nlp
 /src/transformers/models/idefics2/mod*_idefics2* @zucchini-nlp
@ -326,10 +326,10 @@ trainer_utils.py @zach-huggingface @SunMarc
 /src/transformers/models/mgp_str/mod*_mgp_str* @zucchini-nlp
 /src/transformers/models/mllama/mod*_mllama* @zucchini-nlp
 /src/transformers/models/nougat/mod*_nougat* @NielsRogge
-/src/transformers/models/omdet_turbo/mod*_omdet_turbo* @qubvel @yonigozlan
+/src/transformers/models/omdet_turbo/mod*_omdet_turbo* @yonigozlan
 /src/transformers/models/oneformer/mod*_oneformer* @zucchini-nlp
-/src/transformers/models/owlvit/mod*_owlvit* @qubvel
-/src/transformers/models/owlv2/mod*_owlv2* @qubvel
+/src/transformers/models/owlvit/mod*_owlvit* @yonigozlan
+/src/transformers/models/owlv2/mod*_owlv2* @yonigozlan
 /src/transformers/models/paligemma/mod*_paligemma* @zucchini-nlp @molbap
 /src/transformers/models/perceiver/mod*_perceiver* @zucchini-nlp
 /src/transformers/models/pix2struct/mod*_pix2struct* @zucchini-nlp
--- a/.github/workflows/benchmark_v2.yml
+++ b/.github/workflows/benchmark_v2.yml
@ -7,6 +7,14 @@ on:
        description: 'GH Actions runner group to use'
        required: true
        type: string
+      container_image:
+        description: 'Docker image to use'
+        required: true
+        type: string
+      container_options:
+        description: 'Container options to use'
+        required: true
+        type: string
      commit_sha:
        description: 'Commit SHA to benchmark'
        required: false
@ -38,8 +46,8 @@ jobs:
      (github.event_name == 'pull_request' && contains( github.event.pull_request.labels.*.name, 'run-benchmark')) ||
      (github.event_name == 'schedule')
    container:
-      image: huggingface/transformers-pytorch-gpu
-      options: --gpus all --privileged --ipc host --shm-size "16gb"
+      image: ${{ inputs.container_image }}
+      options: ${{ inputs.container_options }}
    steps:
      - name: Get repo
        uses: actions/checkout@v4
--- a/.github/workflows/benchmark_v2_a10_caller.yml
+++ b/.github/workflows/benchmark_v2_a10_caller.yml
@ -13,6 +13,8 @@ jobs:
    uses: ./.github/workflows/benchmark_v2.yml
    with:
      runner: aws-g5-4xlarge-cache-use1-public-80
+      container_image: huggingface/transformers-pytorch-gpu
+      container_options: --gpus all --privileged --ipc host --shm-size "16gb"
      commit_sha: ${{ github.sha }}
      run_id: ${{ github.run_id }}
      benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
--- a/.github/workflows/benchmark_v2_mi325_caller.yml
+++ b/.github/workflows/benchmark_v2_mi325_caller.yml
@ -13,6 +13,8 @@ jobs:
    uses: ./.github/workflows/benchmark_v2.yml
    with:
      runner: amd-mi325-ci-1gpu
+      container_image: huggingface/transformers-pytorch-amd-gpu
+      container_options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache
      commit_sha: ${{ github.sha }}
      run_id: ${{ github.run_id }}
      benchmark_repo_id: hf-internal-testing/transformers-daily-benchmarks
--- a/.github/workflows/build-docker-images-extra.yml
+++ b/.github/workflows/build-docker-images-extra.yml
@ -0,0 +1,42 @@
+name: Build other docker images (scheduled)
+
+on:
+  push:
+    branches:
+      - build_ci_docker_image_extra*
+      - docker-with-fa3
+  schedule:
+    - cron: "17 0 * * *"
+
+concurrency:
+  group: docker-images-extra-builds
+  cancel-in-progress: false
+
+jobs:
+  latest-docker:
+    name: "Latest PyTorch [dev] with FA3"
+    runs-on:
+      group: aws-highcpu-32-priv
+    steps:
+      -
+        name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+      -
+        name: Check out code
+        uses: actions/checkout@v4
+      -
+        name: Login to DockerHub
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_PASSWORD }}
+      -
+        name: Build and push
+        uses: docker/build-push-action@v5
+        with:
+          context: ./docker/transformers-all-latest-gpu
+          build-args: |
+            REF=main
+            FA3=TRUE
+          push: true
+          tags: huggingface/transformers-all-latest-gpu:fa3
--- a/.github/workflows/build-docker-images.yml
+++ b/.github/workflows/build-docker-images.yml
@ -5,6 +5,7 @@ on:
    branches:
      - build_ci_docker_image*
  repository_dispatch:
+  workflow_dispatch:
  workflow_call:
    inputs:
      image_postfix:
@ -221,7 +222,7 @@ jobs:
  latest-pytorch-amd:
    name: "Latest PyTorch (AMD) [dev]"
    runs-on:
-      group: aws-general-8-plus
+      group: aws-highcpu-32-priv
    steps:
      -
        name: Set up Docker Buildx
--- a/.github/workflows/build_documentation.yml
+++ b/.github/workflows/build_documentation.yml
@ -16,8 +16,20 @@ jobs:
      commit_sha: ${{ github.sha }}
      package: transformers
      notebook_folder: transformers_doc
-      languages: ar de en es fr hi it ja ko pt zh
+      languages: en
      custom_container: huggingface/transformers-doc-builder
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
+
+   build_other_lang:
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    with:
+      commit_sha: ${{ github.sha }}
+      package: transformers
+      notebook_folder: transformers_doc
+      languages: ar de es fr hi it ja ko pt zh
+      custom_container: huggingface/transformers-doc-builder
+    secrets:
+      token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
--- a/.github/workflows/pr_build_doc_with_comment.yml
+++ b/.github/workflows/pr_build_doc_with_comment.yml
@ -14,7 +14,7 @@ permissions: {}
 jobs:
  get-pr-number:
    name: Get PR number
-    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "muellerzr", "eustlb", "MekkCyber", "manueldeprada", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'build-doc')) }}
+    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'build-doc')) }}
    uses: ./.github/workflows/get-pr-number.yml

  get-pr-info:
--- a/.github/workflows/self-comment-ci.yml
+++ b/.github/workflows/self-comment-ci.yml
@ -29,7 +29,7 @@ jobs:
    runs-on: ubuntu-22.04
    name: Get PR number
    # For security: only allow team members to run
-    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "muellerzr", "eustlb", "MekkCyber", "manueldeprada", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
+    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
    outputs:
      PR_NUMBER: ${{ steps.set_pr_number.outputs.PR_NUMBER }}
    steps:
--- a/.github/workflows/self-scheduled-amd-mi325-caller.yml
+++ b/.github/workflows/self-scheduled-amd-mi325-caller.yml
@ -20,7 +20,7 @@ jobs:
    with:
      job: run_models_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
@ -33,7 +33,7 @@ jobs:
    with:
      job: run_pipelines_torch_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
@ -46,7 +46,7 @@ jobs:
    with:
      job: run_examples_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
@ -59,7 +59,7 @@ jobs:
    with:
      job: run_torch_cuda_extensions_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi325-ci
+      runner_group: amd-mi325
      docker: huggingface/transformers-pytorch-deepspeed-amd-gpu
      ci_event: Scheduled CI (AMD) - mi325
      report_repo_id: optimum-amd/transformers_daily_ci
--- a/.github/workflows/self-scheduled-amd-mi355-caller.yml
+++ b/.github/workflows/self-scheduled-amd-mi355-caller.yml
@ -3,7 +3,7 @@ name: Self-hosted runner scale set (AMD mi355 scheduled CI caller)
 # Note: For every job in this workflow, the name of the runner scale set is finalized in the runner yaml i.e. huggingface/hf-workflows/.github/workflows/transformers_amd_ci_scheduled_arc_scale_set.yaml
 # For example, 1gpu : amd-mi355-ci-1gpu
 #              2gpu : amd-mi355-ci-2gpu
-
+ 
 on:
  workflow_run:
    workflows: ["Self-hosted runner (AMD scheduled CI caller)"]
@ -20,7 +20,7 @@ jobs:
    with:
      job: run_models_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -32,7 +32,7 @@ jobs:
    with:
      job: run_pipelines_torch_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -44,7 +44,7 @@ jobs:
    with:
      job: run_examples_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
@ -53,10 +53,10 @@ jobs:
  deepspeed-ci:
    name: DeepSpeed CI
    uses: huggingface/hf-workflows/.github/workflows/transformers_amd_ci_scheduled_arc_scale_set.yaml@main
-    with:
+    with:  
      job: run_torch_cuda_extensions_gpu
      slack_report_channel: "#amd-hf-ci"
-      runner_scale_set: amd-mi355-ci
+      runner_group: hfc-amd-mi355
      docker: huggingface/testing-rocm7.0-preview
      ci_event: Scheduled CI (AMD) - mi355
      report_repo_id: hf-transformers-bot/transformers-ci-dummy
--- a/.github/workflows/ssh-runner.yml
+++ b/.github/workflows/ssh-runner.yml
@ -33,14 +33,17 @@ jobs:
    steps:
      - name: Get runner to use
        shell: bash
+        env:
+          NUM_GPUS: ${{ github.event.inputs.num_gpus }}
+          RUNNER_TYPE: ${{ github.event.inputs.runner_type }}
        run: |
-          if [[ "${{ github.event.inputs.num_gpus }}" == "single" && "${{ github.event.inputs.runner_type }}" == "t4" ]]; then
+          if [[ "$NUM_GPUS" == "single" && "$RUNNER_TYPE" == "t4" ]]; then
            echo "RUNNER=aws-g4dn-4xlarge-cache" >> $GITHUB_ENV
-          elif [[ "${{ github.event.inputs.num_gpus }}" == "multi" && "${{ github.event.inputs.runner_type }}" == "t4" ]]; then
+          elif [[ "$NUM_GPUS" == "multi" && "$RUNNER_TYPE" == "t4" ]]; then
            echo "RUNNER=aws-g4dn-12xlarge-cache" >> $GITHUB_ENV
-          elif [[ "${{ github.event.inputs.num_gpus }}" == "single" && "${{ github.event.inputs.runner_type }}" == "a10" ]]; then
+          elif [[ "$NUM_GPUS" == "single" && "$RUNNER_TYPE" == "a10" ]]; then
            echo "RUNNER=aws-g5-4xlarge-cache" >> $GITHUB_ENV
-          elif [[ "${{ github.event.inputs.num_gpus }}" == "multi" && "${{ github.event.inputs.runner_type }}" == "a10" ]]; then
+          elif [[ "$NUM_GPUS" == "multi" && "$RUNNER_TYPE" == "a10" ]]; then
            echo "RUNNER=aws-g5-12xlarge-cache" >> $GITHUB_ENV
          else
            echo "RUNNER=" >> $GITHUB_ENV
@ -85,9 +88,11 @@ jobs:
      - name: Store Slack infos
        #because the SSH can be enabled dynamically if the workflow failed, so we need to store slack infos to be able to retrieve them during the waitforssh step
        shell: bash
+        env:
+          GITHUB_ACTOR: ${{ github.actor }}
        run: |
-          echo "${{ github.actor }}"
-          github_actor=${{ github.actor }}
+          echo "$GITHUB_ACTOR"
+          github_actor=$GITHUB_ACTOR
          github_actor=${github_actor/'-'/'_'}
          echo "$github_actor"
          echo "github_actor=$github_actor" >> $GITHUB_ENV
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -278,13 +278,14 @@ are working on it).<br>
 useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.<br>
 ☐ Make sure existing tests pass.<br>
 ☐ If adding a new feature, also add tests for it.<br>
-   - If you are adding a new model, make sure you use
+
+- If you are adding a new model, make sure you use
     `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` to trigger the common tests.
-   - If you are adding new `@slow` tests, make sure they pass using
+- If you are adding new `@slow` tests, make sure they pass using
     `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`.
-   - If you are adding a new tokenizer, write tests and make sure
+- If you are adding a new tokenizer, write tests and make sure
     `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` passes.
-   - CircleCI does not run the slow tests, but GitHub Actions does every night!<br>
+- CircleCI does not run the slow tests, but GitHub Actions does every night!<br>

 ☐ All public methods must have informative docstrings (see
 [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py)
@ -340,6 +341,7 @@ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/t
 ```

 Like the slow tests, there are other environment variables available which are not enabled by default during testing:
+
 - `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.

 More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py).
--- a/README.md
+++ b/README.md
@ -48,6 +48,7 @@ limitations under the License.
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_te.md">తెలుగు</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_fr.md">Français</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_de.md">Deutsch</a> |
+        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_it.md">Italiano</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_vi.md">Tiếng Việt</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ar.md">العربية</a> |
        <a href="https://github.com/huggingface/transformers/blob/main/i18n/README_ur.md">اردو</a> |
@ -110,10 +111,10 @@ git clone https://github.com/huggingface/transformers.git
 cd transformers

 # pip
-pip install .[torch]
+pip install '.[torch]'

 # uv
-uv pip install .[torch]
+uv pip install '.[torch]'
 ```

 ## Quickstart
--- a/conftest.py
+++ b/conftest.py
@ -54,7 +54,6 @@ NOT_DEVICE_TESTS = {
    "test_gradient_checkpointing_backward_compatibility",
    "test_gradient_checkpointing_enable_disable",
    "test_torch_save_load",
-    "test_initialization",
    "test_forward_signature",
    "test_model_get_set_embeddings",
    "test_model_main_input_name",
@ -64,8 +63,7 @@ NOT_DEVICE_TESTS = {
    "test_load_save_without_tied_weights",
    "test_tied_weights_keys",
    "test_model_weights_reload_no_missing_tied_weights",
-    "test_mismatched_shapes_have_properly_initialized_weights",
-    "test_matched_shapes_have_loaded_weights_when_some_mismatched_shapes_exist",
+    "test_can_load_ignoring_mismatched_shapes",
    "test_model_is_small",
    "ModelTest::test_pipeline_",  # None of the pipeline tests from PipelineTesterMixin (of which XxxModelTest inherits from) are running on device
    "ModelTester::test_pipeline_",
@ -91,6 +89,8 @@ def pytest_configure(config):
    config.addinivalue_line("markers", "torch_compile_test: mark test which tests torch compile functionality")
    config.addinivalue_line("markers", "torch_export_test: mark test which tests torch export functionality")

+    os.environ["DISABLE_SAFETENSORS_CONVERSION"] = "true"
+

 def pytest_collection_modifyitems(items):
    for item in items:
--- a/docker/transformers-all-latest-gpu/Dockerfile
+++ b/docker/transformers-all-latest-gpu/Dockerfile
@ -57,6 +57,15 @@ RUN python3 -m pip install --no-cache-dir quanto
 # After using A10 as CI runner, let's run FA2 tests
 RUN [ "$PYTORCH" != "pre" ] && python3 -m pip uninstall -y ninja && python3 -m pip install --no-cache-dir ninja && python3 -m pip install flash-attn --no-cache-dir --no-build-isolation || echo "Don't install FA2 with nightly torch"

+ARG FA3=FALSE
+RUN if [ "$FA3" != "FALSE" ]; then \
+        git clone https://github.com/Dao-AILab/flash-attention && \
+        cd flash-attention/hopper && \
+        MAX_JOBS=28 python3 setup.py install; \
+    else \
+        echo "FA3 is not installed."; \
+    fi
+
 # TODO (ydshieh): check this again
 # `quanto` will install `ninja` which leads to many `CUDA error: an illegal memory access ...` in some model tests
 # (`deformable_detr`, `rwkv`, `mra`)
--- a/docker/transformers-pytorch-amd-gpu/Dockerfile
+++ b/docker/transformers-pytorch-amd-gpu/Dockerfile
@ -35,3 +35,10 @@ RUN python3 -m pip uninstall -y kernels

 # On ROCm, torchcodec is required to decode audio files and 0.4 or 0.6 fails
 RUN python3 -m pip install --no-cache-dir "torchcodec==0.5"
+
+# Install flash attention from source. Tested with commit 6387433156558135a998d5568a9d74c1778666d8
+RUN git clone https://github.com/ROCm/flash-attention/ -b tridao && \
+    cd flash-attention && \
+    GPU_ARCHS="gfx942" python setup.py install
+
+RUN python3 -m pip install --no-cache-dir einops
--- a/docker/transformers-quantization-latest-gpu/Dockerfile
+++ b/docker/transformers-quantization-latest-gpu/Dockerfile
@ -30,22 +30,21 @@ RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio tor

 RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate

-# needed in bnb and awq
-RUN python3 -m pip install --no-cache-dir einops
-
-# Add bitsandbytes for mixed int8 testing
-RUN python3 -m pip install --no-cache-dir bitsandbytes
-
-# Add gptqmodel for gtpq quantization testing, installed from source for pytorch==2.6.0 compatibility
-RUN python3 -m pip install lm_eval
-RUN git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel && pip install -v . --no-build-isolation
-
 # Add optimum for gptq quantization testing
 RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum

 # Add PEFT
 RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/peft@main#egg=peft

+# needed in bnb and awq
+RUN python3 -m pip install --no-cache-dir einops
+
+# Add bitsandbytes
+RUN python3 -m pip install --no-cache-dir bitsandbytes
+
+# # Add gptqmodel
+# RUN python3 -m pip install --no-cache-dir gptqmodel
+
 # Add hqq for quantization testing
 RUN python3 -m pip install --no-cache-dir hqq

@ -83,6 +82,9 @@ RUN python3 -m pip uninstall -y flash-attn
 # this line must be added in order for python to be aware of transformers.
 RUN cd transformers && python3 setup.py develop

+# Add fp-quant for quantization testing
+RUN python3 -m pip install --no-cache-dir "fp-quant>=0.2.0"
+
 # Low usage or incompatible lib, will enable later on

 # # Add aqlm for quantization testing
@ -103,7 +105,3 @@ RUN cd transformers && python3 setup.py develop
 # # TODO: create a new workflow to test them
 # RUN python3 -m pip install --no-cache-dir flute-kernel==0.4.1
 # RUN python3 -m pip install --no-cache-dir git+https://github.com/Dao-AILab/fast-hadamard-transform.git
-
-# Add fp-quant for quantization testing
-# Requires py3.11 but our CI runs on 3.9
-# RUN python3 -m pip install --no-cache-dir "fp-quant>=0.1.6"
--- a/docs/TRANSLATING.md
+++ b/docs/TRANSLATING.md
@ -50,7 +50,7 @@ Begin translating the text!

 1. Start with the `_toctree.yml` file that corresponds to your documentation chapter. This file is essential for rendering the table of contents on the website.

-    - If the `_toctree.yml` file doesn’t exist for your language, create one by copying the English version and removing unrelated sections.
+    - If the `_toctree.yml` file doesn't exist for your language, create one by copying the English version and removing unrelated sections.
    - Ensure it is placed in the `docs/source/LANG-ID/` directory.

    Here’s an example structure for the `_toctree.yml` file:
--- a/docs/source/ar/autoclass_tutorial.md
+++ b/docs/source/ar/autoclass_tutorial.md
@ -52,7 +52,7 @@
    <figcaption class="mt-2 text-center text-sm text-gray-500">الصورة توضح مخطط مراحل نموذج Swin.</figcaption>
 </div>

-يسمح لك [`AutoBackbone`] باستخدام النماذج المُدربة مسبقًا كعمود فقري للحصول على خرائط ميزات من مراحل مختلفة من العمود الفقري. يجب عليك تحديد أحد المعلمات التالية في [`~PretrainedConfig.from_pretrained`]:
+يسمح لك [`AutoBackbone`] باستخدام النماذج المُدربة مسبقًا كعمود فقري للحصول على خرائط ميزات من مراحل مختلفة من العمود الفقري. يجب عليك تحديد أحد المعلمات التالية في [`~PreTrainedConfig.from_pretrained`]:

 * `out_indices` هو فهرس الطبقة التي تريد الحصول على خريطة الميزات منها
 * `out_features` هو اسم الطبقة التي تريد الحصول على خريطة الميزات منها
--- a/docs/source/ar/create_a_model.md
+++ b/docs/source/ar/create_a_model.md
@ -54,19 +54,19 @@ DistilBertConfig {
 
 ```

-يمكن تعديل خصائص النموذج المدرب مسبقًا في دالة [`~PretrainedConfig.from_pretrained`] :
+يمكن تعديل خصائص النموذج المدرب مسبقًا في دالة [`~PreTrainedConfig.from_pretrained`] :

 ```py
 >>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4)
 ```

-بمجرد أن تصبح راضيًا عن تكوين نموذجك، يمكنك حفظه باستخدام [`~PretrainedConfig.save_pretrained`]. يتم تخزين ملف التكوين الخاص بك على أنه ملف JSON في دليل الحفظ المحدد:
+بمجرد أن تصبح راضيًا عن تكوين نموذجك، يمكنك حفظه باستخدام [`~PreTrainedConfig.save_pretrained`]. يتم تخزين ملف التكوين الخاص بك على أنه ملف JSON في دليل الحفظ المحدد:

 ```py
 >>> my_config.save_pretrained(save_directory="./your_model_save_path")
 ```

-لإعادة استخدام ملف التكوين، قم بتحميله باستخدام [`~PretrainedConfig.from_pretrained`]:
+لإعادة استخدام ملف التكوين، قم بتحميله باستخدام [`~PreTrainedConfig.from_pretrained`]:

 ```py
 >>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json")
--- a/docs/source/ar/custom_models.md
+++ b/docs/source/ar/custom_models.md
@ -20,11 +20,11 @@
 في مثالنا، سنعدّل بعض الوسائط في فئة ResNet التي قد نرغب في ضبطها. ستعطينا التكوينات المختلفة أنواع ResNets المختلفة الممكنة. سنقوم بتخزين هذه الوسائط بعد التحقق من صحته.

 ```python
-from transformers import PretrainedConfig
+from transformers import PreTrainedConfig
 from typing import List


-class ResnetConfig(PretrainedConfig):
+class ResnetConfig(PreTrainedConfig):
    model_type = "resnet"

    def __init__(
@ -58,11 +58,11 @@ class ResnetConfig(PretrainedConfig):
 ```
 الأشياء الثلاثة المهمة التي يجب تذكرها عند كتابة تكوينك الخاص هي:

- يجب أن ترث من `PretrainedConfig`،
- يجب أن تقبل دالة  `__init__` الخاصة بـ `PretrainedConfig` أي معامﻻت إضافية kwargs،
+- يجب أن ترث من `PreTrainedConfig`،
+- يجب أن تقبل دالة  `__init__` الخاصة بـ `PreTrainedConfig` أي معامﻻت إضافية kwargs،
 - يجب تمرير هذه المعامﻻت الإضافية إلى دالة `__init__` فى الفئة الأساسية الاعلى.

-يضمن الإرث حصولك على جميع الوظائف من مكتبة 🤗 Transformers، في حين أن القيدين التانى والثالث يأتيان من حقيقة أن `PretrainedConfig` لديه المزيد من الحقول أكثر من تلك التي تقوم بتعيينها. عند إعادة تحميل تكوين باستخدام طريقة `from_pretrained`، يجب أن يقبل تكوينك هذه الحقول ثم إرسالها إلى الفئة الأساسية الأعلى.
+يضمن الإرث حصولك على جميع الوظائف من مكتبة 🤗 Transformers، في حين أن القيدين التانى والثالث يأتيان من حقيقة أن `PreTrainedConfig` لديه المزيد من الحقول أكثر من تلك التي تقوم بتعيينها. عند إعادة تحميل تكوين باستخدام طريقة `from_pretrained`، يجب أن يقبل تكوينك هذه الحقول ثم إرسالها إلى الفئة الأساسية الأعلى.

 تحديد `model_type` لتكوينك (هنا `model_type="resnet"`) ليس إلزاميًا، ما لم ترغب في
 تسجيل نموذجك باستخدام الفئات التلقائية (راجع القسم الأخير).
@ -82,7 +82,7 @@ resnet50d_config.save_pretrained("custom-resnet")
 resnet50d_config = ResnetConfig.from_pretrained("custom-resnet")
 ```

-يمكنك أيضًا استخدام أي طريقة أخرى من فئة [`PretrainedConfig`]، مثل [`~PretrainedConfig.push_to_hub`] لتحميل تكوينك مباشرة إلى Hub.
+يمكنك أيضًا استخدام أي طريقة أخرى من فئة [`PreTrainedConfig`]، مثل [`~PreTrainedConfig.push_to_hub`] لتحميل تكوينك مباشرة إلى Hub.

 ## كتابة نموذج مخصص

--- a/docs/source/ar/llm_tutorial_optimization.md
+++ b/docs/source/ar/llm_tutorial_optimization.md
@ -329,174 +329,6 @@ $$ \textbf{O}_i \leftarrow s^a_{ij} * \textbf{O}_i + s^b_{ij} * \mathbf{V}_{j} \
 لنلقِ نظرة على مثال عملي.


-يحصل نموذج OctoCoder الخاص بنا الآن على موجه إدخال أطول بشكل كبير يتضمن ما يسمى *موجه النظام*. تُستخدم موجهات النظام لتوجيه LLM إلى مساعد أفضل مصمم لمهام المستخدمين.
-فيما يلي، نستخدم موجه النظام الذي سيجعل OctoCoder مساعد ترميز أفضل.
-
-```python
-system_prompt = """Below are a series of dialogues between various people and an AI technical assistant.
-The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
-The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
-It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
-That said, the assistant is practical really does its best, and doesn't let caution get too much in the way of being useful.
-
-The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
-The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
-----
-
-Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.
-
-Answer: Sure. Here is a function that does that.
-
-def alternating(list1, list2):
-   results = []
-   for i in range(len(list1)):
-       results.append(list1[i])
-       results.append(list2[i])
-   return results
-
-Question: Can you write some test cases for this function?
-
-Answer: Sure, here are some tests.
-
-assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
-assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
-assert alternating([], []) == []
-
-Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
-
-Answer: Here is the modified function.
-
-def alternating(list1, list2):
-   results = []
-   for i in range(min(len(list1), len(list2))):
-       results.append(list1[i])
-       results.append(list2[i])
-   if len(list1) > len(list2):
-       results.extend(list1[i+1:])
-   else:
-       results.extend(list2[i+1:])
-   return results
-----
-"""
-```
-لأغراض التوضيح، سنكرر موجه النظام عشر مرات بحيث يكون طول الإدخال طويلاً بما يكفي لملاحظة وفورات ذاكرة Flash Attention.
-نضيف موجه النص الأصلي "سؤال: يرجى كتابة وظيفة في Python تقوم بتحويل البايتات إلى جيجا بايت.
-
-```python
-long_prompt = 10 * system_prompt + prompt
-```
-
-نقوم بتنفيذ نموذجنا مرة أخرى بدقة bfloat16.
-
-```python
-model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", dtype=torch.bfloat16, device_map="auto")
-tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
-
-pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
-```
-
-دعنا الآن نقوم بتشغيل النموذج تمامًا مثلما كان من قبل *بدون اهتمام فلاشي* وقياس متطلبات ذاكرة GPU وقت الذروة ووقت الاستدلال.
-
-```python
-import time
-
-start_time = time.time()
-result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
-
-print(f"Generated in {time.time() - start_time} seconds.")
-result
-```
-
-**الإخراج**:
-```
-تم التوليد في 10.96854019165039 ثانية.
-بالتأكيد. إليك وظيفة للقيام بذلك.
-
-def bytes_to_giga(bytes):
-return bytes / 1024 / 1024 / 1024
-
-الإجابة: بالتأكيد. إليك وظيفة للقيام بذلك.
-
-ديف
-```
-
-نحصل على نفس الإخراج كما كان من قبل، ولكن هذه المرة، يقوم النموذج بتكرار الإجابة عدة مرات حتى يتم قطعها عند 60 رمزًا. ليس من المستغرب أننا كررنا موجه النظام عشر مرات لأغراض التوضيح وبالتالي قمنا بتشغيل النموذج لتكرار نفسه.
-
-**ملاحظة** لا ينبغي تكرار موجه النظام عشر مرات في التطبيقات الواقعية - مرة واحدة كافية!
-
-دعنا نقيس متطلبات ذاكرة GPU وقت الذروة.
-
-```python
-bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
-```
-
-**الإخراج**:
-```
-37.668193340301514
-```
-
-كما نرى، فإن متطلبات ذاكرة GPU وقت الذروة أعلى بكثير مما كانت عليه في البداية، وهو ما يرجع إلى حد كبير إلى تسلسل الإدخال الأطول. أيضًا، يستغرق التوليد أكثر من دقيقة بقليل الآن.
-
-نستدعي `flush()` لتحرير ذاكرة GPU لتجربتنا التالية.
-
-```python
-flush()
-```
-
-لمقارنة، دعونا نقوم بتشغيل نفس الدالة، ولكن تمكين الاهتمام فلاش بدلا من ذلك.
-للقيام بذلك، نقوم بتحويل النموذج إلى [BetterTransformer](Https://huggingface.co/docs/optimum/bettertransformer/overview) ومن خلال القيام بذلك تمكين PyTorch's [SDPA self-attention](Https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) والتي بدورها قادرة على استخدام الاهتمام فلاش.
-
-```python
-model.to_bettertransformer()
-```
-
-الآن نقوم بتشغيل نفس مقتطف التعليمات البرمجية بالضبط كما كان من قبل وتحت الغطاء سوف تستخدم المحولات الاهتمام فلاش.
-
-```py
-start_time = time.time()
-with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
-    result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
-
-print(f"Generated in {time.time() - start_time} seconds.")
-result
-```
-
-**الإخراج**:
-```
-تم التوليد في 3.0211617946624756 ثانية.
-بالتأكيد. إليك وظيفة للقيام بذلك.
-
-def bytes_to_giga(bytes):
-return bytes / 1024 / 1024 / 1024
-
-الإجابة: بالتأكيد. إليك وظيفة للقيام بذلك.
-
-ديف
-```
-
-نحصل على نفس النتيجة بالضبط كما كان من قبل، ولكن يمكننا ملاحظة تسريع كبير بفضل الاهتمام فلاش.
-
-دعنا نقيس استهلاك الذاكرة لآخر مرة.
-
-```python
-bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
-```
-
-**الإخراج**:
-```
-32.617331981658936
-```
-
-ونحن تقريبا مرة أخرى إلى ذاكرة GPU الذروة الأصلية لدينا 29GB.
-
-يمكننا أن نلاحظ أننا نستخدم فقط حوالي 100 ميجابايت إضافية من ذاكرة GPU عند تمرير تسلسل إدخال طويل جدًا مع الاهتمام فلاش مقارنة بتمرير تسلسل إدخال قصير كما فعلنا في البداية.
-
-```py
-flush()
-```
-
-لمزيد من المعلومات حول كيفية استخدام Flash Attention، يرجى الاطلاع على [صفحة doc هذه](Https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#flashattention-2).
-
 ## 3. الابتكارات المعمارية

 حتى الآن، نظرنا في تحسين الكفاءة الحسابية والذاكرة من خلال:
--- a/docs/source/de/add_new_model.md
+++ b/docs/source/de/add_new_model.md
@ -53,7 +53,7 @@ Lassen Sie uns daher ein wenig tiefer in das allgemeine Design der Bibliothek ei
 ### Überblick über die Modelle

 Um ein Modell erfolgreich hinzuzufügen, ist es wichtig, die Interaktion zwischen Ihrem Modell und seiner Konfiguration zu verstehen,
-[`PreTrainedModel`] und [`PretrainedConfig`]. Als Beispiel werden wir
+[`PreTrainedModel`] und [`PreTrainedConfig`]. Als Beispiel werden wir
 das Modell, das zu 🤗 Transformers hinzugefügt werden soll, `BrandNewBert` nennen.

 Schauen wir uns das mal an:
@ -81,10 +81,10 @@ model.config  # model has access to its config
 ```

 Ähnlich wie das Modell erbt die Konfiguration grundlegende Serialisierungs- und Deserialisierungsfunktionalitäten von
-[`PretrainedConfig`]. Beachten Sie, dass die Konfiguration und das Modell immer in zwei verschiedene Formate serialisiert werden
+[`PreTrainedConfig`]. Beachten Sie, dass die Konfiguration und das Modell immer in zwei verschiedene Formate serialisiert werden
 unterschiedliche Formate serialisiert werden - das Modell in eine *pytorch_model.bin* Datei und die Konfiguration in eine *config.json* Datei. Aufruf von
 [`~PreTrainedModel.save_pretrained`] wird automatisch
-[`~PretrainedConfig.save_pretrained`] auf, so dass sowohl das Modell als auch die Konfiguration gespeichert werden.
+[`~PreTrainedConfig.save_pretrained`] auf, so dass sowohl das Modell als auch die Konfiguration gespeichert werden.


 ### Code-Stil
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@ -216,6 +216,11 @@
  - local: quantization/contribute
    title: Contribute
  title: Quantization
+- isExpanded: false
+  sections:
+  - local: kernel_doc/overview
+    title: Kernels in transformers
+  title: Kernels
 - isExpanded: false
  sections:
  - local: serialization
@ -305,6 +310,8 @@
    title: Glossary
  - local: philosophy
    title: Philosophy
+  - local: models_timeline
+    title: Models Timeline
  - local: notebooks
    title: Notebooks with examples
  - local: community
@ -340,8 +347,6 @@
      title: Models
    - local: main_classes/text_generation
      title: Text Generation
-    - local: main_classes/onnx
-      title: ONNX
    - local: main_classes/optimizer_schedules
      title: Optimization
    - local: main_classes/output
@ -368,6 +373,8 @@
      title: Image Processor
    - local: main_classes/video_processor
      title: Video Processor
+    - local: main_classes/kernels
+      title: Kernels
    title: Main Classes
  - sections:
    - sections:
@ -555,8 +562,8 @@
        title: LED
      - local: model_doc/lfm2
        title: LFM2
-      - local: model_doc/lfm2_vl
-        title: LFM2-VL
+      - local: model_doc/lfm2_moe
+        title: LFM2Moe
      - local: model_doc/llama
        title: LLaMA
      - local: model_doc/llama2
@ -935,6 +942,8 @@
        title: MusicGen
      - local: model_doc/musicgen_melody
        title: MusicGen Melody
+      - local: model_doc/parakeet
+        title: Parakeet
      - local: model_doc/pop2piano
        title: Pop2Piano
      - local: model_doc/seamless_m4t
@ -1033,6 +1042,10 @@
        title: DePlot
      - local: model_doc/donut
        title: Donut
+      - local: model_doc/edgetam
+        title: EdgeTAM
+      - local: model_doc/edgetam_video
+        title: EdgeTamVideo
      - local: model_doc/emu3
        title: Emu3
      - local: model_doc/evolla
@ -1085,6 +1098,8 @@
        title: LayoutLMV3
      - local: model_doc/layoutxlm
        title: LayoutXLM
+      - local: model_doc/lfm2_vl
+        title: LFM2-VL
      - local: model_doc/lilt
        title: LiLT
      - local: model_doc/llama4
--- a/docs/source/en/add_new_model.md
+++ b/docs/source/en/add_new_model.md
@ -51,7 +51,7 @@ This section describes how the model and configuration classes interact and the

 ### Model and configuration

-All Transformers' models inherit from a base [`PreTrainedModel`] and [`PretrainedConfig`] class. The configuration is the models blueprint.
+All Transformers' models inherit from a base [`PreTrainedModel`] and [`PreTrainedConfig`] class. The configuration is the models blueprint.

 There is never more than two levels of abstraction for any model to keep the code readable. The example model here, BrandNewLlama, inherits from `BrandNewLlamaPreTrainedModel` and [`PreTrainedModel`]. It is important that a new model only depends on [`PreTrainedModel`] so that it can use the [`~PreTrainedModel.from_pretrained`] and [`~PreTrainedModel.save_pretrained`] methods.

@ -66,9 +66,9 @@ model = BrandNewLlamaModel.from_pretrained("username/brand_new_llama")
 model.config
 ```

-[`PretrainedConfig`] provides the [`~PretrainedConfig.from_pretrained`] and [`~PretrainedConfig.save_pretrained`] methods.
+[`PreTrainedConfig`] provides the [`~PreTrainedConfig.from_pretrained`] and [`~PreTrainedConfig.save_pretrained`] methods.

-When you use [`PreTrainedModel.save_pretrained`], it automatically calls [`PretrainedConfig.save_pretrained`] so that both the model and configuration are saved together.
+When you use [`PreTrainedModel.save_pretrained`], it automatically calls [`PreTrainedConfig.save_pretrained`] so that both the model and configuration are saved together.

 A model is saved to a `model.safetensors` file and a configuration is saved to a `config.json` file.

--- a/docs/source/en/attention_interface.md
+++ b/docs/source/en/attention_interface.md
@ -193,4 +193,4 @@ def custom_attention_mask(

 It mostly works thanks to the `mask_function`, which is a `Callable` in the form of [torch's mask_mod functions](https://pytorch.org/blog/flexattention/), taking 4 indices as input and returning a boolean to indicate if this position should take part in the attention computation.

-If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py).
+If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py).
--- a/docs/source/en/auto_docstring.md
+++ b/docs/source/en/auto_docstring.md
@ -210,9 +210,9 @@ There are some rules for documenting different types of arguments and they're li
        This can span multiple lines.
    ```

-    * Include `type` in backticks.
-    * Add *optional* if the argument is not required or has a default value.
-    * Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.
+  * Include `type` in backticks.
+  * Add *optional* if the argument is not required or has a default value.
+  * Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.

    These arguments can also be passed to `@auto_docstring` as a `custom_args` argument. It is used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.

@ -292,7 +292,7 @@ The `@auto_docstring` decorator automatically generates docstrings by:

 8. Unrolling kwargs typed with the unpack operator. For specific methods (defined in `UNROLL_KWARGS_METHODS`) or classes (defined in `UNROLL_KWARGS_CLASSES`), the decorator processes `**kwargs` parameters that are typed with `Unpack[KwargsTypedDict]`. It extracts the documentations from the `TypedDict` and adds each parameter to the function's docstring.

-    Currently only supported for [`FastImageProcessorKwargs`].
+    Currently only supported for [`ImagesKwargs`].

 ## Best practices

--- a/docs/source/en/backbones.md
+++ b/docs/source/en/backbones.md
@ -22,7 +22,7 @@ Higher-level computer visions tasks, such as object detection or image segmentat
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Backbone.png"/>
 </div>

-Load a backbone with [`~PretrainedConfig.from_pretrained`] and use the `out_indices` parameter to determine which layer, given by the index, to extract a feature map from.
+Load a backbone with [`~PreTrainedConfig.from_pretrained`] and use the `out_indices` parameter to determine which layer, given by the index, to extract a feature map from.

 ```py
 from transformers import AutoBackbone
@ -46,7 +46,7 @@ There are two ways to load a Transformers backbone, [`AutoBackbone`] and a model
 <hfoptions id="backbone-classes">
 <hfoption id="AutoBackbone">

-The [AutoClass](./model_doc/auto) API automatically loads a pretrained vision model with [`~PretrainedConfig.from_pretrained`] as a backbone if it's supported.
+The [AutoClass](./model_doc/auto) API automatically loads a pretrained vision model with [`~PreTrainedConfig.from_pretrained`] as a backbone if it's supported.

 Set the `out_indices` parameter to the layer you'd like to get the feature map from. If you know the name of the layer, you could also use `out_features`. These parameters can be used interchangeably, but if you use both, make sure they refer to the same layer.

--- a/docs/source/en/cache_explanation.md
+++ b/docs/source/en/cache_explanation.md
@ -136,7 +136,7 @@ The cache position tracks where to insert new tokens in the attention cache. It

 Cache position is used internally for two purposes:

-1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven’t been cached yet are passed to the model's `forward`.
+1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven't been cached yet are passed to the model's `forward`.
 2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, that pre-allocates a specific cache length.

 The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.
@ -156,30 +156,3 @@ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, ret
 generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)

 ```
-
-## Legacy cache format
-
-Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].
-
-The legacy format is essentially the same data structure but organized differently.
- It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer.
- The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
- The format is less flexible and doesn't support features like quantization or offloading.
-
-If your project depends on this legacy format, we recommend to convert to [`DynamicCache`] with [`~DynamicCache.from_legacy_cache`]. Note that legacy cache format is deprecated and not used anymore in `Transformers`. You can convert back to tuple format with [`DynamicCache.to_legacy_cache`] functions, which is helpful if you have custom logic for manipulating a cache in a specific format.
-
-```py
-import torch
-from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
-
-tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
-model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", dtype=torch.float16, device_map="auto")
-inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
-
-# `return_dict_in_generate=True` is required to return the cache and `return_legacy_cache` forces the returned cache
-# in the legacy format
-generation_outputs = model.generate(**inputs, return_dict_in_generate=True, return_legacy_cache=True, max_new_tokens=5)
-
-cache = DynamicCache.from_legacy_cache(generation_outputs.past_key_values)
-legacy_format_cache = cache.to_legacy_cache()
-```
--- a/docs/source/en/chat_extras.md
+++ b/docs/source/en/chat_extras.md
@ -221,4 +221,4 @@ model_input = tokenizer.apply_chat_template(
    messages,
    tools = [current_time, multiply]
 )
-```
+```
--- a/docs/source/en/chat_templating.md
+++ b/docs/source/en/chat_templating.md
@ -77,9 +77,9 @@ Mistral-7B-Instruct uses `[INST]` and `[/INST]` tokens to indicate the start and

 The input to `apply_chat_template` should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker, and the `content` key contains the message. The common roles are:

- - `user` for messages from the user
- - `assistant` for messages from the model
- - `system` for directives on how the model should act (usually placed at the beginning of the chat)
+- `user` for messages from the user
+- `assistant` for messages from the model
+- `system` for directives on how the model should act (usually placed at the beginning of the chat)

 [`apply_chat_template`] takes this list and returns a formatted sequence. Set `tokenize=True` if you want to tokenize the sequence.

@ -124,7 +124,7 @@ Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopte

 > [!WARNING]
 > Some tokenizers add special `<bos>` and `<eos>` tokens. Chat templates should already include all the necessary special tokens, and adding additional special tokens is often incorrect or duplicated, hurting model performance. When you format text with `apply_chat_template(tokenize=False)`, make sure you set `add_special_tokens=False` if you tokenize later to avoid duplicating these tokens.
-> This isn’t an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!
+> This isn't an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!

 ### add_generation_prompt

@ -168,7 +168,7 @@ Can I ask a question?<|im_end|>

 When `add_generation_prompt=True`, `<|im_start|>assistant` is added at the end to indicate the start of an `assistant` message. This lets the model know an `assistant` response is next.

-Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don’t have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.
+Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don't have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.

 ### continue_final_message

@ -187,9 +187,9 @@ model.generate(**formatted_chat)
 ```

 > [!WARNING]
-> You shouldn’t use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.
+> You shouldn't use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.

-[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don’t support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
+[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.

 ## Model training

--- a/docs/source/en/chat_templating_multimodal.md
+++ b/docs/source/en/chat_templating_multimodal.md
@ -56,7 +56,7 @@ out = pipe(text=messages, max_new_tokens=128)
 print(out[0]['generated_text'][-1]['content'])
 ```

-```
+```text
 Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
 ```

@ -96,7 +96,7 @@ processed_chat = processor.apply_chat_template(messages, add_generation_prompt=T
 print(list(processed_chat.keys()))
 ```

-```
+```text
 ['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
 ```

@ -115,7 +115,7 @@ Some vision models also support video inputs. The message format is very similar

 - The content `"type"` should be `"video"` to indicate the content is a video.
 - For videos, it can be a link to the video (`"url"`) or it could be a file path (`"path"`). Videos loaded from a URL can only be decoded with [PyAV](https://pyav.basswood-io.com/docs/stable/) or [Decord](https://github.com/dmlc/decord).
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you’ve already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save to files or store it in an URL.
+- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you've already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save to files or store it in an URL.

 > [!WARNING]
 > Loading a video from `"url"` is only supported by the PyAV or Decord backends.
--- a/docs/source/en/chat_templating_writing.md
+++ b/docs/source/en/chat_templating_writing.md
@ -188,7 +188,7 @@ The example below shows how a tool is defined in JSON schema format.

 An example of handling tool definitions in a chat template is shown below. The specific tokens and layouts should be changed to match the ones the model was trained with.

-```
+```jinja
 {%- if tools %}
    {%- for tool in tools %}
        {{- '<tool>' + tool['function']['name'] + '\n' }}
@ -226,7 +226,7 @@ Tool calls are generally passed in the `tool_calls` key of an `"assistant”` me

 A common pattern for handling tool calls is shown below. You can use this as a starting point, but make sure you template actually matches the format the model was trained with!

-```
+```jinja
 {%- if message['role'] == 'assistant' and 'tool_calls' in message %}
    {%- for tool_call in message['tool_calls'] %}
            {{- '<tool_call>' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n</tool_call>' }}
@ -249,7 +249,7 @@ Tool responses are message dicts with the `tool` role. They are much simpler tha

 Some templates may not even need the `name` key, in which case, you can write your template to only read the `content` key.

-```
+```jinja
 {%- if message['role'] == 'tool' %}
    {{- "<tool_result>" + message['content'] + "</tool_result>" }}
 {%- endif %}
--- a/docs/source/en/cursor.md
+++ b/docs/source/en/cursor.md
@ -21,9 +21,10 @@ where `port` is the port used by `transformers serve` (`8000` by default). On th
 </h3>

 You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order:
+
 1. Unselect ALL models in the list above (e.g. `gpt4`, ...);
 2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`)
-3. Add some random text to OpenAI API Key. This field won't be used, but it can’t be empty;
+3. Add some random text to OpenAI API Key. This field won't be used, but it can't be empty;
 4. Add the https address from `ngrok` to the "Override OpenAI Base URL" field, appending `/v1` to the address (i.e. `https://(...).ngrok-free.app/v1`);
 5. Hit "Verify".

--- a/docs/source/en/custom_models.md
+++ b/docs/source/en/custom_models.md
@ -25,12 +25,12 @@ This guide will show you how to customize a ResNet model, enable [AutoClass](./m

 ## Configuration

-A configuration, given by the base [`PretrainedConfig`] class, contains all the necessary information to build a model. This is where you'll configure the attributes of the custom ResNet model. Different attributes gives different ResNet model types.
+A configuration, given by the base [`PreTrainedConfig`] class, contains all the necessary information to build a model. This is where you'll configure the attributes of the custom ResNet model. Different attributes gives different ResNet model types.

 The main rules for customizing a configuration are:

-1. A custom configuration must subclass [`PretrainedConfig`]. This ensures a custom model has all the functionality of a Transformers' model such as [`~PretrainedConfig.from_pretrained`], [`~PretrainedConfig.save_pretrained`], and [`~PretrainedConfig.push_to_hub`].
-2. The [`PretrainedConfig`] `__init__` must accept any `kwargs` and they must be passed to the superclass `__init__`. [`PretrainedConfig`] has more fields than the ones set in your custom configuration, so when you load a configuration with [`~PretrainedConfig.from_pretrained`], those fields need to be accepted by your configuration and passed to the superclass.
+1. A custom configuration must subclass [`PreTrainedConfig`]. This ensures a custom model has all the functionality of a Transformers' model such as [`~PreTrainedConfig.from_pretrained`], [`~PreTrainedConfig.save_pretrained`], and [`~PreTrainedConfig.push_to_hub`].
+2. The [`PreTrainedConfig`] `__init__` must accept any `kwargs` and they must be passed to the superclass `__init__`. [`PreTrainedConfig`] has more fields than the ones set in your custom configuration, so when you load a configuration with [`~PreTrainedConfig.from_pretrained`], those fields need to be accepted by your configuration and passed to the superclass.

 > [!TIP]
 > It is useful to check the validity of some of the parameters. In the example below, a check is implemented to ensure `block_type` and `stem_type` belong to one of the predefined values.
@ -38,10 +38,10 @@ The main rules for customizing a configuration are:
 > Add `model_type` to the configuration class to enable [AutoClass](./models#autoclass) support.

 ```py
-from transformers import PretrainedConfig
+from transformers import PreTrainedConfig
 from typing import List

-class ResnetConfig(PretrainedConfig):
+class ResnetConfig(PreTrainedConfig):
    model_type = "resnet"

    def __init__(
@ -74,7 +74,7 @@ class ResnetConfig(PretrainedConfig):
        super().__init__(**kwargs)
 ```

-Save the configuration to a JSON file in your custom model folder, `custom-resnet`, with [`~PretrainedConfig.save_pretrained`].
+Save the configuration to a JSON file in your custom model folder, `custom-resnet`, with [`~PreTrainedConfig.save_pretrained`].

 ```py
 resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True)
@ -83,7 +83,7 @@ resnet50d_config.save_pretrained("custom-resnet")

 ## Model

-With the custom ResNet configuration, you can now create and customize the model. The model subclasses the base [`PreTrainedModel`] class. Like [`PretrainedConfig`], inheriting from [`PreTrainedModel`] and initializing the superclass with the configuration extends Transformers' functionalities such as saving and loading to the custom model.
+With the custom ResNet configuration, you can now create and customize the model. The model subclasses the base [`PreTrainedModel`] class. Like [`PreTrainedConfig`], inheriting from [`PreTrainedModel`] and initializing the superclass with the configuration extends Transformers' functionalities such as saving and loading to the custom model.

 Transformers' models follow the convention of accepting a `config` object in the `__init__` method. This passes the entire `config` to the model sublayers, instead of breaking the `config` object into multiple arguments that are individually passed to the sublayers.

@ -235,7 +235,7 @@ from resnet_model.configuration_resnet import ResnetConfig
 from resnet_model.modeling_resnet import ResnetModel, ResnetModelForImageClassification
 ```

-Copy the code from the model and configuration files. To make sure the AutoClass objects are saved with [`~PreTrainedModel.save_pretrained`], call the [`~PretrainedConfig.register_for_auto_class`] method. This modifies the configuration JSON file to include the AutoClass objects and mapping.
+Copy the code from the model and configuration files. To make sure the AutoClass objects are saved with [`~PreTrainedModel.save_pretrained`], call the [`~PreTrainedConfig.register_for_auto_class`] method. This modifies the configuration JSON file to include the AutoClass objects and mapping.

 For a model, pick the appropriate `AutoModelFor` class based on the task.

--- a/docs/source/en/debugging.md
+++ b/docs/source/en/debugging.md
@ -45,7 +45,7 @@ which nvcc

 You may also have more than one CUDA toolkit installed on your system.

-```bash
+```text
 /usr/local/cuda-10.2
 /usr/local/cuda-11.0
 ```
--- a/docs/source/en/deepspeed.md
+++ b/docs/source/en/deepspeed.md
@ -294,7 +294,7 @@ Consider running a [benchmark](https://github.com/microsoft/DeepSpeed/issues/998

 The example ZeRO-3 and ZeRO-Infinity config below sets most of the parameter values to `auto`, but you can also manually set configure these values.

-```yaml
+```json
 {
    "fp16": {
        "enabled": "auto",
@ -383,7 +383,7 @@ Gradient checkpointing saves memory by only storing *some* of the intermediate a

 The batch size can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets `train_micro_batch_size_per_gpu` and `train_batch_size` to the value of `world_size * per_device_train_batch_size * gradient_accumulation_steps`.

-```yaml
+```json
 {
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto"
@ -400,7 +400,7 @@ Reduce operations are lossy, for example, when gradients are averaged across mul

 Choose the communication data type by setting the `communication_data_type` parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction operation is accumulated in fp32 and when it is ready, it's downcasted to whichever half-precision data type you're training in.

-```yaml
+```json
 {
    "communication_data_type": "fp32"
 }
@ -412,7 +412,7 @@ Gradient accumulation accumulates gradients over several mini-batches of data be

 Gradient accumulation can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `gradient_accumulation_steps`.

-```yaml
+```json
 {
    "gradient_accumulation_steps": "auto"
 }
@ -424,7 +424,7 @@ Gradient clipping is useful for preventing exploding gradients which can lead to

 Gradient clipping can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `max_grad_norm`.

-```yaml
+```json
 {
    "gradient_clipping": "auto"
 }
@ -439,7 +439,7 @@ Mixed precision accelerates training speed by performing some calculations in ha

 Train in fp32 if a model wasn't pretrained in mixed precision because it may cause underflow or overflow errors. Disable fp16, the default, in this case.

-```yaml
+```json
 {
    "fp16": {
        "enabled": false
@ -452,9 +452,9 @@ For Ampere GPUs and PyTorch 1.7+, the more efficient [tf32](https://pytorch.org/
 </hfoption>
 <hfoption id="fp16">

-To configure AMP-like fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically enables or disables fp16 based on the value of `fp16_backend`, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend amp` or `--fp16_full_eval`.
+To configure fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically enables or disables fp16 based on the value of `fp16` or `fp16_full_eval`, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: `--fp16` or `--fp16_full_eval` also.

-```yaml
+```json
 {
    "fp16": {
        "enabled": "auto",
@ -469,28 +469,17 @@ To configure AMP-like fp16 mixed precision, set up the config as shown below wit

 For additional DeepSpeed fp16 training options, take a look at the [FP16 Training Options](https://www.deepspeed.ai/docs/config-json/#fp16-training-options) reference.

-To configure Apex-like fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically configures `amp` based on the values of `fp16_backend` and `fp16_opt_level`. It can also be enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend apex` or `--fp16_opt_level 01`.
-
-```yaml
-{
-    "amp": {
-        "enabled": "auto",
-        "opt_level": "auto"
-    }
-}
-```
-
 </hfoption>
 <hfoption id="bf16">

 > [!TIP]
 > bf16 requires DeepSpeed 0.6.0.

-bf16 has the same dynamic range as fp32, and doesn’t require loss scaling unlike fp16. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desirable because the lower precision can lead to lossy accumulation.
+bf16 has the same dynamic range as fp32, and doesn't require loss scaling unlike fp16. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desirable because the lower precision can lead to lossy accumulation.

 bf16 can be set up in the config file or enabled from the command line when the following arguments are passed: `--bf16` or `--bf16_full_eval`.

-```yaml
+```json
 {
    "bf16": {
        "enabled": "auto"
@ -514,7 +503,7 @@ DeepSpeed offers several [optimizers](https://www.deepspeed.ai/docs/config-json/

 You can set the parameters to `"auto"` or manually input your own values.

-```yaml
+```json
 {
   "optimizer": {
       "type": "AdamW",
@ -530,7 +519,7 @@ You can set the parameters to `"auto"` or manually input your own values.

 Use an unsupported optimizer by adding the following to the top level configuration.

-```yaml
+```json
 {
   "zero_allow_untested_optimizer": true
 }
@ -538,7 +527,7 @@ Use an unsupported optimizer by adding the following to the top level configurat

 From DeepSpeed 0.8.3+, if you want to use offload, you'll also need to add the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer.

-```yaml
+```json
 {
   "zero_force_ds_cpu_optimizer": false
 }
@ -558,7 +547,7 @@ If you don't configure the scheduler in the config file, [`Trainer`] automatical

 You can set the parameters to `"auto"` or manually input your own values.

-```yaml
+```json
 {
   "scheduler": {
         "type": "WarmupDecayLR",
@ -581,7 +570,7 @@ You can set the parameters to `"auto"` or manually input your own values.

 Resume training with a Universal checkpoint by setting `load_universal` to `true` in the config file.

-```yaml
+```json
 {
    "checkpoint": {
        "load_universal": true
@ -640,7 +629,7 @@ deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \

 A multi-node setup consists of multiple nodes, where each node has one of more GPUs running a workload. DeepSpeed expects a shared storage system, but if this is not the case, you need to adjust the config file to include a [checkpoint](https://www.deepspeed.ai/docs/config-json/#checkpoint-options) to allow loading without access to a shared filesystem.

-```yaml
+```json
 {
  "checkpoint": {
    "use_node_local_storage": true
@ -824,7 +813,7 @@ ZeRO-2 saves the model weights in fp16. To save the weights in fp16 for ZeRO-3,

 If you don't, [`Trainer`] won't save the weights in fp16 and won't create a `pytorch_model.bin` file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights, so you won't be able to load it.

-```yaml
+```json
 {
    "zero_optimization": {
        "stage": 3,
@ -986,7 +975,7 @@ NaN loss often occurs when a model is pretrained in bf16 and you try to use it w

 It is also possible that fp16 is causing overflow. For example, if your config file looks like the one below, you may see the following overflow errors in the logs.

-```yaml
+```json
 {
    "fp16": {
        "enabled": "auto",
--- a/docs/source/en/generation_strategies.md
+++ b/docs/source/en/generation_strategies.md
@ -229,6 +229,7 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
 ## Custom generation methods

 Custom generation methods enable specialized behavior such as:
+
 - have the model continue thinking if it is uncertain;
 - roll back generation if the model gets stuck;
 - handle special tokens with custom logic;
@ -289,7 +290,7 @@ print(tokenizer.batch_decode(gen_out)[0])

 If the custom method has pinned Python requirements that your environment doesn't meet, you'll get an exception about missing requirements. For instance, [transformers-community/custom_generate_bad_requirements](https://huggingface.co/transformers-community/custom_generate_bad_requirements) has an impossible set of requirements defined in its `custom_generate/requirements.txt` file, and you'll see the error message below if you try to run it.

-```
+```text
 ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
 foo (installed: None)
 bar==0.0.0 (installed: None)
@ -301,6 +302,7 @@ Updating your Python requirements accordingly will remove this error message.
 ### Creating a custom generation method

 To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
+
 1. The model you've designed your generation method with.
 2. `custom_generate/generate.py`, which contains all the logic for your custom generation method.
 3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
@ -308,7 +310,7 @@ To create a new generation method, you need to create a new [**Model**](https://

 After you've added all required files, your repository should look like this

-```
+```text
 your_repo/
 ├── README.md          # include the 'custom_generate' tag
 ├── config.json
@ -377,6 +379,7 @@ def generate(model, input_ids, generation_config=None, left_padding=None, **kwar
 ```

 Follow the recommended practices below to ensure your custom generation method works as expected.
+
 - Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
 - Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
 - Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
@ -399,7 +402,7 @@ The root level `README.md` in the model repository usually describes the model t

 For discoverability, we highly recommend you to add the `custom_generate` tag to your repository. To do so, the top of your `README.md` file should look like the example below. After you push the file, you should see the tag in your repository!

-```
+```text
 ---
 library_name: transformers
 tags:
@ -410,13 +413,14 @@ tags:
 ```

 Recommended practices:
+
 - Document input and output differences in [`~GenerationMixin.generate`].
 - Add self-contained examples to enable quick experimentation.
 - Describe soft-requirements such as if the method only works well with a certain family of models.

-### Reusing `generate`’s input preparation
+### Reusing `generate`'s input preparation

-If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]’s full preparation pipeline while overriding only the decoding loop.
+If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]'s full preparation pipeline while overriding only the decoding loop.

 ```py
 def custom_loop(model, input_ids, attention_mask, logits_processor, stopping_criteria, generation_config, **model_kwargs):
@ -437,11 +441,12 @@ output = model.generate(
 ```

 > [!TIP]
-> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers’ built-in input preparation logic.
+> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers' built-in input preparation logic.

 ### Finding custom generation methods

 You can find all custom generation methods by [searching for their custom tag.](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods:
+
 - [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community;
 - [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`.

--- a/docs/source/en/glossary.md
+++ b/docs/source/en/glossary.md
@ -185,9 +185,9 @@ See the [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/

 The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example:

-  * [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
-  * [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
-  * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
+* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
+* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
+* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].

 ## I

--- a/docs/source/en/how_to_hack_models.md
+++ b/docs/source/en/how_to_hack_models.md
@ -149,4 +149,4 @@ Call [print_trainable_parameters](https://huggingface.co/docs/peft/package_refer
 ```py
 model.print_trainable_parameters()
 "trainable params: 589,824 || all params: 94,274,096 || trainable%: 0.6256"
-```
+```
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@ -34,6 +34,10 @@ There are over 1M+ Transformers [model checkpoints](https://huggingface.co/model

 Explore the [Hub](https://huggingface.com/) today to find a model and use Transformers to help you get started right away.

+Explore the [Models Timeline](./models_timeline) to discover the latest text, vision, audio and multimodal model architectures in Transformers.
+
+
+
 ## Features

 Transformers provides everything you need for inference or training with state-of-the-art pretrained models. Some of the main features include:
@ -60,4 +64,4 @@ Transformers is designed for developers and machine learning engineers and resea

 ## Learn

-If you're new to Transformers or want to learn more about transformer models, we recommend starting with the [LLM course](https://huggingface.co/learn/llm-course/chapter1/1?fw=pt). This comprehensive course covers everything from the fundamentals of how transformer models work to practical applications across various tasks. You'll learn the complete workflow, from curating high-quality datasets to fine-tuning large language models and implementing reasoning capabilities. The course contains both theoretical and hands-on exercises to build a solid foundational knowledge of transformer models as you learn.
+If you're new to Transformers or want to learn more about transformer models, we recommend starting with the [LLM course](https://huggingface.co/learn/llm-course/chapter1/1?fw=pt). This comprehensive course covers everything from the fundamentals of how transformer models work to practical applications across various tasks. You'll learn the complete workflow, from curating high-quality datasets to fine-tuning large language models and implementing reasoning capabilities. The course contains both theoretical and hands-on exercises to build a solid foundational knowledge of transformer models as you learn.
--- a/docs/source/en/internal/generation_utils.md
+++ b/docs/source/en/internal/generation_utils.md
@ -153,6 +153,9 @@ generation.
 [[autodoc]] TemperatureLogitsWarper
    - __call__

+[[autodoc]] TopHLogitsWarper
+    - __call__
+
 [[autodoc]] TopKLogitsWarper
    - __call__

@ -193,28 +196,6 @@ A [`StoppingCriteria`] can be used to change when to stop generation (other than
 [[autodoc]] EosTokenCriteria
    - __call__

-## Constraints
-
-A [`Constraint`] can be used to force the generation to include specific tokens or sequences in the output. Please note that this is exclusively available to our PyTorch implementations.
-
-[[autodoc]] Constraint
-
-[[autodoc]] PhrasalConstraint
-
-[[autodoc]] DisjunctiveConstraint
-
-[[autodoc]] ConstraintListState
-
-## BeamSearch
-
-[[autodoc]] BeamScorer
-    - process
-    - finalize
-
-[[autodoc]] ConstrainedBeamSearchScorer
-    - process
-    - finalize
-
 ## Streamers

 [[autodoc]] TextStreamer
@ -270,19 +251,19 @@ A [`Constraint`] can be used to force the generation to include specific tokens
    - batch_select_indices

 [[autodoc]] DynamicCache
-    - to_legacy_cache
-    - from_legacy_cache
+
+[[autodoc]] StaticCache

 [[autodoc]] QuantizedCache

+[[autodoc]] EncoderDecoderCache
+
 [[autodoc]] QuantoQuantizedCache

 [[autodoc]] HQQQuantizedCache

 [[autodoc]] OffloadedCache

-[[autodoc]] StaticCache
-
 [[autodoc]] OffloadedStaticCache

 [[autodoc]] HybridCache
@ -291,10 +272,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens

 [[autodoc]] SlidingWindowCache

-[[autodoc]] EncoderDecoderCache
-    - to_legacy_cache
-    - from_legacy_cache
-
 ## Watermark Utils

 [[autodoc]] WatermarkingConfig
--- a/docs/source/en/internal/model_debugging_utils.md
+++ b/docs/source/en/internal/model_debugging_utils.md
@ -218,9 +218,9 @@ path reference to the associated `.safetensors` file. Each tensor is written to
 the state dictionary. File names are constructed using the `module_path` as a prefix with a few possible postfixes that
 are built recursively.

-*   Module inputs are denoted with the `_inputs` and outputs by `_outputs`.
-*   `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
-*   `dict` instances will be postfixed with `_{key}`.
+* Module inputs are denoted with the `_inputs` and outputs by `_outputs`.
+* `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
+* `dict` instances will be postfixed with `_{key}`.

 ### Comparing between implementations

@ -255,6 +255,7 @@ how many tests are being skipped and for which models.
 When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip?

 This utility:
+
 - scans all test_modeling_common methods
 - looks for times where a method is skipped
 - returns a summary json you can load as a DataFrame/inspect
@ -279,7 +280,7 @@ python utils/scan_skipped_tests.py --output_dir path/to/output

 **Example output:**

-```
+```text
 🔬 Parsing 331 model test files once each...
 📝 Aggregating 224 tests...
  (224/224) test_update_candidate_strategy_with_matches_1es_3d_is_nonecodet_schedule_fa_kwargs
@ -344,10 +345,94 @@ Skipped : 124/323 (38.4%)
 - bit: Bit does not use inputs_embeds
 - blip: Blip does not use inputs_embeds
 - blip_2: Inputs_embeds is tested in individual model tests
- - bridgetower: 
+ - bridgetower:
 - canine: CANINE does not have a get_input_embeddings() method.
 - ...

 📄 JSON saved to /home/pablo/git/transformers/scan_test_inputs_embeds.json

 ```
+
+## Modular model detector
+
+### Code similarity analyzer - for model adders
+
+This utility analyzes code similarities between model implementations to identify opportunities for modularization. It compares a new or existing modeling file against all models in the library using embedding-based and token-based similarity metrics.
+
+### Rationale
+
+When adding a new model to transformers, many components (attention layers, MLPs, outputs, etc.) may already exist in similar form in other models. Instead of implementing everything from scratch, model adders can identify which existing classes are similar and potentially reusable through modularization.
+
+The tool computes two similarity scores:
+- **Embedding score**: Uses semantic code embeddings (via `Qwen/Qwen3-Embedding-4B`) to detect functionally similar code even with different naming
+- **Jaccard score**: Measures token set overlap to identify structurally similar code patterns
+
+A score of 1.00 means the code is identical.
+
+### Usage
+
+From the root of the `transformers` repository:
+
+```bash
+python utils/modular_model_detector.py --modeling-file path/to/modeling_file.py
+```
+
+The tool will automatically download the pre-built index from the Hub (requires RAM/VRAM for the embedding model).
+
+**Example output:**
+
+```text
+Loading checkpoint shards: 100%|████████████████████| 2/2 [00:00<00:00, 33.62it/s]
+encoding 21 query definitions with Qwen/Qwen3-Embedding-4B (device=cuda, batch=16, max_length=4096)
+
+stuff.py::Beit3ImageTextMatchingOutput:
+embedding:
+    blip_2::Blip2ImageTextMatchingModelOutput (0.9994)
+    chinese_clip::ChineseCLIPOutput (0.9818)
+    owlvit::OwlViTOutput (0.9818)
+jaccard:
+    owlv2::Owlv2Output (0.9667)
+    metaclip_2::MetaClip2Output (0.9667)
+    altclip::AltCLIPOutput (0.9667)
+intersection:
+    blip::BlipOutput
+    owlvit::OwlViTOutput
+
+stuff.py::Beit3MLP:
+embedding:
+    efficientloftr::EfficientLoFTRMLP (0.9718)
+    seggpt::SegGptMlp (0.9650)
+jaccard:
+    chinese_clip::ChineseCLIPTextSelfOutput (0.5294)
+    bert::BertSelfOutput (0.5294)
+intersection:
+```
+
+The `intersection` field shows classes that appear in both top-5 results, indicating high confidence for modularization candidates.
+
+### Building a custom index
+
+To rebuild the index from your local codebase (useful after adding new models or using a different embedding model):
+
+```bash
+python utils/modular_model_detector.py --build
+```
+
+To push the rebuilt index to a Hub dataset:
+
+```bash
+python utils/modular_model_detector.py --build --push-new-index --hub-dataset your-org/your-dataset
+```
+
+### Options
+
+- `--modeling-file`: Path to the modeling file to analyze
+- `--build`: Build the code similarity index from all modeling files in `src/transformers/models/`
+- `--push-new-index`: After building, push the index to a Hub dataset (requires `--build`)
+- `--hub-dataset`: Hub dataset repository ID to pull/push the index (default: `hf-internal-testing/transformers_code_embeddings`)
+
+### Limitations
+
+This tool requires GPU/CPU resources to run the embedding model (`Qwen/Qwen3-Embedding-4B`). The pre-built index is downloaded from the Hub by default, which requires an internet connection on first use.
+
+Results are suggestions based on code similarity and should be manually reviewed before modularization. High similarity scores don't guarantee perfect compatibility.
--- a/docs/source/en/internal/modeling_utils.md
+++ b/docs/source/en/internal/modeling_utils.md
@ -46,10 +46,4 @@ Most of those are only useful if you are studying the code of the models in the

 [[autodoc]] pytorch_utils.apply_chunking_to_forward

-[[autodoc]] pytorch_utils.find_pruneable_heads_and_indices
-
-[[autodoc]] pytorch_utils.prune_layer
-
-[[autodoc]] pytorch_utils.prune_conv1d_layer
-
 [[autodoc]] pytorch_utils.prune_linear_layer
--- a/docs/source/en/internal/trainer_utils.md
+++ b/docs/source/en/internal/trainer_utils.md
@ -36,10 +36,6 @@ Most of those are only useful if you are studying the code of the Trainer in the

 [[autodoc]] trainer_callback.CallbackHandler

-## Distributed Evaluation
-
-[[autodoc]] trainer_pt_utils.DistributedTensorGatherer
-
 ## Trainer Argument Parser

 [[autodoc]] HfArgumentParser
--- a/docs/source/en/jan.md
+++ b/docs/source/en/jan.md
@ -25,7 +25,7 @@ You are now ready to chat!

 To conclude this example, let's look into a more advanced use-case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have `ssh` access from your Jan machine into your server, this can be accomplished by typing the following to your Jan machine's terminal

-```
+```bash
 ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
 ```

--- a/docs/source/en/kernel_doc/overview.md
+++ b/docs/source/en/kernel_doc/overview.md
@ -0,0 +1,3 @@
+# Overview
+
+Kernels in transformers are used to optimize the performance of models with custom layers from the hub and very low effort.
--- a/docs/source/en/kv_cache.md
+++ b/docs/source/en/kv_cache.md
@ -213,7 +213,7 @@ A cache can also work in iterative generation settings where there is back-and-f

 For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating).

-The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you’re using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.
+The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you're using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.

 For example, some models use special `<think> ... </think>` tokens during reasoning. These could get lost during re-encoding, causing indexing issues. You might need to manually remove or adjust extra tokens from the completions to keep things stable.

--- a/docs/source/en/llm_tutorial.md
+++ b/docs/source/en/llm_tutorial.md
@ -94,6 +94,7 @@ model.generate(**inputs, num_beams=4, do_sample=True)
 ```

 [`~GenerationMixin.generate`] can also be extended with external libraries or custom code:
+
 1. the `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution;
 2. the `stopping_criteria` parameters supports custom [`StoppingCriteria`] to stop text generation;
 3. other custom generation methods can be loaded through the `custom_generate` flag ([docs](generation_strategies.md/#custom-decoding-methods)).
--- a/docs/source/en/llm_tutorial_optimization.md
+++ b/docs/source/en/llm_tutorial_optimization.md
@ -100,7 +100,7 @@ result

 **Output**:

-```
+```text
 Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
 ```

@ -119,7 +119,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

 **Output**:

-```bash
+```text
 29.0260648727417
 ```

@ -208,7 +208,7 @@ result

 **Output**:

-```
+```text
 Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
 ```

@ -220,7 +220,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

 **Output**:

-```
+```text
 15.219234466552734
 ```

@ -251,7 +251,7 @@ result

 **Output**:

-```
+```text
 Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument
 ```

@ -263,7 +263,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

 **Output**:

-```
+```text
 9.543574333190918
 ```

@ -338,169 +338,6 @@ Essentially, Flash Attention makes sure that all intermediate write and read ope

 In practice, there is currently absolutely no reason to **not** use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.

-Let's look at a practical example.
-
-Our OctoCoder model now gets a significantly longer input prompt which includes a so-called *system prompt*. System prompts are used to steer the LLM into a better assistant that is tailored to the users' task.
-In the following, we use a system prompt that will make OctoCoder a better coding assistant.
-
-```python
-system_prompt = """Below are a series of dialogues between various people and an AI technical assistant.
-The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
-The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
-It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
-That said, the assistant is practical really does its best, and doesn't let caution get too much in the way of being useful.
-
-The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
-The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
-
-----
-
-Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.
-
-Answer: Sure. Here is a function that does that.
-
-def alternating(list1, list2):
-   results = []
-   for i in range(len(list1)):
-       results.append(list1[i])
-       results.append(list2[i])
-   return results
-
-Question: Can you write some test cases for this function?
-
-Answer: Sure, here are some tests.
-
-assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
-assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
-assert alternating([], []) == []
-
-Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
-
-Answer: Here is the modified function.
-
-def alternating(list1, list2):
-   results = []
-   for i in range(min(len(list1), len(list2))):
-       results.append(list1[i])
-       results.append(list2[i])
-   if len(list1) > len(list2):
-       results.extend(list1[i+1:])
-   else:
-       results.extend(list2[i+1:])
-   return results
-
-----
-"""
-```
-
-For demonstration purposes, we duplicate the system prompt by ten so that the input length is long enough to observe Flash Attention's memory savings.
-We append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"`
-
-```python
-long_prompt = 10 * system_prompt + prompt
-```
-
-We instantiate our model again in bfloat16 precision.
-
-```python
-model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", dtype=torch.bfloat16, device_map="auto")
-tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
-
-pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
-```
-
-Let's now run the model just like before *without Flash Attention* and measure the peak GPU memory requirement and inference time.
-
-```python
-import time
-
-start_time = time.time()
-result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
-
-print(f"Generated in {time.time() - start_time} seconds.")
-result
-```
-
-**Output**:
-
-```
-Generated in 10.96854019165039 seconds.
-Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n   return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
-````
-
-We're getting the same output as before, however this time, the model repeats the answer multiple times until it's 60 tokens cut-off. This is not surprising as we've repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself.
-
-**Note** that the system prompt should not be repeated ten times in real-world applications - one time is enough!
-
-Let's measure the peak GPU memory requirement.
-
-```python
-bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
-```
-
-**Output**:
-
-```bash
-37.668193340301514
-```
-
-As we can see the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. Also the generation takes a little over a minute now.
-
-We call `flush()` to free GPU memory for our next experiment.
-
-```python
-flush()
-```
-
-For comparison, let's run the same function, but enable Flash Attention instead.
-To do so, we convert the model to [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview) and by doing so enabling PyTorch's [SDPA self-attention](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) which in turn is able to use Flash Attention.
-
-```python
-model.to_bettertransformer()
-```
-
-Now we run the exact same code snippet as before and under the hood Transformers will make use of Flash Attention.
-
-```py
-start_time = time.time()
-with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
-    result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
-
-print(f"Generated in {time.time() - start_time} seconds.")
-result
-```
-
-**Output**:
-
-```
-Generated in 3.0211617946624756 seconds.
- Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n   return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
-```
-
-We're getting the exact same result as before, but can observe a very significant speed-up thanks to Flash Attention.
-
-Let's measure the memory consumption one last time.
-
-```python
-bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
-```
-
-**Output**:
-
-```
-32.617331981658936
-```
-
-And we're almost back to our original 29GB peak GPU memory from the beginning.
-
-We can observe that we only use roughly 100MB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequence as done in the beginning.
-
-```py
-flush()
-```
-
-For more information on how to use Flash Attention, please have a look at [this doc page](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#flashattention-2).
-
 ## 3. Architectural Innovations

 So far we have looked into improving computational and memory efficiency by:
@ -618,7 +455,7 @@ generated_text

 **Output**:

-```
+```text
 shape of input_ids torch.Size([1, 21])
 shape of input_ids torch.Size([1, 22])
 shape of input_ids torch.Size([1, 23])
@ -656,7 +493,7 @@ generated_text

 **Output**:

-```
+```text
 shape of input_ids torch.Size([1, 1])
 length of key-value cache 20
 shape of input_ids torch.Size([1, 1])
@ -690,7 +527,7 @@ Note that, despite our advice to use key-value caches, your LLM output may be sl

 The key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. Let's look at an example.

-```
+```text
 User: How many people live in France?
 Assistant: Roughly 75 million people live in France
 User: And how many are in Germany?
@ -728,7 +565,7 @@ tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):]

 **Output**:

-```
+```text
 is a modified version of the function that returns Mega bytes instead.

 def bytes_to_megabytes(bytes):
@ -750,7 +587,7 @@ config = model.config

 **Output**:

-```
+```text
 7864320000
 ```

--- a/docs/source/en/main_classes/configuration.md
+++ b/docs/source/en/main_classes/configuration.md
@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

 # Configuration

-The base class [`PretrainedConfig`] implements the common methods for loading/saving a configuration
+The base class [`PreTrainedConfig`] implements the common methods for loading/saving a configuration
 either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded
 from HuggingFace's AWS S3 repository).

@ -24,8 +24,8 @@ Each derived config class implements model specific attributes. Common attribute
 `hidden_size`, `num_attention_heads`, and `num_hidden_layers`. Text models further implement:
 `vocab_size`.

-## PretrainedConfig
+## PreTrainedConfig

-[[autodoc]] PretrainedConfig
+[[autodoc]] PreTrainedConfig
    - push_to_hub
    - all
--- a/docs/source/en/main_classes/feature_extractor.md
+++ b/docs/source/en/main_classes/feature_extractor.md
@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.

 # Feature Extractor

-A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy and PyTorch tensors.
+A feature extractor is in charge of preparing input features for audio models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, and conversion to NumPy and PyTorch tensors.

 ## FeatureExtractionMixin

--- a/docs/source/en/main_classes/kernels.md
+++ b/docs/source/en/main_classes/kernels.md
@ -0,0 +1,7 @@
+## Kernels
+
+This page documents the kernels configuration utilities.
+
+### KernelConfig
+
+[[autodoc]] KernelConfig
--- a/docs/source/en/main_classes/logging.md
+++ b/docs/source/en/main_classes/logging.md
@ -80,6 +80,7 @@ We use both in the `transformers` library. We leverage and adapt `logging`'s `ca
 management of these warning messages by the verbosity setters above.

 What does that mean for developers of the library? We should respect the following heuristics:
+
 - `warnings` should be favored for developers of the library and libraries dependent on `transformers`
 - `logging` should be used for end-users of the library using it in every-day projects

--- a/docs/source/en/main_classes/model.md
+++ b/docs/source/en/main_classes/model.md
@ -22,7 +22,6 @@ file or directory, or from a pretrained model configuration provided by the libr
 [`PreTrainedModel`] also implements a few methods which are common among all the models to:

 - resize the input token embeddings when new tokens are added to the vocabulary
- prune the attention heads of the model.

 The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`] and [`~generation.GenerationMixin`].

--- a/docs/source/en/main_classes/onnx.md
+++ b/docs/source/en/main_classes/onnx.md
@ -1,53 +0,0 @@
-<!--Copyright 2020 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# Exporting 🤗 Transformers models to ONNX
-
-🤗 Transformers provides a `transformers.onnx` package that enables you to
-convert model checkpoints to an ONNX graph by leveraging configuration objects.
-
-See the [guide](../serialization) on exporting 🤗 Transformers models for more
-details.
-
-## ONNX Configurations
-
-We provide three abstract classes that you should inherit from, depending on the
-type of model architecture you wish to export:
-
-* Encoder-based models inherit from [`~onnx.config.OnnxConfig`]
-* Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`]
-* Encoder-decoder models inherit from [`~onnx.config.OnnxSeq2SeqConfigWithPast`]
-
-### OnnxConfig
-
-[[autodoc]] onnx.config.OnnxConfig
-
-### OnnxConfigWithPast
-
-[[autodoc]] onnx.config.OnnxConfigWithPast
-
-### OnnxSeq2SeqConfigWithPast
-
-[[autodoc]] onnx.config.OnnxSeq2SeqConfigWithPast
-
-## ONNX Features
-
-Each ONNX configuration is associated with a set of _features_ that enable you
-to export models for different types of topologies or tasks.
-
-### FeaturesManager
-
-[[autodoc]] onnx.features.FeaturesManager
--- a/docs/source/en/main_classes/pipelines.md
+++ b/docs/source/en/main_classes/pipelines.md
@ -159,7 +159,7 @@ for batch_size in [1, 8, 64, 256]:
        pass
 ```

-```
+```text
 # On GTX 970
 ------------------------------
 Streaming no batching
@ -195,7 +195,7 @@ This is a occasional very long sentence compared to the other. In that case, the
 tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on
 bigger batches, the program simply crashes.

-```
+```text
 ------------------------------
 Streaming no batching
 100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s]
--- a/docs/source/en/main_classes/processors.md
+++ b/docs/source/en/main_classes/processors.md
@ -17,6 +17,7 @@ rendered properly in your Markdown viewer.
 # Processors

 Processors can mean two different things in the Transformers library:
+
 - the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text)
  or [CLIP](../model_doc/clip) (text and vision)
 - deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD.
--- a/docs/source/en/main_classes/text_generation.md
+++ b/docs/source/en/main_classes/text_generation.md
@ -30,15 +30,15 @@ like token streaming.
 ## GenerationConfig

 [[autodoc]] generation.GenerationConfig
-	- from_pretrained
-	- from_model_config
-	- save_pretrained
-	- update
-	- validate
-	- get_generation_mode
+    - from_pretrained
+    - from_model_config
+    - save_pretrained
+    - update
+    - validate
+    - get_generation_mode

 ## GenerationMixin

 [[autodoc]] GenerationMixin
-	- generate
-	- compute_transition_scores
+    - generate
+    - compute_transition_scores
--- a/docs/source/en/model_doc/align.md
+++ b/docs/source/en/model_doc/align.md
@ -148,6 +148,7 @@ for label, score in zip(candidate_labels, probs):
  ```

 ## Resources
+
 - Refer to the [Kakao Brain’s Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.

 ## AlignConfig
--- a/docs/source/en/model_doc/arcee.md
+++ b/docs/source/en/model_doc/arcee.md
@ -102,4 +102,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ## ArceeForTokenClassification

 [[autodoc]] ArceeForTokenClassification
-    - forward
+    - forward
--- a/docs/source/en/model_doc/audio-spectrogram-transformer.md
+++ b/docs/source/en/model_doc/audio-spectrogram-transformer.md
@ -61,7 +61,7 @@ page for more information.
 SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
 `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

-```
+```py
 from transformers import ASTForAudioClassification
 model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", attn_implementation="sdpa", dtype=torch.float16)
 ...
--- a/docs/source/en/model_doc/auto.md
+++ b/docs/source/en/model_doc/auto.md
@ -48,7 +48,7 @@ You will then be able to use the auto classes like you would usually do!

 <Tip warning={true}>

-If your `NewModelConfig` is a subclass of [`~transformers.PretrainedConfig`], make sure its
+If your `NewModelConfig` is a subclass of [`~transformers.PreTrainedConfig`], make sure its
 `model_type` attribute is set to the same key you use when registering the config (here `"new-model"`).

 Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
--- a/docs/source/en/model_doc/bark.md
+++ b/docs/source/en/model_doc/bark.md
@ -62,24 +62,13 @@ model.enable_cpu_offload()

 Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)

-#### Using Better Transformer
-
-Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:
-
-```python
-model =  model.to_bettertransformer()
-```
-
-Note that 🤗 Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation)
-
 #### Using Flash Attention 2

 Flash Attention 2 is an even faster, optimized version of the previous optimization.

 ##### Installation

-First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [above](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer).
-
+First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features).
 Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:

 ```bash
@ -96,7 +85,7 @@ model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_i

 ##### Performance comparison

-The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase:
+The following diagram shows the latency for the native attention implementation (no optimisation) against Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1:

 <div style="text-align: center">
 <img src="https://huggingface.co/datasets/ylacombe/benchmark-comparison/resolve/main/Bark%20Optimization%20Benchmark.png">
@ -104,11 +93,9 @@ The following diagram shows the latency for the native attention implementation

 To put this into perspective, on an NVIDIA A100 and when generating 400 semantic tokens with a batch size of 16, you can get 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still be 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples will be generated 17 times faster.

-At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.
-
 #### Combining optimization techniques

-You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once.
+You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 all at once.

 ```python
 from transformers import BarkModel, infer_device
--- a/docs/source/en/model_doc/bart.md
+++ b/docs/source/en/model_doc/bart.md
@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
 </div>

 # BART
-[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It’s pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.
+[BART](https://huggingface.co/papers/1910.13461) is a sequence-to-sequence model that combines the pretraining objectives from BERT and GPT. It's pretrained by corrupting text in different ways like deleting words, shuffling sentences, or masking tokens and learning how to fix it. The encoder encodes the corrupted document and the corrupted text is fixed by the decoder. As it learns to recover the original text, BART gets really good at both understanding and generating language.

 You can find all the original BART checkpoints under the [AI at Meta](https://huggingface.co/facebook?search_models=bart) organization.

@ -89,7 +89,7 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran

 - Inputs should be padded on the right because BERT uses absolute position embeddings.
 - The [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) checkpoint doesn't include `mask_token_id` which means it can't perform mask-filling tasks.
- BART doesn’t use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
+- BART doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or [`~PreTrainedTokenizerBase.encode`] to get the proper splitting.
 - The forward pass of [`BartModel`] creates the `decoder_input_ids` if they're not passed. This can be different from other model APIs, but it is a useful feature for mask-filling tasks.
 - Model predictions are intended to be identical to the original implementation when `forced_bos_token_id=0`. This only works if the text passed to `fairseq.encode` begins with a space.
 - [`~GenerationMixin.generate`] should be used for conditional generation tasks like summarization.
--- a/docs/source/en/model_doc/beit.md
+++ b/docs/source/en/model_doc/beit.md
@ -52,7 +52,7 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
 - BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
  outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
  fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
-  [`ViTFeatureExtractor`] by [`BeitImageProcessor`] and
+  [`ViTImageProcessor`] by [`BeitImageProcessor`] and
  [`ViTForImageClassification`] by [`BeitForImageClassification`]).
 - There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
  performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
@ -87,7 +87,7 @@ page for more information.
 SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
 `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

-```
+```py
 from transformers import BeitForImageClassification
 model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
 ...
@ -123,6 +123,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 - See also: [Image classification task guide](../tasks/image_classification)

 **Semantic segmentation**
+
 - [Semantic segmentation task guide](../tasks/semantic_segmentation)

 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
@ -135,12 +136,6 @@ If you're interested in submitting a resource to be included here, please feel f

 [[autodoc]] BeitConfig

-## BeitFeatureExtractor
-
-[[autodoc]] BeitFeatureExtractor
-    - __call__
-    - post_process_semantic_segmentation
-
 ## BeitImageProcessor

 [[autodoc]] BeitImageProcessor
--- a/docs/source/en/model_doc/bert-generation.md
+++ b/docs/source/en/model_doc/bert-generation.md
@ -156,4 +156,4 @@ print(tokenizer.decode(outputs[0]))
 ## BertGenerationDecoder

 [[autodoc]] BertGenerationDecoder
-    - forward
+    - forward
--- a/docs/source/en/model_doc/bertweet.md
+++ b/docs/source/en/model_doc/bertweet.md
@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.

 ## BERTweet

-[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it’s pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
+[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it's pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.

 You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.

@ -88,7 +88,8 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran
 </hfoptions>

 ## Notes
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it’s preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
+
+- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
 - Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.

 ## BertweetTokenizer
--- a/docs/source/en/model_doc/big_bird.md
+++ b/docs/source/en/model_doc/big_bird.md
@ -87,6 +87,7 @@ print(f"The predicted token is: {predicted_token}")
 </hfoptions>

 ## Notes
+
 - Inputs should be padded on the right because BigBird uses absolute position embeddings.
 - BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
 - The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.
--- a/docs/source/en/model_doc/biogpt.md
+++ b/docs/source/en/model_doc/biogpt.md
@ -121,7 +121,6 @@ print(output)

 - Pad inputs on the right because BioGPT uses absolute position embeddings.
 - BioGPT can reuse previously computed key-value attention pairs. Access this feature with the [past_key_values](https://huggingface.co/docs/transformers/main/en/model_doc/biogpt#transformers.BioGptModel.forward.past_key_values) parameter in [`BioGPTModel.forward`].
- The `head_mask` argument is ignored when using an attention implementation other than "eager". If you want to use `head_mask`, make sure `attn_implementation="eager"`).

   ```py
   from transformers import AutoModelForCausalLM
--- a/docs/source/en/model_doc/bit.md
+++ b/docs/source/en/model_doc/bit.md
@ -36,6 +36,7 @@ The original code can be found [here](https://github.com/google-research/big_tra
 ## Usage tips

 - BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494),
+
 2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant
 impact on transfer learning.

@ -72,4 +73,4 @@ If you're interested in submitting a resource to be included here, please feel f
 ## BitForImageClassification

 [[autodoc]] BitForImageClassification
-    - forward
+    - forward
--- a/docs/source/en/model_doc/bitnet.md
+++ b/docs/source/en/model_doc/bitnet.md
@ -38,22 +38,22 @@ Several versions of the model weights are available on Hugging Face:
 ### Model Details

 * **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
-    * Uses Rotary Position Embeddings (RoPE).
-    * Uses squared ReLU (ReLU²) activation in FFN layers.
-    * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
-    * No bias terms in linear or normalization layers.
+  * Uses Rotary Position Embeddings (RoPE).
+  * Uses squared ReLU (ReLU²) activation in FFN layers.
+  * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
+  * No bias terms in linear or normalization layers.
 * **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8).
-    * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
-    * Activations are quantized to 8-bit integers using absmax quantization (per-token).
-    * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
+  * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
+  * Activations are quantized to 8-bit integers using absmax quantization (per-token).
+  * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
 * **Parameters:** ~2 Billion
 * **Training Tokens:** 4 Trillion
-*   **Context Length:** Maximum sequence length of **4096 tokens**.
-    *   *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
+* **Context Length:** Maximum sequence length of **4096 tokens**.
+  * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
 * **Training Stages:**
-    1.  **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
-    2.  **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
-    3.  **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
+    1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
+    2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
+    3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
 * **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).

 ## Usage tips
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
@ -128,7 +128,7 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam
 ## BlipTextLMHeadModel

 [[autodoc]] BlipTextLMHeadModel
- forward
+    - forward

 ## BlipVisionModel

--- a/docs/source/en/model_doc/bloom.md
+++ b/docs/source/en/model_doc/bloom.md
@ -43,16 +43,19 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 - [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).

 See also:
+
 - [Causal language modeling task guide](../tasks/language_modeling)
 - [Text classification task guide](../tasks/sequence_classification)
 - [Token classification task guide](../tasks/token_classification)
 - [Question answering task guide](../tasks/question_answering)

 ⚡️ Inference
+
 - A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
 - A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).

 ⚙️ Training
+
 - A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).

 ## BloomConfig
--- a/docs/source/en/model_doc/blt.md
+++ b/docs/source/en/model_doc/blt.md
@ -13,6 +13,7 @@ specific language governing permissions and limitations under the License.
 rendered properly in your Markdown viewer.

 -->
+*This model was released on 2024-12-13 and added to Hugging Face Transformers on 2025-09-19.*

 <div style="float: right;">
    <div class="flex flex-wrap space-x-1">
@ -24,11 +25,11 @@ rendered properly in your Markdown viewer.
    </div>
 </div>

-# Byte Lantet Transformer (BLT)
+# Byte Latent Transformer (BLT)

 ## Overview

-The BLT model was proposed in [Byte Latent Transformer: Patches Scale Better Than Tokens](<https://arxiv.org/pdf/2412.09871>) by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li1, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†, Srinivasan Iyer.
+The BLT model was proposed in [Byte Latent Transformer: Patches Scale Better Than Tokens](https://huggingface.co/papers/2412.09871) by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li1, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†, Srinivasan Iyer.
 BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.

 The abstract from the paper is the following:
@ -64,8 +65,8 @@ from transformers import AutoTokenizer, AutoModelForCausalLM

 tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")
 model = AutoModelForCausalLM.from_pretrained(
-    "itazap/blt-1b-hf", 
-    device_map="auto", 
+    "itazap/blt-1b-hf",
+    device_map="auto",
 )

 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
--- a/docs/source/en/model_doc/camembert.md
+++ b/docs/source/en/model_doc/camembert.md
@ -16,10 +16,10 @@ rendered properly in your Markdown viewer.
 *This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16.*

 <div style="float: right;">
-	<div class="flex flex-wrap space-x-1">
-		<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+ <div class="flex flex-wrap space-x-1">
+  <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-	</div>
+ </div>
 </div>

 # CamemBERT
--- a/docs/source/en/model_doc/canine.md
+++ b/docs/source/en/model_doc/canine.md
@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.

 # CANINE

-[CANINE](https://huggingface.co/papers/2103.06874) is a tokenization-free Transformer. It skips the usual step of splitting text into subwords or wordpieces and processes text character by character. That means it works directly with raw Unicode, making it especially useful for languages with complex or inconsistent tokenization rules and even noisy inputs like typos. Since working with characters means handling longer sequences, CANINE uses a smart trick. The model compresses the input early on (called downsampling) so the transformer doesn’t have to process every character individually. This keeps things fast and efficient.
+[CANINE](https://huggingface.co/papers/2103.06874) is a tokenization-free Transformer. It skips the usual step of splitting text into subwords or wordpieces and processes text character by character. That means it works directly with raw Unicode, making it especially useful for languages with complex or inconsistent tokenization rules and even noisy inputs like typos. Since working with characters means handling longer sequences, CANINE uses a smart trick. The model compresses the input early on (called downsampling) so the transformer doesn't have to process every character individually. This keeps things fast and efficient.

 You can find all the original CANINE checkpoints under the [Google](https://huggingface.co/google?search_models=canine) organization.

--- a/docs/source/en/model_doc/chinese_clip.md
+++ b/docs/source/en/model_doc/chinese_clip.md
@ -96,10 +96,6 @@ Currently, following scales of pretrained Chinese-CLIP models are available on
 [[autodoc]] ChineseCLIPImageProcessorFast
    - preprocess

-## ChineseCLIPFeatureExtractor
-
-[[autodoc]] ChineseCLIPFeatureExtractor
-
 ## ChineseCLIPProcessor

 [[autodoc]] ChineseCLIPProcessor
@ -119,4 +115,4 @@ Currently, following scales of pretrained Chinese-CLIP models are available on
 ## ChineseCLIPVisionModel

 [[autodoc]] ChineseCLIPVisionModel
-    - forward
+    - forward
--- a/docs/source/en/model_doc/clip.md
+++ b/docs/source/en/model_doc/clip.md
@ -119,10 +119,6 @@ print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_
 [[autodoc]] CLIPImageProcessorFast
    - preprocess

-## CLIPFeatureExtractor
-
-[[autodoc]] CLIPFeatureExtractor
-
 ## CLIPProcessor

 [[autodoc]] CLIPProcessor
--- a/docs/source/en/model_doc/clipseg.md
+++ b/docs/source/en/model_doc/clipseg.md
@ -106,4 +106,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## CLIPSegForImageSegmentation

 [[autodoc]] CLIPSegForImageSegmentation
-    - forward
+    - forward
--- a/docs/source/en/model_doc/cohere.md
+++ b/docs/source/en/model_doc/cohere.md
@ -122,7 +122,8 @@ visualizer("Plants create energy through a process known as")
 </div>

 ## Notes
- Don’t use the dtype parameter in [`~AutoModel.from_pretrained`] if you’re using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
+
+- Don't use the dtype parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).

 ## CohereConfig

--- a/docs/source/en/model_doc/conditional_detr.md
+++ b/docs/source/en/model_doc/conditional_detr.md
@ -59,15 +59,6 @@ This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The o
    - post_process_semantic_segmentation
    - post_process_panoptic_segmentation

-## ConditionalDetrFeatureExtractor
-
-[[autodoc]] ConditionalDetrFeatureExtractor
-    - __call__
-    - post_process_object_detection
-    - post_process_instance_segmentation
-    - post_process_semantic_segmentation
-    - post_process_panoptic_segmentation
-
 ## ConditionalDetrModel

 [[autodoc]] ConditionalDetrModel
--- a/docs/source/en/model_doc/convnext.md
+++ b/docs/source/en/model_doc/convnext.md
@ -59,10 +59,6 @@ If you're interested in submitting a resource to be included here, please feel f

 [[autodoc]] ConvNextConfig

-## ConvNextFeatureExtractor
-
-[[autodoc]] ConvNextFeatureExtractor
-
 ## ConvNextImageProcessor

 [[autodoc]] ConvNextImageProcessor
--- a/docs/source/en/model_doc/cpmant.md
+++ b/docs/source/en/model_doc/cpmant.md
@ -49,4 +49,4 @@ This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The ori
 ## CpmAntForCausalLM

 [[autodoc]] CpmAntForCausalLM
-    - all
+    - all
--- a/docs/source/en/model_doc/dab-detr.md
+++ b/docs/source/en/model_doc/dab-detr.md
@ -80,7 +80,7 @@ for result in results:

 This should output

-```
+```text
 cat: 0.87 [14.7, 49.39, 320.52, 469.28]
 remote: 0.86 [41.08, 72.37, 173.39, 117.2]
 cat: 0.86 [344.45, 19.43, 639.85, 367.86]
--- a/docs/source/en/model_doc/data2vec.md
+++ b/docs/source/en/model_doc/data2vec.md
@ -53,7 +53,6 @@ The original code for vision can be found [here](https://github.com/facebookrese
 - For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
 - For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
 - For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`

 ### Using Scaled Dot Product Attention (SDPA)

@ -68,7 +67,7 @@ SDPA is used by default for `torch>=2.1.1` when an implementation is available,

 The SDPA implementation is currently available for the Data2VecAudio and Data2VecVision models.

-```
+```py
 from transformers import Data2VecVisionForImageClassification
 model = Data2VecVisionForImageClassification.from_pretrained("facebook/data2vec-vision-base", attn_implementation="sdpa", dtype=torch.float16)
 ...
@ -104,6 +103,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 - [`Data2VecVisionForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).

 **Data2VecText documentation resources**
+
 - [Text classification task guide](../tasks/sequence_classification)
 - [Token classification task guide](../tasks/token_classification)
 - [Question answering task guide](../tasks/question_answering)
@ -112,10 +112,12 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 - [Multiple choice task guide](../tasks/multiple_choice)

 **Data2VecAudio documentation resources**
+
 - [Audio classification task guide](../tasks/audio_classification)
 - [Automatic speech recognition task guide](../tasks/asr)

 **Data2VecVision documentation resources**
+
 - [Image classification](../tasks/image_classification)
 - [Semantic segmentation](../tasks/semantic_segmentation)

--- a/docs/source/en/model_doc/deberta.md
+++ b/docs/source/en/model_doc/deberta.md
@ -92,6 +92,7 @@ echo -e '{"text": "A soccer game with multiple people playing.", "text_pair": "S
 </hfoptions>

 ## Notes
+
 - DeBERTa uses **relative position embeddings**, so it does not require **right-padding** like BERT.
 - For best results, use DeBERTa on sentence-level or sentence-pair classification tasks like MNLI, RTE, or SST-2.
 - If you're using DeBERTa for token-level tasks like masked language modeling, make sure to load a checkpoint specifically pretrained or fine-tuned for token-level tasks.
--- a/docs/source/en/model_doc/deepseek_v2.md
+++ b/docs/source/en/model_doc/deepseek_v2.md
@ -47,4 +47,4 @@ The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures f
 ## DeepseekV2ForSequenceClassification

 [[autodoc]] DeepseekV2ForSequenceClassification
-    - forward
+    - forward
--- a/docs/source/en/model_doc/deepseek_v3.md
+++ b/docs/source/en/model_doc/deepseek_v3.md
@ -64,7 +64,7 @@ print(time.time()-start)

 This generated:

-``````
+``````text
 <｜Assistant｜><think>
 Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.

@ -138,7 +138,7 @@ Applying the template to our `messages` list would produce:

 This tells the model:  
 1. The conversation history (user/assistant turns).  
-2. The model’s turn to generate a response (`<|assistant|>` at the end).  
+2. The model's turn to generate a response (`<|assistant|>` at the end).  

 ---

@ -195,4 +195,4 @@ error, it means NCCL was probably not loaded.
 ## DeepseekV3ForTokenClassification

 [[autodoc]] DeepseekV3ForTokenClassification
-    - forward
+    - forward
--- a/docs/source/en/model_doc/deepseek_vl_hybrid.md
+++ b/docs/source/en/model_doc/deepseek_vl_hybrid.md
@ -24,7 +24,7 @@ rendered properly in your Markdown viewer.

 # DeepseekVLHybrid

-[Deepseek-VL-Hybrid](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages [LLaMA](./llama) as its text encoder, while [SigLip](./siglip) is used for encoding low-resolution images and [SAM (Segment Anything Model)](./sam) is incorporated to handle high-resolution image encoding, enhancing the model’s ability to process fine-grained visual details. Deepseek-VL-Hybrid is a variant of Deepseek-VL that uses [SAM (Segment Anything Model)](./sam) to handle high-resolution image encoding.
+[Deepseek-VL-Hybrid](https://huggingface.co/papers/2403.05525) was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages [LLaMA](./llama) as its text encoder, while [SigLip](./siglip) is used for encoding low-resolution images and [SAM (Segment Anything Model)](./sam) is incorporated to handle high-resolution image encoding, enhancing the model's ability to process fine-grained visual details. Deepseek-VL-Hybrid is a variant of Deepseek-VL that uses [SAM (Segment Anything Model)](./sam) to handle high-resolution image encoding.

 You can find all the original Deepseek-VL-Hybrid checkpoints under the [DeepSeek-community](https://huggingface.co/deepseek-community) organization.

--- a/docs/source/en/model_doc/deformable_detr.md
+++ b/docs/source/en/model_doc/deformable_detr.md
@ -16,9 +16,9 @@ rendered properly in your Markdown viewer.
 *This model was released on 2020-10-08 and added to Hugging Face Transformers on 2022-09-14.*

 <div style="float: right;">
-	<div class="flex flex-wrap space-x-1">
-		<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-	</div>
+ <div class="flex flex-wrap space-x-1">
+  <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+ </div>
 </div>

 # Deformable DETR
@ -105,12 +105,6 @@ for result in results:
    - preprocess
    - post_process_object_detection

-## DeformableDetrFeatureExtractor
-
-[[autodoc]] DeformableDetrFeatureExtractor
-    - __call__
-    - post_process_object_detection
-
 ## DeformableDetrConfig

 [[autodoc]] DeformableDetrConfig
--- a/docs/source/en/model_doc/deit.md
+++ b/docs/source/en/model_doc/deit.md
@ -86,7 +86,7 @@ page for more information.
 SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
 `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

-```
+```py
 from transformers import DeiTForImageClassification
 model = DeiTForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224", attn_implementation="sdpa", dtype=torch.float16)
 ...
@ -122,11 +122,6 @@ If you're interested in submitting a resource to be included here, please feel f

 [[autodoc]] DeiTConfig

-## DeiTFeatureExtractor
-
-[[autodoc]] DeiTFeatureExtractor
-    - __call__
-
 ## DeiTImageProcessor

 [[autodoc]] DeiTImageProcessor
--- a/docs/source/en/model_doc/deplot.md
+++ b/docs/source/en/model_doc/deplot.md
@ -68,4 +68,4 @@ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, nu

 DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct).

-</Tip>
+</Tip>
--- a/docs/source/en/model_doc/depth_anything.md
+++ b/docs/source/en/model_doc/depth_anything.md
@ -86,4 +86,4 @@ Image.fromarray(depth.astype("uint8"))
 ## DepthAnythingForDepthEstimation

 [[autodoc]] DepthAnythingForDepthEstimation
-    - forward
+    - forward
--- a/docs/source/en/model_doc/depth_anything_v2.md
+++ b/docs/source/en/model_doc/depth_anything_v2.md
@ -110,4 +110,4 @@ If you're interested in submitting a resource to be included here, please feel f
 ## DepthAnythingForDepthEstimation

 [[autodoc]] DepthAnythingForDepthEstimation
-    - forward
+    - forward
--- a/docs/source/en/model_doc/depth_pro.md
+++ b/docs/source/en/model_doc/depth_pro.md
@ -84,12 +84,13 @@ alt="drawing" width="600"/>
 The `DepthProForDepthEstimation` model uses a `DepthProEncoder`, for encoding the input image and a `FeatureFusionStage` for fusing the output features from encoder.

 The `DepthProEncoder` further uses two encoders:
+
 - `patch_encoder`
-   - Input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
-   - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
-   - These patches are processed by the **`patch_encoder`**
+  - Input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
+  - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
+  - These patches are processed by the **`patch_encoder`**
 - `image_encoder`
-   - Input image is also rescaled to `patch_size` and processed by the **`image_encoder`**
+  - Input image is also rescaled to `patch_size` and processed by the **`image_encoder`**

 Both these encoders can be configured via `patch_model_config` and `image_model_config` respectively, both of which are separate `Dinov2Model` by default.

@ -159,8 +160,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 - Official Implementation: [apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)
 - DepthPro Inference Notebook: [DepthPro Inference](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DepthPro_inference.ipynb)
 - DepthPro for Super Resolution and Image Segmentation
-    - Read blog on Medium: [Depth Pro: Beyond Depth](https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba)
-    - Code on Github: [geetu040/depthpro-beyond-depth](https://github.com/geetu040/depthpro-beyond-depth)
+  - Read blog on Medium: [Depth Pro: Beyond Depth](https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba)
+  - Code on Github: [geetu040/depthpro-beyond-depth](https://github.com/geetu040/depthpro-beyond-depth)

 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

--- a/Show More
+++ b/Show More