* Default implementation - no time improvement
* Improved implementation - apparently 2x faster with only a simple function refactor
* elementary torch-first approach; further implementation of the torch-first method still needed
* torch-first approach finished
* refactor processor
* refactor test
* partial doc update
* EfficientLoFTRImageProcessorFast based implementation
* EfficientLoFTRImageProcessorFast based implementation
* Logic checked - Test Passed - Validated execution speed
* use modular for efficientloftr
* fix import
---------
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* Add a switch to CB in case of paged cache
* Added paged as a valid cache implem
* Added a fallback on inputs_ids as a name
* Rookie mistake
* Removed paged from cache implems
* Added warning about some beam search args
* Moved up CB warning
* Fix EncoderDecoder cache
* Add the option for the ddp data tuples to have 2 elems
* Modify the order of the KV and sliding
* Adapted RAG and Whisper to new EncoderDecoderCache
* A single comma
* Remove kwargs in map
* Fixed order in manual injection cache test
* Slight changes to support legacy format
* Removed Nones
This commit addresses a noisy warning and improves the robustness of the base pipeline implementation.
- The device placement message in the pipeline base class has been changed from a `warning` to a `debug` log. This reduces log noise for users who are aware of their device setup, while still providing the information for debugging purposes.
- Additionally, potential `UnboundLocalError` exceptions in the `_pad` and `check_model_type` functions have been prevented by initializing variables before their conditional assignment.
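A minimal sketch of the second point, assuming a helper along these lines (names are hypothetical, not the actual pipeline internals):

```python
def check_model_type(supported_models) -> list:
    # Bind the name before the conditional so a later reference can never raise UnboundLocalError
    supported_models_names = []
    if isinstance(supported_models, dict):
        supported_models_names = list(supported_models.keys())
    return supported_models_names
```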
* Add is_causal to KosmosTextAttention
* Move get target_dtype to be imported elsewhere
* Fix fp32 flash attention bug in bark
* Fix is_causal in mllama
* Fix fp32 issue on StableLM
* Fix repo-consistency
* add aux
* update
* update config to text_config
* use qwen data class to avoid repeat again
* format
* update
* use 1e-4
* update
* update for remove init
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
* toggle the serialization
* prob this fixes it
* fix tests
* typo
* delete legacy save entirely
* remove extra nesting in if
* revert test and serialize a public attr instead of private
* fix some case failures caused by "`torch.compile` recompiled part of the forward pass" in xpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Add logits_to_keep to CausalLM models
* Skip failing test for git model
* Remove unused return_dict from kosmos2 signature
* Revert BlipForQuestionAnswering
* start
* add the important fix
* continue
* big cleanup
* type hints
* add method
* fix typehints
* typehints
* fix
* oupsi
* remove space
* improve function
* CI
* Big refactor, still classes to move around and script to re-complexify
* Move to streamer, isolate benches, propagate num tokens
* Some refacto
* Added compile mode to name
* Re-order
* Move to dt_tokens
* Better format
* Fix and disable use_cache by default
* Fixed compile and SDPA backend default
* Refactor results format
* Added default compile mode
* Always use cache
* Fixed cache and added flex
* Plan for missing modules
* Experiments: no cg and shuffle
* Disable compile for FA
* Remove wall time, add sweep mode, get git commit
* Review compliance, start
* Apply suggestions from code review
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Update benchmark_v2/framework/benchmark_runner.py
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Disable workflow
* Pretty print
* Added some pretty names to have pretty logs
* Review n2 compliance (end?)
* Style and end of PR
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Fixed Expected self.dtype to be equal to src.dtype on eval
* Fixed Expected self.dtype to be equal to src.dtype on eval
* Fixed Expected self.dtype to be equal to src.dtype on eval
* generated modeling_qwen3_vl_moe.py file
* Fixed Ernie_4_5_MoE router casting
* Fixed routing_weights dtype casting (ernie4_5_moe, hunyuan_v1_moe, qwen2_moe, qwen3_moe, qwen3_next,qwen3_omni_moe)
* rollback hunyuan_v1_moe changes
---------
Co-authored-by: Daniel Oliveira <daniel-oliveira-11@hotmail.com>
Co-authored-by: Daniel Oliveira <36623265+daniel3303@users.noreply.github.com>
For FSDP2, parameters might be on a meta device, and the weight.device attribute may
not accurately reflect where the actual computation will happen during forward passes.
```log
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 776, in forward
pos_embeds = self.fast_pos_embed_interpolate(grid_thw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 745, in fast_pos_embed_interpolate
pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
^^^^^^^
File "torch/nn/modules/module.py", line 1827, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/sparse.py", line 192, in forward
return F.embedding(
^^^^^^^^^^^^
File "torch/nn/functional.py", line 2546, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)
```
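A hedged sketch of one way to avoid the mismatch, assuming the target device is taken from the tensors that actually participate in the forward pass rather than from `weight.device` (names mirror the traceback; the real fix may differ):

```python
import torch

def fast_pos_embed_interpolate_sketch(pos_embed: torch.nn.Embedding, idx_tensor: torch.Tensor,
                                       weight_tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # `device` is where the forward pass actually runs (e.g. derived from the incoming
    # hidden states); under FSDP2 the module's own weight.device can be stale or "meta".
    idx_tensor = idx_tensor.to(device)
    weight_tensor = weight_tensor.to(device)
    return pos_embed(idx_tensor) * weight_tensor[:, :, None]
```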
https://github.com/volcengine/verl/pull/3686#issuecomment-3380981817
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* Add video processor for VideoMAE
* Document VideoMAE video processor
* Add regression tests for VideoMAE video processor
* refactor: Use direct batch key access for pixel_values_videos
* test: add parity test for VideoMAEVideoProcessor vs VideoMAEImageProcessor
* docs(videomae): update model docstring example to demonstrate VideoMAEVideoProcessor (TorchCodec-based decoding and sampling)
* Type hints and small fixes
* Remove unusued params
* Made slice inputs the default
* ruffed
* Updated some var name and moved index slicing
* Logging arg in example
* Added some padding debug var and reformat out cg
* First working CG, fixe size
* Working flexible CG
* CG are compatible with all implementations
* Fixed CG API
* Update example
* Documentation
* Fix padding tokens in FA
* Review compliance
* Better doc around weird bug
* Style
* Fix for sliding with CG
* Merge conflict
* add fast processor
* add fast processor
* make style
* add new convert rgb
* use nested group by shape in mllama fast, add support for multiple inputs in group by shape
* refactor after review
---------
Co-authored-by: Vincent <phamvinh257@gmail.com>
```
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 941, in forward
hidden_states = self._deepstack_process(
^^^^^^^^^^^^^^^^^^^^^^^^
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 960, in _deepstack_process
hidden_states[visual_pos_masks, :] = local_this
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* Set `truncation` to `False` in Qwen3Omni to avoid default truncation
* move `padding` and `truncation` to audio default args
---------
Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>
* [wip][cwm] Code World Model stubs and setup in HF Transformers
* [wip] Get other things working
* [wip] Working
* Tokenizer pad
* fix: cwm window attn
* temp remove test
* temp remove test
* Fixes
* Temporarily add auto config remapping option until VLLM 0.11 is out
* Fix model type and add layer validation
* Lint, remove CwmForSequenceClassification
* Lint, tests
* Remove CwmForSequenceClassification
* Lint
* Remove intermediary layer exports/doc errors, fix tests
* Lint
* run python utils/sort_auto_mappings.py --check_only
* Remove Cwm processor mapping, get check_repo passing
* Remove CwmTextConfig from test
* Add docstring for CwmConfig
* remove global_window and window_pattern params from config
* Fix docstrings
* Revert change to auto docstring util
* lint
* Fixes minus test improvements
* Alter tests to simply check logits
* lint
* Have slow tests use repo, make CwmPretrainedModel passthrough
* Remove decoder layer implementation, use Llama3Decoder + CwmAttention
* Use linear w/o bias for CwmAttention, add token-level integration test
* Don't ignore config attention bias
* Remove attention bias parameter entirely from config
---------
Co-authored-by: galco <galco@meta.com>
* new masks
* fixes
* adjust comments
* fix unnecessary mask creation on sdpa
* simplify masks more
* propagate to other models
* style + repo consistency
* copies
* no comment
* fix attempt
* finally fix grounding dinos
* fix distilbert
* fix executorch
* move to own module
* address first few comments WIP
* revert device comments, simplify executorch further
* fix typo
* add a test for cuda graphs
* move cleanup...
* fix conflict with new main
* fix esm and evolla
* Update rt_detr docs to mention 640x640 input size
The authors of RT-DETR mention that the model was trained on 640x640 images and is meant to be used for inference on 640x640 images.
Also, the current implementation has certain quirks that make training or inference on images of different sizes problematic. For example,
the pixel masks used for batches of varying image sizes are discarded. I've added a few lines in the docs to notify the user about these issues (see the sketch below).
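A hedged example of pinning inference to the training resolution (checkpoint name and exact kwargs are assumptions):

```python
from transformers import AutoModelForObjectDetection, RTDetrImageProcessor

# Resize every image to the 640x640 resolution RT-DETR was trained on
processor = RTDetrImageProcessor.from_pretrained(
    "PekingU/rtdetr_r50vd", size={"height": 640, "width": 640}
)
model = AutoModelForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")
```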
* Batching not possible with variable image sizes
* Remove reference to batching
---------
Co-authored-by: Konstantinos Pitas <kostasp210@gmail.com>
* [new-models] LFM2-MoE
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [docs] add in template lfm2_moe doc files
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [configuration] update configuration class
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modular][lfm] minor: fix rotary_emb typo
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling] modular/modeling files for Lfm2Moe
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling][lfm2_moe] fix Lfm2Moe modular/modeling
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [configuration][lfm2_moe] update configuration keys with latest config changes
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] make fixup
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modular][lfm2_moe] address comments: dtype, mlp, buffers
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [configuration][lfm2_moe] add initializer_range
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modular][lfm2_moe] include init_weights to pass test_initialization
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [tests][causal_lm] include pos_emb as possible rope attribute
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling][lfm2_moe] remove load_balancing_loss_func due to lack of support for hooking expert biases
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] make style
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling][lfm2_moe] MoE refactor PR update in LFM2Moe
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [tests] lfm2_moe: unit tests
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] update LFM2-8B-A1B repo id
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [tests] lfm2: update ModelTests for lfm2
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* Update LFM2 documentation
Updated the LFM2 documentation to reflect the addition of a new model size and clarified architectural details.
* Add Lfm2Moe documentation
Add Lfm2Moe model documentation with overview and example usage.
* [misc] fix ci
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [docs] remove trust_remote_code
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] ci: fix modular
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* reapply modular
* simplify
* remove static address and inplace op
* simplify
* simplify a bit more the modular
* imports
---------
Signed-off-by: Paul Pak <paulpak58@gmail.com>
Co-authored-by: Maxime Labonne <81252890+mlabonne@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* [Cache] lfm2 cache: allocate empty kv layers during init
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [Cache] lfm2_cache: update modular file
Signed-off-by: Paul Pak <paulpak58@gmail.com>
---------
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* init commit
* style
* take comments into account
* merge with main and simplify
* nits
* final
* small fixes
* fix
* super small update!
* add another test
* up up
* update
* fixes
* sort them by default
* Use canonical get_size_with_aspect_ratio (with max_size) from transformers.image_transforms to fix #37939
* Fix import sorting/style
* Fix import order
* Refactor: use canonical get_size_with_aspect_ratio across image processors (except YOLOS)
This commit updates image processing utilities in multiple model processors to use the shared
transformers.image_transforms.get_size_with_aspect_ratio for consistent resizing logic and
aspect ratio handling.
YOLOS processors are intentionally left unchanged in this commit to preserve their current
behavior and avoid breaking model-specific padding/resizing assumptions. YOLOS will be updated
in a dedicated follow-up PR once compatibility is fully verified.
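A hedged usage sketch of the shared helper (treat the exact argument names as assumptions):

```python
from transformers.image_transforms import get_size_with_aspect_ratio

# Resize a 480x640 (height, width) image so the shorter side becomes 800 while
# capping the longer side at 1333, preserving the aspect ratio.
height, width = get_size_with_aspect_ratio((480, 640), size=800, max_size=1333)
print(height, width)  # shorter side scaled to 800, longer side scaled proportionally
```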
* ruff fixes
* Fix check_copies.py references for get_size_with_aspect_ratio to use canonical transformers.image_transforms version
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* Fix flash_attention.py: wrong argument passing for attn_implementation
The name of the attention-type argument for `_flash_attention_forward()` should be `implementation`, not `attn_implementation`, which is currently used in the function call. This results in the wrong type being specified.
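A hedged sketch of the corrected call (argument list abbreviated; only the keyword name is the point):

```python
from transformers.modeling_flash_attention_utils import _flash_attention_forward

def run_fa(query, key, value, attention_mask, q_len, config):
    # The backend selector must be passed as `implementation`; `attn_implementation=...`
    # would only be swallowed by **kwargs and silently ignored.
    return _flash_attention_forward(
        query, key, value, attention_mask, q_len,
        is_causal=True,
        implementation=getattr(config, "_attn_implementation", None),
    )
```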
* modify the kwargs inside _flash_attention_forward
* fix the doc
* fix typo
---------
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* update all models
* fix copies
* skip aria tests
* update other models
* skip should be in test, not tester
* i think this is more descriptive as a name
* find and replace for new models
The main content of this PR is to fix a bug in the delete_adapter method
of the PeftAdapterMixin. Previously, it did not take into account
auxiliary modules from PEFT, e.g. those added by modules_to_save. This
PR fixes this oversight.
Note that the PR uses a new functionality from PEFT that exposes
integration functions like delete_adapter. Those will be contained in
the next PEFT release, 0.18.0 (yet unreleased). Therefore, the bug is
only fixed when users have a PEFT version fulfilling this requirement.
I ensured that with old PEFT versions, the integration still works the
same as previously. The newly added test for this is skipped if the PEFT
version is too low.
(Note: I tested locally that the test passes with PEFT 0.18.0.)
While working on this, I also cleaned up the following:
- The active_adapter property has been deprecated for more than 2 years
(#26407). It is safe to remove it now.
- There were numerous small errors or outdated pieces of information in
the docstrings, which have been addressed.
When PEFT < 0.18.0 is used, although we cannot delete modules_to_save,
we can still detect them and warn about it.
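A hedged usage sketch of the fixed behaviour (model and adapter names are illustrative; requires PEFT >= 0.18.0 for the auxiliary-module cleanup):

```python
from peft import LoraConfig
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
peft_config = LoraConfig(target_modules=["c_attn"], modules_to_save=["lm_head"])
model.add_adapter(peft_config, adapter_name="my_adapter")

# Previously this left the modules_to_save copies behind; with PEFT >= 0.18.0 they are removed too
model.delete_adapter("my_adapter")
```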
* support aux loss in qwen3vlmoe
* update qwen3vl processor test!
* add integration tests for qwen3vl-30a3
* remove duplicated decorator
* code clean
* fix consistency
* do not inherit from nn.Linear for better quantization
* pass check
* allow private space id for trackio
* complete docstring
* Deprecate environment variables for Trackio integration; use TrainingArguments instead and deploy by default
* style
* Enhance documentation for Trackio Space ID in TrainingArguments
* update modeling mixtral
* oups
* fix
* better naming?
* compute softmax and top_k inside the experts
* update minimax as well
* models that will need an update
* more models that need a fix
* stash
* fix mixtral
* update olmoe
* update
* update
* current changes
* nits
* molmoe is now fixed
* olmoe is good to go!
* refactor qwen2_moe
* fixes
* fixed moe
* fix qwen2 modular
* nit
* qwen2_moe test script works
* tricky rope !
* fix qwen3
* DeepSeek v3 MoE Standardization (#40538)
* DeepSeek-v3
Shared
Shared
* Dependents of DS3
* Standardize GLM4V MoE (#40539)
* up
* Standardize VitPose's MoE (#40549)
* VitPose
* outside
* outside
* outside
* fix
* update dbrx
* dbrx... the magix
* Refactor Ernie 4.5's MoE (#40547)
* Isolate Ernie fixes
* fix moe
---------
Co-authored-by: Vasqu <antonprogamer@gmail.com>
* fix style
* style
* fix copies
* style
* latest changes
* fixes
* had to stage
* current updaters
* up
* another modular
* modular graniteMoe
* some update
* draft another modular moe
* updaters
* up
* fix nit
* q3 nit
* fix phi moe
* we're going up up up up its our mooooment
* fix switch transformers this time around
* up
* gptsan japanese is deprecated forget about it
* fix mixtral to not be a linear (gives us more freedom)
* update
* fix copies gone wrong try catch nothing
* fix mixtral
* new refactor again
* update aria as well
* up dbrx and deepseekv3
* nit
* fix phimoe?
* fix deepseek v3
* nits
* don't bother with this one please
* up olmoe
* ??
* fix olmoe
* yups
* fixup
* ish
* hot patch
* new qwen3
* updates
* up
* nit
* fix copies
* fix
* nits
* we're going up up up
* nits
* switch_transformers edge case
* lol modular gptsan?
* fix deepseek
* finally all modeling match modular
* update
* up
* up
* dang
* up
* up aria
* fix dbrx
* nits here and there
* finish fixing dbrx
* fix deepseek
* upd
* up
* fix flex olmo
* updated
* update jamba
* JAMBA is still a bit todo
* forward forward
* fix dots11
* update
* fix hunyuan
* fix some other
* update phimoe
* fuck you phimoe you are now submitted
* submit granitemoe as well
* try to fix some other models, reduces some of the failures
* fix olmoe and qwen2moe
* up
* up
* fix qwen2_moe
* update modular make it again, simpler
* nits
* up
* up
* fix
* some switch reductions
* up
* fix qwen3vl
* some fixes to jetmoe
* these should be shipped to the modular to fix jetmoe
* fix most of the nllb failures
* more nllb fixes
* fix the modular
* remove nllb modular as it sucks for now
* ?
* fix granitemoe
* granitemoehybrid doesn't have rope
* use rope when rope, no rope when no rope
* updates
* finish fixing dumbgrainite
* fix most of minimax
* fix
* update modular
* ?
* up
* up jetmoe still broken
* up
* fix, now align the moe
* fix jetmoe
* fix styling and qwen3 repo consistency
* update
* up up
* update ruff?
* nits
* modeling is good now for switch
* fix
* more fixes to switch!
* fix some switch tests
* ?
* ?
* up
* fix switch modular!
* nit?
* up
* subtest
* can't believe I wasted so much time on this...
* fix
* updates
* nits
* nit jamba is fucking annoying
* ?
* fix?
* oups
* good good
* styling
* up
* make sure qwen2 sliding works!
* fix dbrx small
* lol
* nits
* fix one test
* fix load balancing loss issue
* fix jamba
* fix nllbmoe
* fix jamba consistency and doc?
* up
* these are correct
* up
* up
* up
* some of the final cleanup
* update
* up
* fix some reverts in granitemoe
* bring back attention multipliers for the granite family; we'll see later on if they need removal
* small jamba fix docstring and typing
* fix phimoe
* yup
* fix unknown return_dict in granitemoes
* up
* fix qwen config
* fix phimoe check quality
* nits
* update based on caught non relative imports!
* fix dbrx
* Apply suggestions from code review
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* fix copies
* fixup
* fix dot1 regression!
* fix phimoe issue
* fix phi moe
* fix float() for some models
* fix jamba regression
* ui
* more dtype issues
* fix deepseek2 and 3?
* proper update
* fix modular deepseek!
* jamba jambaaaaaa
---------
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* fix multi-video timestamp bug in qwen3vl,glm4v
* run make fix-copies to sync modular files
* run make fix-copies to sync modular files
---------
Co-authored-by: UBT <daqin.luo@ubtrobot.com>
* Fix sliding window attn mask
* Clearer test
* Apply style fixes
* If Picasso made ascii drawings he would have made this
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* first attempt at removing
* copies
* last bits in core
* quick fixes
* tests purge
* docs and examples
* some fixes
* more
* another round of cleanups
* fix
* fix a bunch of models
* fix dummy bert
* fix
* fix new model
* fix signature change
* fix
* fix style/copies
* new models
* fix copies didnt find that damn
* test
* this shouldn't have happened during model addition
* Add num_items_in_batch computation to predict_step.
* address comments.
* Fix test cases.
* fixup
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Fix Qwen3-Omni audio_token_id serialization by overriding parent's attribute_map
- Override attribute_map in Qwen3OmniMoeThinkerConfig to prevent inheritance of incorrect mapping
- Parent class maps audio_token_id -> audio_token_index, but implementation uses audio_token_id directly
- Fixes issue where custom audio_token_id values were not preserved during save_pretrained/from_pretrained cycles
Fixes #41191
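A hedged sketch of the pattern (simplified; the real Qwen3-Omni config has more fields):

```python
from transformers import PretrainedConfig

class ParentThinkerConfig(PretrainedConfig):
    # Parent remaps the attribute, so `config.audio_token_id` writes to `audio_token_index`
    attribute_map = {"audio_token_id": "audio_token_index"}

class ThinkerConfig(ParentThinkerConfig):
    # Override with an empty map: the implementation reads `audio_token_id` directly,
    # so custom values now survive save_pretrained/from_pretrained round-trips
    attribute_map = {}
```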
* embed timeline in docs (test web component and Iframe)
* test scaling
* test multiple scales
* compensate scale in width
* set correct style and scale
* remove bottom space created by scale
* add timeline as a separate page
* reformulate docs after review
* initial comment
* test
* initial conversion for outline
* intermediate commit for configuration
* chore:init files for sam2
* adding arbitrary undefined config
* check
* add vision
* make style
* init sam2 base model
* Fix imports
* Linting
* chore:sam to sam2 classes
* Linting
* Add sam2 to models.__init__
* chore:match prompt encoder with sam2 code
* chore:prepare kwargs for mask decoder
* Add image/video predictors
* Add CUDA kernel
* Add output classes
* linting
* Add logging info
* tmp commit
* docs for sam2
* enable image processing
* check difference of original SAM2
- difference is the order of ToTensor()
- please see https://pytorch.org/vision/main/_modules/torchvision/transforms/functional.html#resize
* enable promptencoder of sam2
* fix promptencoder
* Confirmed that PromptEncoder is exactly same (Be aware of bfloat16 and float32 difference)
* Confirmed that ImageEncoder is exactly same (Be aware the linting of init)
* Confirmed that MaskDecoder is exactly same (TO DO: lint variable name)
* SamModel is now available (Need more chore for name)
* make fix-copies
* make style
* make CI happy
* Refactor VisionEncoder and PositionEmbedding
* TO DO : fix the image_embeddings and sparse_embeddings part
* pure image inference done
* reusable features fix and make style
* styling
* refactor memoryattention
* tmp
* tmp
* refactor memoryencoder
TO DO : convert and inference the video pipeline
* TO DO : fix the image_encoder shape
* conversion finish
TO DO: need to check video inference
* make style
* remove video model
* lint
* change
* python utils/check_docstrings.py --check_all
* python utils/check_config_attributes.py
* remove copies for sam2promptencoder due to configuration
* change __init__.py
* remove tensorflow version
* fix that to not use direct comparison
* make style
* add missing import
* fix image_embedding_size
* refactor Sam2 Attention
* add fully working video inference (refactoring todo)
* clarify _prepare_memory_conditioned_features
* simplify modeling code, remove unused paths
* use one model
* use auto_docstring
* refactor rope embeddings
* nit
* not using multimask when several points given
* add all sam2.1
* add video tmp
* add Sam2VideoSessionState + fast image proc + video proc
* remove init_states from model
* fix batch inference
* add image integration tests
* uniformize modeling code with other sam models and use modular
* pass vision tests an most model tests
* All tests passing
* add offloading inference state and video to cpu
* fix inference from image embedding and existing mask
* fix multi_boxes mask inference
* Fix batch images + batch boxes inference
* improve processing for image inference
* add support for mask generation pipeline
* add support for get_connected_components post processing in mask generation
* add fast image processor sam, image processor tests and use modular for sam2 image processor
* fix mistake in sam after #39120
* fix init weights
* refactor convert
* add integration tests for video + other improvements
* add needed missing docstrings
* Improve docstrings
* improve inference speed by avoiding cuda sync
* add test
* skip test for vision_model
* minor fix for vision_model
* fix vision_model by adding sam2model and change the torch dependencies
* remove patch_size
* remove image_embedding_size
* fix patch_size
* fix test
* make style
* Separate hieradet and vision encoder in sam2
* fixup
* review changes part 1
* remove MemoryEncoderConfig and MemoryAttentionConfig
* pass q_stride instead of q_pool module
* add inference on streamed videos
* explicitely process streamed frames
* nit
* Improve docstrings in Sam2Model
* update sam2 modeling with better management of inference state and cache, and separate Sam2Model and Sam2VideoModel
* improve video inference api
* change inference_state to inference_session
* use modular for Sam2Model
* fix convert sam2 hf
* modular
* Update src/transformers/models/sam2/video_processing_sam2.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix minor config
* fix attention loading error
* update modeling tests to use hub checkpoints
* Use CI A10 runner for integration tests values + higher tolerance for video integration tests
* PR review part 1
* fix doc
* nit improvements
* enforce one input format for points, labels and boxes
* nit
* last few nits from PR review
* fix style
* fix the input type
* fix docs
* add sam2 model as conversion script
* improve sam2 doc
* add rough necessary changes
* first working edgetam
* fix issue with object pointers
* Use modular as much as possible
* nit fixes + optimization
* refactor spatial perceiver
* cleanup after merge
* add working edgetam
* improve perceiver resampler code
* simplify/unify rope attention logic
* Improve comments in apply_rotary_pos_emb_2d
* add working tests
* fix test timmwrapper
* add docs
* make fixup
* nits
* fix modular
* fix modular
* PR review part 1
* split apply_rotary_pos_emb_2d
* add granularity to _prepare_memory_conditioned_features
* add dates to doc
* add separate mlp for memory attention
* Fix memory on wrong device
* store processed frames in dict
* update checkpoints in tests
* update dates
---------
Co-authored-by: sangbumchoi <danielsejong55@gmail.com>
Co-authored-by: RUFFY-369 <prakarshkaushik369@gmail.com>
Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com>
Co-authored-by: Haitham Khedr <haithamkhedr@meta.com>
Co-authored-by: sangbum choi <sangbumchoi@sangbumui-MacBookAir.local>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix param_needs_quantization
* rewrite most hqq
* clean
* fix
* comment
* remove it from exception of safetensors
* start on bnb 4bits
* post-rebase fix
* make bnb4 bit a good citizen
* remove forgotten print
* make bnb 8bits a good citizen
* better hqq
* fix
* clean
* remove state dict from signature
* switch method
* make torchao a good citizen
* fixes
* fix torchao
* add check
* typo
* Fix attention sink implementation in flex attention
* fix dim
* fix
* Remove print
* raise error when return_lse is False yet s_aux is provided
* Clean test files for merge
* Update src/transformers/integrations/flex_attention.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* force return lse
* Add to doc
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix(trainer): Avoid moving model with device_map
When a model is loaded with `device_map="auto"` and is too large to fit on a single GPU, `accelerate` will offload some layers to the CPU or disk. The `Trainer` would previously attempt to move the entire model to the specified device, causing a `RuntimeError` because a model dispatched with `accelerate` hooks cannot be moved.
This commit fixes the issue by adding a check in `_move_model_to_device` to see if the model has an `hf_device_map` attribute. If it does, the device placement is assumed to be handled by `accelerate`, and the `model.to(device)` call is skipped.
A regression test is added to ensure the `Trainer` can be initialized with a model that has a `hf_device_map` that simulates offloading without raising an error.
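A hedged, standalone version of the guard (the real Trainer method differs in details):

```python
import logging
import torch

logger = logging.getLogger(__name__)

def move_model_to_device(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # Models dispatched by accelerate carry an `hf_device_map`; moving them wholesale would raise
    if getattr(model, "hf_device_map", None) is not None:
        logger.warning("Model is already dispatched by accelerate; skipping `.to(device)`.")
        return model
    return model.to(device)
```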
* Added the logger warning for the move model
---------
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
* fix(trainer): Fix the issue of inaccurate token count in training sessions
During the training process, the initial token count was not saved, leading to inaccurate speed calculation. Now, the initial token count is saved and the increment during the session is calculated, ensuring that the speed metric accurately reflects the performance of the current training session.
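A hedged sketch of the bookkeeping described above (attribute names are illustrative):

```python
class TokenSpeedTracker:
    def __init__(self, initial_num_tokens: int):
        # Remember the count at session start so a resumed run doesn't inflate throughput
        self.initial_num_tokens = initial_num_tokens

    def tokens_this_session(self, current_num_tokens: int) -> int:
        # Only the increment accumulated during this session counts toward speed metrics
        return current_num_tokens - self.initial_num_tokens
```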
* Fix errors
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* halfway through the models
* update test checks
* refactor all
* another one
* use tuples
* more deletions
* solve bad inheritance patterns
* type
* PR ready?
* automatic model class inference from the base class
* vaultgemma
* make fixup
* make fixup
* rebase with gpt2
* make fixup :'(
* gpt2 is special
* XPU supports gpt-oss MXFP4
* Complete MXFP4 UT file and comment information
* Complete MXFP4 UT file and comment information
* Fix code style
* Fix code style
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update CI workflows to use devmi355 branch
* Add workflow trigger for AMD scheduled CI caller
* Remove unnecessary blank line in workflow YAML
* Add trigger for workflow_run on main branch
* Update workflow references from devmi355 to main
* Change runner_scale_set to runner_group in CI config
* Add FA to docker
* Fixed padding for modernbert
* Fixed logits and hidden states extraction in ModernBertForMultipleChoice
* Added a test for ModernBertForMultipleChoice
* fixes
* More fixes and GREEN CI
* consistency
* moar consistency
* Add FA to docker
* Use caching mechanism for qwen2_5
* Fix a typo in important models list
* Partial fixes for gemma3
* Added a commit ID for FA repo
* Detailed the expectation storage format
* Rebase fix
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* remove unexpected keys from inputs (they have nothing to do there)
* remove input
* simplify a lot init
* fix
* fix check for non-persistent buffer
* revert because too many old and bad models...
* remove comment
* type hint
* make it a real test
* remove model_to_load -> always use the same model
* typo
* remove legacy offload_folder (we never waste that memory anymore)
* do not change prefix anymore
* change very bad function name
* create adjust method
* remove useless method
* restrict
* BC
* remove unused method
* CI
* remove unused args
* small fix
* fix
* CI
* CI
* avoid too many loops
* fix regex
* cleaner
* typo
* fix
* fix
* Adapt and test huggingface_hub v1.0.0.rc0
* forgot to bump hfh
* bump
* code quality
* code quality
* relax dependency table
* fix has_file
* install hfh 1.0.0.rc0 in circle ci jobs
* repository
* push to hub now returns a commit url
* catch HfHubHTTPError
* check commit on branch
* add it back
* fix ?
* remove deprecated test
* uncomment another test
* trigger
* no proxies
* many more small changes
* fix load PIL Image from httpx
* require 1.0.0.rc0
* fix mocked tests
* fix others
* unchange
* unchange
* args
* Update .circleci/config.yml
* Bump to 1.0.0.rc1
* bump kernels version
* fix deps
* fix mismatched dims for qwen3 next
* propagate changes
* chore: renamed tot_heads to total_sequence_length
* Apply suggestion from @vasqu
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* minor fix to modular qwen3 next file
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* add gguf config mapping for lfm2
* add lfm2 tensor process to unsqueeze conv weights
* adjust values from gguf config to HF config
* add test for lfm2 gguf
* ruff
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
@ -3,7 +3,7 @@ name: Self-hosted runner scale set (AMD mi355 scheduled CI caller)
# Note: For every job in this workflow, the name of the runner scale set is finalized in the runner yaml i.e. huggingface/hf-workflows/.github/workflows/transformers_amd_ci_scheduled_arc_scale_set.yaml
# For example, 1gpu : amd-mi355-ci-1gpu
# 2gpu : amd-mi355-ci-2gpu
on:
  workflow_run:
    workflows: ["Self-hosted runner (AMD scheduled CI caller)"]
RUN_SLOW: yes  # For gated repositories, we still need to agree to share information on the Hub repo page in order to get access. # This token is created under the bot `hf-transformers-bot`.
# Because SSH can be enabled dynamically if the workflow failed, we need to store Slack infos to be able to retrieve them during the wait-for-ssh step
Like the slow tests, there are other environment variables available which are not enabled by default during testing:
- `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.
More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py).
"The French Revolution was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799.",
"Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse.",
"It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.",
"Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614.",
"The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June.",
"The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.",
"The next three years were dominated by a struggle for political control.",
"King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792.",
"As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.",
"After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical Jacobins led by Maximilien Robespierre.",
"About 16,000 people were sentenced by the Revolutionary Tribunal and executed in the Reign of Terror, which ended in July 1794 with the Thermidorian Reaction.",
"Weakened by external threats and internal opposition, the Committee of Public Safety was replaced in November 1795 by the Directory.",
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
])  # fmt: skip
def compact_json_numeric_arrays(data: dict):
# Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
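A hedged sketch of what such a helper could look like (the actual implementation in the test utilities may differ):

```python
import json
import re

def compact_json_numeric_arrays(data: dict) -> str:
    # Dump with indentation, then collapse arrays containing only numbers,
    # whitespace, commas, and newlines onto a single line
    text = json.dumps(data, indent=2)
    numeric_array = re.compile(r"\[\s*((?:-?\d+(?:\.\d+)?\s*,?\s*)+)\]", re.DOTALL)
    return numeric_array.sub(
        lambda m: "[" + re.sub(r"\s+", " ", m.group(1)).strip().rstrip(",") + "]", text
    )
```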
1. Start with the `_toctree.yml` file that corresponds to your documentation chapter. This file is essential for rendering the table of contents on the website.
- If the `_toctree.yml` file doesn't exist for your language, create one by copying the English version and removing unrelated sections.
- Ensure it is placed in the `docs/source/LANG-ID/` directory.
Here’s an example structure for the `_toctree.yml` file:
<figcaption class="mt-2 text-center text-sm text-gray-500">The image shows a diagram of the Swin model's stages.</figcaption>
</div>
[`AutoBackbone`] lets you use pretrained models as a backbone to get feature maps from different stages of the backbone. You should specify one of the following parameters in [`~PreTrainedConfig.from_pretrained`]:
* `out_indices` is the index of the layer you want to get the feature map from
* `out_features` is the name of the layer you want to get the feature map from
Once you are happy with your model configuration, you can save it with [`~PreTrainedConfig.save_pretrained`]. Your configuration file is stored as a JSON file in the specified save directory:
In our example, we will tweak a few arguments of the ResNet class that we might want to adjust. Different configurations will give us the different possible ResNet types. We store these arguments after checking their validity.
```python
from transformers import PreTrainedConfig
from typing import List

class ResnetConfig(PreTrainedConfig):
    model_type = "resnet"

    def __init__(
@ -58,11 +58,11 @@ class ResnetConfig(PretrainedConfig):
```
The three important things to remember when writing your own configuration are:
- it has to inherit from `PreTrainedConfig`,
- the `__init__` of your `PreTrainedConfig` has to accept any kwargs,
- those kwargs need to be passed along to the superclass `__init__`.

The inheritance is to make sure you get all the functionality from the 🤗 Transformers library, while the two other constraints come from the fact that `PreTrainedConfig` has more fields than the ones you are setting. When reloading a config with the `from_pretrained` method, those fields need to be accepted by your config and then sent to the superclass.

Defining a `model_type` for your configuration (here `model_type="resnet"`) is not mandatory, unless you want to register your model with the auto classes (see the last section).
There are many [generation strategies](generation_strategies), and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them.
Let's see what peak GPU memory consumption 4-bit quantization gives. Quantizing the model to 4-bit can be done with the same API as before - this time by passing `load_in_4bit=True` instead of `load_in_8bit=True`.
```python
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", quantization_config=BitsAndBytesConfig(load_in_4bit=True), pad_token_id=0)
Our OctoCoder model now gets a significantly longer input prompt that includes a so-called *system prompt*. System prompts are used to steer the LLM into becoming a better assistant tailored to the users' tasks.
In the following, we use a system prompt that will make OctoCoder a better coding assistant.
```python
system_prompt = """Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn't let caution get too much in the way of being useful.
The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
-----
Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.
Answer: Sure. Here is a function that does that.
def alternating(list1, list2):
results = []
for i in range(len(list1)):
results.append(list1[i])
results.append(list2[i])
return results
Question: Can you write some test cases for this function?
Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
Answer: Here is the modified function.
def alternating(list1, list2):
results = []
for i in range(min(len(list1), len(list2))):
results.append(list1[i])
results.append(list2[i])
if len(list1) > len(list2):
results.extend(list1[i+1:])
else:
results.extend(list2[i+1:])
return results
-----
"""
```
For demonstration purposes, we duplicate the system prompt ten times so that the input length is long enough to observe Flash Attention's memory savings.
We append the original text prompt "Question: Please write a function in Python that transforms bytes to Giga bytes.".
```python
long_prompt = 10 * system_prompt + prompt
```
We instantiate our model again in bfloat16 precision.
```python
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", dtype=torch.bfloat16, device_map="auto")
Let's now run the model just like before *without Flash Attention* and measure the peak GPU memory requirement and inference time.
```python
import time
start_time = time.time()
result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
print(f"Generated in {time.time() - start_time} seconds.")
result
```
**Output**:
```
Generated in 10.96854019165039 seconds.
Sure. Here is a function to do that.

def bytes_to_giga(bytes):
    return bytes / 1024 / 1024 / 1024

Answer: Sure. Here is a function to do that.

def
```
We get the same output as before, but this time the model repeats the answer until it is cut off at 60 tokens. This is not surprising: we repeated the system prompt ten times for demonstration purposes, thereby prompting the model to repeat itself.
**Note** The system prompt should not be repeated ten times in real-world applications - once is enough!
As we can see, the peak GPU memory requirement is now significantly higher than it was at the beginning, which is largely due to the longer input sequence. Also, generation now takes a little over a minute.
For comparison, let's run the same function but with Flash Attention enabled instead.
To do so, we convert the model to [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview), which thereby enables PyTorch's [SDPA self-attention](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), which in turn is able to use Flash Attention.
```python
model.to_bettertransformer()
```
Now we run the exact same code snippet as before, and under the hood Transformers will make use of Flash Attention.
```py
start_time = time.time()
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
print(f"Generated in {time.time() - start_time} seconds.")
result
```
**Output**:
```
Generated in 3.0211617946624756 seconds.
Sure. Here is a function to do that.

def bytes_to_giga(bytes):
    return bytes / 1024 / 1024 / 1024

Answer: Sure. Here is a function to do that.

def
```
We get the exact same result as before, but can observe a significant speed-up thanks to Flash Attention.
We are almost back to our original 29GB of peak GPU memory.
We can observe that we only use roughly 100MB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequence, as done at the beginning.
```py
flush()
```
For more information on how to use Flash Attention, please have a look at [this doc page](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#flashattention-2).
## 3. Architectural Innovations
So far, we have looked into improving computational and memory efficiency by:
@ -640,7 +472,7 @@ for _ in range(5):
next_token_id = torch.argmax(next_logits, dim=-1)
print("shape of input_ids", next_token_id.shape)
print("length of key-value cache", len(past_key_values[0][0])) # past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim]
print("length of key-value cache", past_key_values.get_seq_length()) # past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim]
Another helpful option to enable is resuming training from a previous checkpoint. This ensures you can pick up where you left off without starting over if your training gets interrupted. There are two methods to resume training from a checkpoint.
The first method uses the `output_dir previous_output_dir` argument to resume training from the latest checkpoint stored in `output_dir`. In this case, you should remove `overwrite_output_dir`:
Check out the [Launching your Accelerate scripts](https://huggingface.co/docs/accelerate/basic_tutorials/launch) tutorial to learn more about `accelerate_launch` and custom configurations.
@ -53,7 +53,7 @@ So let's dive a little deeper into the general design of the library
### Overview of the models
To successfully add a model, it is important to understand the interaction between your model and its configuration,
[`PreTrainedModel`] and [`PreTrainedConfig`]. For exemplary purposes, we will call the model to be added to 🤗 Transformers `BrandNewBert`.
Let's take a look:
@ -81,10 +81,10 @@ model.config # model has access to its config
```
Similar to the model, the configuration inherits basic serialization and deserialization functionalities from
[`PreTrainedConfig`]. Note that the configuration and the model are always serialized into two
different formats - the model into a *pytorch_model.bin* file and the configuration into a *config.json* file. Calling
[`~PreTrainedModel.save_pretrained`] will automatically call
[`~PreTrainedConfig.save_pretrained`], so that both the model and the configuration are saved.
@ -119,12 +119,12 @@ And that's it! With just a few lines of code you can harness the power of an LLM
There are many [generation strategies](generation_strategies), and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them.
Another helpful option to enable is resuming training from a previous checkpoint. This way you can pick up where you left off without starting over if your training gets interrupted. There are two methods to resume training from a checkpoint.
The first method uses the `output_dir previous_output_dir` argument to resume training from the latest checkpoint stored in `output_dir`. In this case, you should remove `overwrite_output_dir`:
The second method uses the `resume_from_checkpoint path_to_specific_checkpoint` argument to resume training from a specific checkpoint folder.
To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
@ -51,7 +51,7 @@ This section describes how the model and configuration classes interact and the
### Model and configuration
All Transformers' models inherit from a base [`PreTrainedModel`] and [`PreTrainedConfig`] class. The configuration is the model's blueprint.
There are never more than two levels of abstraction for any model to keep the code readable. The example model here, BrandNewLlama, inherits from `BrandNewLlamaPreTrainedModel` and [`PreTrainedModel`]. It is important that a new model only depends on [`PreTrainedModel`] so that it can use the [`~PreTrainedModel.from_pretrained`] and [`~PreTrainedModel.save_pretrained`] methods.
@ -66,9 +66,9 @@ model = BrandNewLlamaModel.from_pretrained("username/brand_new_llama")
model.config
```
[`PreTrainedConfig`] provides the [`~PreTrainedConfig.from_pretrained`] and [`~PreTrainedConfig.save_pretrained`] methods.
When you use [`PreTrainedModel.save_pretrained`], it automatically calls [`PreTrainedConfig.save_pretrained`] so that both the model and configuration are saved together.
A model is saved to a `model.safetensors` file and a configuration is saved to a `config.json` file.
It mostly works thanks to the `mask_function`, which is a `Callable` in the form of [torch's mask_mod functions](https://pytorch.org/blog/flexattention/), taking 4 indices as input and returning a boolean to indicate if this position should take part in the attention computation.
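For reference, a causal mask function following that convention could look like this (a minimal sketch):

```python
import torch

def causal_mask_function(batch_idx: torch.Tensor, head_idx: torch.Tensor,
                         q_idx: torch.Tensor, kv_idx: torch.Tensor) -> torch.Tensor:
    # Allow a query position to attend only to itself and earlier key/value positions
    return kv_idx <= q_idx
```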
If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py).
@ -210,9 +210,9 @@ There are some rules for documenting different types of arguments and they're li
This can span multiple lines.
```
* Include `type` in backticks.
* Add *optional* if the argument is not required or has a default value.
* Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.
These arguments can also be passed to `@auto_docstring` as a `custom_args` argument. It is used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.
8. Unrolling kwargs typed with the unpack operator. For specific methods (defined in `UNROLL_KWARGS_METHODS`) or classes (defined in `UNROLL_KWARGS_CLASSES`), the decorator processes `**kwargs` parameters that are typed with `Unpack[KwargsTypedDict]`. It extracts the documentations from the `TypedDict` and adds each parameter to the function's docstring.
Currently only supported for [`FastImageProcessorKwargs`].
Load a backbone with [`~PreTrainedConfig.from_pretrained`] and use the `out_indices` parameter to determine which layer, given by the index, to extract a feature map from.
```py
from transformers import AutoBackbone
@ -46,7 +46,7 @@ There are two ways to load a Transformers backbone, [`AutoBackbone`] and a model
<hfoptionsid="backbone-classes">
<hfoptionid="AutoBackbone">
The [AutoClass](./model_doc/auto) API automatically loads a pretrained vision model with [`~PreTrainedConfig.from_pretrained`] as a backbone if it's supported.
Set the `out_indices` parameter to the layer you'd like to get the feature map from. If you know the name of the layer, you could also use `out_features`. These parameters can be used interchangeably, but if you use both, make sure they refer to the same layer.
The query (`Q`), key (`K`), and value (`V`) matrices are projections from the input embeddings of shape `(b, h, T, d_head)`.
For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means $ K_{\text{past}} $ and $ V_{\text{past}} $ can be cached and reused to compute the last token's representation.
At inference time, you only need the last token's query to compute the representation $ x_t $ that predicts the next token $ t+1 $. At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values.
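A minimal sketch of the append step, assuming the usual `[batch_size, num_heads, seq_len, head_dim]` layout:

```python
import torch

batch_size, num_heads, head_dim = 1, 8, 64
k_cache = torch.empty(batch_size, num_heads, 0, head_dim)
v_cache = torch.empty(batch_size, num_heads, 0, head_dim)

# At each decoding step, only the new token's K/V are computed and appended to the cache
k_new = torch.randn(batch_size, num_heads, 1, head_dim)
v_new = torch.randn(batch_size, num_heads, 1, head_dim)
k_cache = torch.cat([k_cache, k_new], dim=-2)
v_cache = torch.cat([v_cache, v_new], dim=-2)
```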
@ -59,7 +59,7 @@ Refer to the table below to compare how caching improves efficiency.
| without caching | with caching |
|---|---|
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` |
| attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |
## Cache class
@ -98,9 +98,10 @@ The example below demonstrates how to create a generation loop with [`DynamicCac
@ -136,16 +137,17 @@ The cache position tracks where to insert new tokens in the attention cache. It
Cache position is used internally for two purposes:
1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven't been cached yet are passed to the model's `forward`.
2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, that pre-allocates a specific cache length.
The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.
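As a rough sketch of the bookkeeping (not verbatim library code), the positions advance like this:

```py
import torch

# prefill: one position per prompt token
cache_position = torch.arange(10)      # prompt of 10 tokens -> [0, 1, ..., 9]

# decoding: one new token per step, so advance the position by one each time
for _ in range(3):
    cache_position = cache_position[-1:] + 1
    print(cache_position)              # tensor([10]), tensor([11]), tensor([12])
```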
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].
The legacy format is essentially the same data structure but organized differently.
- It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer.
- The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
- The format is less flexible and doesn't support features like quantization or offloading.
If your project depends on this legacy format, we recommend converting to [`DynamicCache`] with [`~DynamicCache.from_legacy_cache`]. Note that the legacy cache format is deprecated and no longer used in Transformers. You can convert back to the tuple format with the [`DynamicCache.to_legacy_cache`] function, which is helpful if you have custom logic for manipulating a cache in a specific format.
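For example, assuming you still have a tuple-of-tuples cache produced by older code, the round trip looks roughly like this:

```py
import torch
from transformers import DynamicCache

# a toy legacy cache: one layer, key/value of shape [batch, heads, seq_len, head_dim]
legacy_cache = ((torch.randn(1, 2, 4, 8), torch.randn(1, 2, 4, 8)),)

cache = DynamicCache.from_legacy_cache(legacy_cache)   # legacy tuples -> Cache object
round_trip = cache.to_legacy_cache()                   # Cache object -> legacy tuples
```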
@ -77,9 +77,9 @@ Mistral-7B-Instruct uses `[INST]` and `[/INST]` tokens to indicate the start and
The input to `apply_chat_template` should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker, and the `content` key contains the message. The common roles are:
- `user` for messages from the user
- `assistant` for messages from the model
- `system` for directives on how the model should act (usually placed at the beginning of the chat)
[`apply_chat_template`] takes this list and returns a formatted sequence. Set `tokenize=True` if you want to tokenize the sequence.
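A minimal sketch, using [SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) as an example checkpoint:

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a chat template?"},
]

# returns a single formatted string; set tokenize=True to get token ids instead
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```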
@ -124,7 +124,7 @@ Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopte
> [!WARNING]
> Some tokenizers add special `<bos>` and `<eos>` tokens. Chat templates should already include all the necessary special tokens, and adding additional special tokens is often incorrect or duplicated, hurting model performance. When you format text with `apply_chat_template(tokenize=False)`, make sure you set `add_special_tokens=False` if you tokenize later to avoid duplicating these tokens.
> This isn't an issue if you use `apply_chat_template(tokenize=True)`, which means it's usually the safer option!
### add_generation_prompt
@ -168,7 +168,7 @@ Can I ask a question?<|im_end|>
When `add_generation_prompt=True`, `<|im_start|>assistant` is added at the end to indicate the start of an `assistant` message. This lets the model know an `assistant` response is next.
Not all models require generation prompts, and some models, like [Llama](./model_doc/llama), don't have any special tokens before the `assistant` response. In these cases, [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) has no effect.
> You shouldn't use [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) and [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) together. The former adds tokens that start a new message, while the latter removes end of sequence tokens. Using them together returns an error.
[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
@ -56,7 +56,7 @@ out = pipe(text=messages, max_new_tokens=128)
print(out[0]['generated_text'][-1]['content'])
```
```text
Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
@ -115,7 +115,7 @@ Some vision models also support video inputs. The message format is very similar
- The content `"type"` should be `"video"` to indicate the content is a video.
- For videos, it can be a link to the video (`"url"`) or it could be a file path (`"path"`). Videos loaded from a URL can only be decoded with [PyAV](https://pyav.basswood-io.com/docs/stable/) or [Decord](https://github.com/dmlc/decord).
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you've already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save them to files or host them at a URL.
> [!WARNING]
> Loading a video from `"url"` is only supported by the PyAV or Decord backends.
@ -188,7 +188,7 @@ The example below shows how a tool is defined in JSON schema format.
An example of handling tool definitions in a chat template is shown below. The specific tokens and layouts should be changed to match the ones the model was trained with.
```jinja
{%- if tools %}
{%- for tool in tools %}
{{- '<tool>' + tool['function']['name'] + '\n' }}
@ -226,7 +226,7 @@ Tool calls are generally passed in the `tool_calls` key of an `"assistant”` me
A common pattern for handling tool calls is shown below. You can use this as a starting point, but make sure your template actually matches the format the model was trained with!
@ -6,13 +6,13 @@ rendered properly in your Markdown viewer.
This page regroups resources around 🤗 Transformers developed by the community.
## Community resources
| Resource | Description | Author |
|:----------|:-------------|------:|
| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learned/revised using [Anki](https://apps.ankiweb.net/), an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |
@ -21,9 +21,10 @@ where `port` is the port used by `transformers serve` (`8000` by default). On th
</h3>
You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order:
1. Unselect ALL models in the list above (e.g. `gpt4`, ...);
2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`)
3. Add some random text to OpenAI API Key. This field won't be used, but it can't be empty;
4. Add the https address from `ngrok` to the "Override OpenAI Base URL" field, appending `/v1` to the address (i.e. `https://(...).ngrok-free.app/v1`);
@ -25,12 +25,12 @@ This guide will show you how to customize a ResNet model, enable [AutoClass](./m
## Configuration
A configuration, given by the base [`PreTrainedConfig`] class, contains all the necessary information to build a model. This is where you'll configure the attributes of the custom ResNet model. Different attributes give different ResNet model types.
The main rules for customizing a configuration are:
1. A custom configuration must subclass [`PreTrainedConfig`]. This ensures a custom model has all the functionality of a Transformers' model such as [`~PreTrainedConfig.from_pretrained`], [`~PreTrainedConfig.save_pretrained`], and [`~PreTrainedConfig.push_to_hub`].
2. The [`PreTrainedConfig`] `__init__` must accept any `kwargs` and they must be passed to the superclass `__init__`. [`PreTrainedConfig`] has more fields than the ones set in your custom configuration, so when you load a configuration with [`~PreTrainedConfig.from_pretrained`], those fields need to be accepted by your configuration and passed to the superclass.
> [!TIP]
> It is useful to check the validity of some of the parameters. In the example below, a check is implemented to ensure `block_type` and `stem_type` belong to one of the predefined values.
@ -38,10 +38,10 @@ The main rules for customizing a configuration are:
> Add `model_type` to the configuration class to enable [AutoClass](./models#autoclass) support.
```py
from transformers import PreTrainedConfig
from typing import List

class ResnetConfig(PreTrainedConfig):
    model_type = "resnet"

    def __init__(
@ -74,7 +74,7 @@ class ResnetConfig(PretrainedConfig):
super().__init__(**kwargs)
```
Save the configuration to a JSON file in your custom model folder, `custom-resnet`, with [`~PreTrainedConfig.save_pretrained`].
With the custom ResNet configuration, you can now create and customize the model. The model subclasses the base [`PreTrainedModel`] class. Like [`PreTrainedConfig`], inheriting from [`PreTrainedModel`] and initializing the superclass with the configuration extends Transformers' functionalities such as saving and loading to the custom model.
Transformers' models follow the convention of accepting a `config` object in the `__init__` method. This passes the entire `config` to the model sublayers, instead of breaking the `config` object into multiple arguments that are individually passed to the sublayers.
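As a toy sketch of this convention (the `ToyConfig`/`ToyModel` names are invented here for illustration; recent releases expose the config base class as [`PreTrainedConfig`], older ones spell it `PretrainedConfig`):

```py
from torch import nn
from transformers import PreTrainedConfig, PreTrainedModel

class ToyConfig(PreTrainedConfig):
    model_type = "toy"

    def __init__(self, hidden_size=64, **kwargs):
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

class ToyModel(PreTrainedModel):
    config_class = ToyConfig

    def __init__(self, config):
        # the whole config object is passed down; sublayers read what they need from it
        super().__init__(config)
        self.layer = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states):
        return self.layer(hidden_states)

model = ToyModel(ToyConfig())
```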
@ -235,7 +235,7 @@ from resnet_model.configuration_resnet import ResnetConfig
Copy the code from the model and configuration files. To make sure the AutoClass objects are saved with [`~PreTrainedModel.save_pretrained`], call the [`~PreTrainedConfig.register_for_auto_class`] method. This modifies the configuration JSON file to include the AutoClass objects and mapping.
For a model, pick the appropriate `AutoModelFor` class based on the task.
@ -294,7 +294,7 @@ Consider running a [benchmark](https://github.com/microsoft/DeepSpeed/issues/998
The example ZeRO-3 and ZeRO-Infinity config below sets most of the parameter values to `auto`, but you can also manually configure these values.
```json
{
  "fp16": {
    "enabled": "auto",
@ -383,7 +383,7 @@ Gradient checkpointing saves memory by only storing *some* of the intermediate a
The batch size can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets `train_micro_batch_size_per_gpu` and `train_batch_size` to the value of `world_size * per_device_train_batch_size * gradient_accumulation_steps`.
```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
@ -400,7 +400,7 @@ Reduce operations are lossy, for example, when gradients are averaged across mul
Choose the communication data type by setting the `communication_data_type` parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction operation is accumulated in fp32 and when it is ready, it's downcasted to whichever half-precision data type you're training in.
```json
{
  "communication_data_type": "fp32"
}
@ -412,7 +412,7 @@ Gradient accumulation accumulates gradients over several mini-batches of data be
Gradient accumulation can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `gradient_accumulation_steps`.
```json
{
  "gradient_accumulation_steps": "auto"
}
@ -424,7 +424,7 @@ Gradient clipping is useful for preventing exploding gradients which can lead to
Gradient clipping can be automatically configured or manually set. When you choose the `"auto"` option, [`Trainer`] sets it to the value of `max_grad_norm`.
```json
{
  "gradient_clipping": "auto"
}
@ -439,7 +439,7 @@ Mixed precision accelerates training speed by performing some calculations in ha
Train in fp32 if a model wasn't pretrained in mixed precision because it may cause underflow or overflow errors. Disable fp16, the default, in this case.
```json
{
  "fp16": {
    "enabled": false
@ -452,9 +452,9 @@ For Ampere GPUs and PyTorch 1.7+, the more efficient [tf32](https://pytorch.org/
</hfoption>
<hfoptionid="fp16">
To configure fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically enables or disables fp16 based on the value of `fp16` or `fp16_full_eval`, and the rest of the config can be set by you. fp16 is also enabled from the command line when the `--fp16` or `--fp16_full_eval` arguments are passed.
```json
{
  "fp16": {
    "enabled": "auto",
@ -469,28 +469,17 @@ To configure AMP-like fp16 mixed precision, set up the config as shown below wit
For additional DeepSpeed fp16 training options, take a look at the [FP16 Training Options](https://www.deepspeed.ai/docs/config-json/#fp16-training-options) reference.
To configure Apex-like fp16 mixed precision, set up the config as shown below with `"auto"` or your own values. [`Trainer`] automatically configures `amp` based on the values of `fp16_backend` and `fp16_opt_level`. It can also be enabled from the command line when the following arguments are passed: `--fp16`, `--fp16_backend apex` or `--fp16_opt_level 01`.
```yaml
{
"amp": {
"enabled": "auto",
"opt_level": "auto"
}
}
```
</hfoption>
<hfoptionid="bf16">
> [!TIP]
> bf16 requires DeepSpeed 0.6.0.
bf16 has the same dynamic range as fp32, and doesn't require loss scaling unlike fp16. However, if you use [gradient accumulation](#gradient-accumulation) with bf16, gradients are accumulated in bf16 which may not be desirable because the lower precision can lead to lossy accumulation.
bf16 can be set up in the config file or enabled from the command line when the following arguments are passed: `--bf16` or `--bf16_full_eval`.
```json
{
  "bf16": {
    "enabled": "auto"
@ -514,7 +503,7 @@ DeepSpeed offers several [optimizers](https://www.deepspeed.ai/docs/config-json/
You can set the parameters to `"auto"` or manually input your own values.
```json
{
  "optimizer": {
    "type": "AdamW",
@ -530,7 +519,7 @@ You can set the parameters to `"auto"` or manually input your own values.
Use an unsupported optimizer by adding the following to the top level configuration.
```json
{
  "zero_allow_untested_optimizer": true
}
@ -538,7 +527,7 @@ Use an unsupported optimizer by adding the following to the top level configurat
From DeepSpeed 0.8.3+, if you want to use offload, you'll also need to add the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer.
```json
{
  "zero_force_ds_cpu_optimizer": false
}
@ -558,7 +547,7 @@ If you don't configure the scheduler in the config file, [`Trainer`] automatical
You can set the parameters to `"auto"` or manually input your own values.
```json
{
  "scheduler": {
    "type": "WarmupDecayLR",
@ -581,7 +570,7 @@ You can set the parameters to `"auto"` or manually input your own values.
Resume training with a Universal checkpoint by setting `load_universal` to `true` in the config file.
```json
{
  "checkpoint": {
    "load_universal": true
@ -604,7 +593,7 @@ To deploy DeepSpeed on multiple GPUs, add `--num_gpus`. You don't need to add `-
A multi-node setup consists of multiple nodes, where each node has one or more GPUs running a workload. DeepSpeed expects a shared storage system, but if this is not the case, you need to adjust the config file to include a [checkpoint](https://www.deepspeed.ai/docs/config-json/#checkpoint-options) to allow loading without access to a shared filesystem.
```json
{
  "checkpoint": {
    "use_node_local_storage": true
@ -824,7 +813,7 @@ ZeRO-2 saves the model weights in fp16. To save the weights in fp16 for ZeRO-3,
If you don't, [`Trainer`] won't save the weights in fp16 and won't create a `pytorch_model.bin` file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights, so you won't be able to load it.
```json
{
  "zero_optimization": {
    "stage": 3,
@ -986,7 +975,7 @@ NaN loss often occurs when a model is pretrained in bf16 and you try to use it w
It is also possible that fp16 is causing overflow. For example, if your config file looks like the one below, you may see the following overflow errors in the logs.
@ -16,44 +16,17 @@ rendered properly in your Markdown viewer.
# ExecuTorch
[ExecuTorch](https://pytorch.org/executorch/stable/index.html) runs PyTorch models on mobile and edge devices. Export your Transformers models to the ExecuTorch format with [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) with the command below.
```bash
optimum-cli export executorch \
--model "HuggingFaceTB/SmolLM2-135M-Instruct" \
--task "text-generation" \
--recipe "xnnpack" \
--use_custom_sdpa \
--use_custom_kv_cache \
--qlinear 8da4w \
--qembedding 8w \
--output_dir="hf_smollm2"
```
Run `optimum-cli export executorch --help` to see all export options. For detailed export instructions, check the [README](optimum/exporters/executorch/README.md).
If the custom method has pinned Python requirements that your environment doesn't meet, you'll get an exception about missing requirements. For instance, [transformers-community/custom_generate_bad_requirements](https://huggingface.co/transformers-community/custom_generate_bad_requirements) has an impossible set of requirements defined in its `custom_generate/requirements.txt` file, and you'll see the error message below if you try to run it.
```text
ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
foo (installed: None)
bar==0.0.0 (installed: None)
@ -301,6 +306,7 @@ Updating your Python requirements accordingly will remove this error message.
### Creating a custom generation method
To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
1. The model you've designed your generation method with.
2. `custom_generate/generate.py`, which contains all the logic for your custom generation method (see the sketch after this list).
3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
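A minimal sketch of what `custom_generate/generate.py` could contain, assuming the convention that the file defines a `generate` function whose first argument is the model (check an existing `custom_generate` repository for the exact contract):

```py
# custom_generate/generate.py -- a greedy-decoding sketch, for illustration only
import torch

def generate(model, input_ids, max_new_tokens=20, **kwargs):
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```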
@ -308,7 +314,7 @@ To create a new generation method, you need to create a new [**Model**](https://
After you've added all required files, your repository should look like this
Follow the recommended practices below to ensure your custom generation method works as expected.
- Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
- Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
- Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
@ -399,7 +406,7 @@ The root level `README.md` in the model repository usually describes the model t
For discoverability, we highly recommend you to add the `custom_generate` tag to your repository. To do so, the top of your `README.md` file should look like the example below. After you push the file, you should see the tag in your repository!
```text
---
library_name: transformers
tags:
@ -410,13 +417,14 @@ tags:
```
Recommended practices:
- Document input and output differences in [`~GenerationMixin.generate`].
- Add self-contained examples to enable quick experimentation.
- Describe soft-requirements such as if the method only works well with a certain family of models.
### Reusing `generate`'s input preparation
If you're adding a new decoding loop, you might want to preserve the input preparation present in `generate` (batch expansion, attention masks, logits processors, stopping criteria, etc.). You can also pass a **callable** to `custom_generate` to reuse [`~GenerationMixin.generate`]'s full preparation pipeline while overriding only the decoding loop.
> If you publish a `custom_generate` repository, your `generate` implementation can itself define a callable and pass it to `model.generate()`. This lets you customize the decoding loop while still benefiting from Transformers' built-in input preparation logic.
### Finding custom generation methods
You can find all custom generation methods by [searching for the `custom_generate` tag](https://huggingface.co/models?other=custom_generate). In addition to the tag, we curate two collections of `custom_generate` methods:
- [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community;
- [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`.
@ -185,9 +185,9 @@ See the [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/
The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example:
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
@ -15,15 +15,12 @@ rendered properly in your Markdown viewer.
# Hyperparameter search
Hyperparameter search discovers an optimal set of hyperparameters that produces the best model performance. [`Trainer`] supports several hyperparameter search backends - [Optuna](https://optuna.readthedocs.io/en/stable/index.html), [Weights & Biases](https://docs.wandb.ai/), [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) - through [`~Trainer.hyperparameter_search`] to optimize an objective or even multiple objectives.
This guide will go over how to set up a hyperparameter search for each of the backends.
> [!WARNING]
> [SigOpt](https://github.com/sigopt/sigopt-server) is in public archive mode and is no longer actively maintained. Try using Optuna, Weights & Biases or Ray Tune instead.
```bash
pip install optuna/wandb/ray[tune]
```
To use [`~Trainer.hyperparameter_search`], you need to create a `model_init` function. This function includes basic model information (arguments and configuration) because it needs to be reinitialized for each search trial in the run.
[`Trainer`] only supports hyperparameter search for distributed data parallel (DDP) on the Optuna backend. Only the rank-zero process is used to generate the search trial, and the resulting parameters are passed along to the other ranks.
@ -34,6 +34,8 @@ There are over 1M+ Transformers [model checkpoints](https://huggingface.co/model
Explore the [Hub](https://huggingface.com/) today to find a model and use Transformers to help you get started right away.
Explore the [Models Timeline](./models_timeline) to discover the latest text, vision, audio and multimodal model architectures in Transformers.
## Features
Transformers provides everything you need for inference or training with state-of-the-art pretrained models. Some of the main features include:
@ -60,4 +62,4 @@ Transformers is designed for developers and machine learning engineers and resea
## Learn
If you're new to Transformers or want to learn more about transformer models, we recommend starting with the [LLM course](https://huggingface.co/learn/llm-course/chapter1/1?fw=pt). This comprehensive course covers everything from the fundamentals of how transformer models work to practical applications across various tasks. You'll learn the complete workflow, from curating high-quality datasets to fine-tuning large language models and implementing reasoning capabilities. The course contains both theoretical and hands-on exercises to build a solid foundational knowledge of transformer models as you learn.
@ -193,28 +196,6 @@ A [`StoppingCriteria`] can be used to change when to stop generation (other than
[[autodoc]] EosTokenCriteria
- __call__
## Constraints
A [`Constraint`] can be used to force the generation to include specific tokens or sequences in the output. Please note that this is exclusively available to our PyTorch implementations.
[[autodoc]] Constraint
[[autodoc]] PhrasalConstraint
[[autodoc]] DisjunctiveConstraint
[[autodoc]] ConstraintListState
## BeamSearch
[[autodoc]] BeamScorer
- process
- finalize
[[autodoc]] ConstrainedBeamSearchScorer
- process
- finalize
## Streamers
[[autodoc]] TextStreamer
@ -270,19 +251,19 @@ A [`Constraint`] can be used to force the generation to include specific tokens
- batch_select_indices
[[autodoc]] DynamicCache
- to_legacy_cache
- from_legacy_cache
[[autodoc]] StaticCache
[[autodoc]] QuantizedCache
[[autodoc]] EncoderDecoderCache
[[autodoc]] QuantoQuantizedCache
[[autodoc]] HQQQuantizedCache
[[autodoc]] OffloadedCache
[[autodoc]] StaticCache
[[autodoc]] OffloadedStaticCache
[[autodoc]] HybridCache
@ -291,10 +272,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
@ -218,9 +218,9 @@ path reference to the associated `.safetensors` file. Each tensor is written to
the state dictionary. File names are constructed using the `module_path` as a prefix with a few possible postfixes that
are built recursively.
* Module inputs are denoted with `_inputs` and outputs with `_outputs`.
* `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
* `dict` instances will be postfixed with `_{key}`.
### Comparing between implementations
@ -255,6 +255,7 @@ how many tests are being skipped and for which models.
When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip?
This utility:
- scans all test_modeling_common methods
- looks for times where a method is skipped
- returns a summary json you can load as a DataFrame/inspect
- blip_2: Inputs_embeds is tested in individual model tests
- bridgetower:
- canine: CANINE does not have a get_input_embeddings() method.
- ...
📄 JSON saved to /home/pablo/git/transformers/scan_test_inputs_embeds.json
```
## Modular model detector
### Code similarity analyzer - for model adders
This utility analyzes code similarities between model implementations to identify opportunities for modularization. It compares a new or existing modeling file against all models in the library using embedding-based and token-based similarity metrics.
### Rationale
When adding a new model to transformers, many components (attention layers, MLPs, outputs, etc.) may already exist in similar form in other models. Instead of implementing everything from scratch, model adders can identify which existing classes are similar and potentially reusable through modularization.
The tool computes two similarity scores:
- **Embedding score**: Uses semantic code embeddings (via `Qwen/Qwen3-Embedding-4B`) to detect functionally similar code even with different naming
- **Jaccard score**: Measures token set overlap to identify structurally similar code patterns (illustrated in the sketch below)
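For intuition only, a token-set Jaccard similarity between two snippets can be computed along these lines; this is an illustrative sketch, not the tool's actual implementation:

```py
import re

def jaccard(code_a: str, code_b: str) -> float:
    # split on non-word characters to get a rough token set for each snippet
    tokens_a = set(re.findall(r"\w+", code_a))
    tokens_b = set(re.findall(r"\w+", code_b))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(jaccard("def forward(self, x): return self.proj(x)",
              "def forward(self, hidden): return self.proj(hidden)"))
```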
- `--modeling-file`: Path to the modeling file to analyze
- `--build`: Build the code similarity index from all modeling files in `src/transformers/models/`
- `--push-new-index`: After building, push the index to a Hub dataset (requires `--build`)
- `--hub-dataset`: Hub dataset repository ID to pull/push the index (default: `hf-internal-testing/transformers_code_embeddings`)
### Limitations
This tool requires GPU/CPU resources to run the embedding model (`Qwen/Qwen3-Embedding-4B`). The pre-built index is downloaded from the Hub by default, which requires an internet connection on first use.
Results are suggestions based on code similarity and should be manually reviewed before modularization. High similarity scores don't guarantee perfect compatibility.
To conclude this example, let's look into a more advanced use-case. If you have a beefy machine to serve models with, but prefer using Jan on a different device, you need to add port forwarding. If you have `ssh` access from your Jan machine into your server, this can be accomplished by typing the following in your Jan machine's terminal
@ -207,13 +208,13 @@ Some models have a unique way of storing past kv pairs or states that is not com
Mamba models, such as [Mamba](./model_doc/mamba), require a specific cache because the model doesn't have an attention mechanism or kv states. Thus, they are not compatible with the above [`Cache`] classes.
## Iterative generation
A cache can also work in iterative generation settings where there is back-and-forth interaction with a model (chatbots). Like regular generation, iterative generation with a cache allows a model to efficiently handle ongoing conversations without recomputing the entire context at each step.
For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating).
The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you're using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.
For example, some models use special `<think> ... </think>` tokens during reasoning. These could get lost during re-encoding, causing indexing issues. You might need to manually remove or adjust extra tokens from the completions to keep things stable.
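A condensed sketch of such a loop is shown below; it assumes the Llama-2 chat checkpoint mentioned above and a recent Transformers release where [`~GenerationMixin.generate`] accepts a `past_key_values` cache object.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

past_key_values = DynamicCache()
messages = []
for prompt in ["Hello, what's your name?", "And what can you do?"]:
    # keep the full dialogue history and re-apply the chat template each turn
    messages.append({"role": "user", "content": prompt})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(model.device)
    # the cache is reused across turns so the shared context isn't recomputed
    outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=64)
    completion = tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": completion})
```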
Another option for using [`StaticCache`] is to pass it to a model's forward pass using the same `past_key_values` argument. This allows you to write your own custom decoding function to decode the next token given the current token, position, and cache position of previously generated tokens.
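A bare-bones sketch of this pattern follows; the [`StaticCache`] constructor arguments have shifted across releases, so treat the exact keyword names as an assumption and check the current API reference.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt")
input_ids = inputs["input_ids"]
seq_len = input_ids.shape[1]
max_new_tokens = 20

# pre-allocate fixed-size key/value slots (keyword names may differ across versions)
cache = StaticCache(config=model.config, max_batch_size=1, max_cache_len=seq_len + max_new_tokens)
cache_position = torch.arange(seq_len)

with torch.no_grad():
    # prefill: write the prompt's key/value states into their fixed slots
    logits = model(input_ids, past_key_values=cache, cache_position=cache_position, use_cache=True).logits
    generated = [logits[:, -1].argmax(dim=-1, keepdim=True)]

    # decode one token at a time, advancing the cache position by one each step
    for _ in range(max_new_tokens - 1):
        cache_position = cache_position[-1:] + 1
        logits = model(generated[-1], past_key_values=cache, cache_position=cache_position, use_cache=True).logits
        generated.append(logits[:, -1].argmax(dim=-1, keepdim=True))

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```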
For speculative sampling decoding, add the [do_sample](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.do_sample) and [temperature](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.temperature) parameters to [`~GenerationMixin.generate`].
For prompt lookup decoding with sampling, add the [do_sample](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.do_sample) and [temperature](https://hf.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.temperature) parameters to [`~GenerationMixin.generate`].
[`~GenerationMixin.generate`] can also be extended with external libraries or custom code:
1. the `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution;
2. the `stopping_criteria` parameters supports custom [`StoppingCriteria`] to stop text generation;
3. other custom generation methods can be loaded through the `custom_generate` flag ([docs](generation_strategies.md/#custom-decoding-methods)).
@ -258,11 +259,11 @@ Some models and tasks expect a certain input prompt format, and if the format is
For example, a chat model expects the input as a [chat template](./chat_templating). Your prompt should include a `role` and `content` to indicate who is participating in the conversation. If you try to pass your prompt as a single string, the model doesn't always return the expected output.
@ -16,18 +16,18 @@ rendered properly in your Markdown viewer.
Large Language Models (LLMs) such as GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), and [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf) are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries.
Deploying these models in real-world tasks remains challenging, however:
- To exhibit near-human text understanding and generation capabilities, LLMs currently require to be composed of billions of parameters (see [Kaplan et al](https://huggingface.co/papers/2001.08361), [Wei et. al](https://huggingface.co/papers/2206.07682)). This consequently amplifies the memory demands for inference.
- In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.
The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.
In this guide, we will go over the effective techniques for efficient LLM deployment:
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance.
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
3. **Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are [Alibi](https://huggingface.co/papers/2108.12409), [Rotary embeddings](https://huggingface.co/papers/2104.09864), [Multi-Query Attention (MQA)](https://huggingface.co/papers/1911.02150) and [Grouped-Query-Attention (GQA)](https://huggingface.co/papers/2305.13245).
Throughout this guide, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements.
@ -37,22 +37,22 @@ Memory requirements of LLMs can be best understood by seeing the LLM as a set of
At the time of writing this guide, LLMs consist of at least a couple billion parameters. Each parameter thereby is made of a decimal number, e.g. `4.5689` which is usually stored in either [float32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), or [float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) format. This allows us to easily compute the memory requirement to load the LLM into memory:
> *Loading the weights of a model having X billion parameters requires roughly 4 \* X GB of VRAM in float32 precision*
Nowadays, models are however rarely trained in full float32 precision, but usually in bfloat16 precision or less frequently in float16 precision. Therefore the rule of thumb becomes:
> *Loading the weights of a model having X billion parameters requires roughly 2 \* X GB of VRAM in bfloat16/float16 precision*
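For instance, applying the rule of thumb to a couple of the model sizes mentioned in this guide (plain arithmetic, no library calls):

```py
# rule of thumb: ~4 bytes/parameter in float32, ~2 bytes/parameter in bfloat16/float16
for name, billions in [("Falcon-40b", 40), ("Llama-2-70b", 70)]:
    print(f"{name}: ~{4 * billions} GB in float32, ~{2 * billions} GB in bfloat16")
```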
For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM.
To give some examples of how much VRAM it roughly takes to load a model in bfloat16:
As of writing this document, the largest GPU chips on the market are the A100 and H100, offering 80GB of VRAM. Most of the models listed before require more than 80GB just to be loaded and therefore necessarily require [tensor parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#tensor-parallelism) and/or [pipeline parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism).
@ -100,7 +100,7 @@ result
**Output**:
```text
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
@ -169,11 +169,11 @@ All that matters is that the next token *logit* distribution stays roughly the s
There are various quantization techniques, which we won't discuss in detail here, but in general, all quantization techniques work as follows:
- 1. Quantize all weights to the target precision
- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
- 3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision
In a nutshell, this means that *inputs-weight matrix* multiplications, with $X$ being the *inputs*, $W$ being a weight matrix and $Y$ being the output:
$$ Y = X * W $$
@ -194,7 +194,7 @@ the [`bitsandbytes`](https://github.com/bitsandbytes-foundation/bitsandbytes) li
We can then load models in 8-bit quantization by simply adding a `load_in_8bit=True` flag to `from_pretrained`.
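As a sketch, using the OctoCoder checkpoint from this section (requires `bitsandbytes`; newer releases may prefer an explicit `BitsAndBytesConfig` passed via `quantization_config`):

```py
from transformers import AutoModelForCausalLM

# shortcut flag for bitsandbytes 8-bit quantization; on recent versions you can
# instead pass quantization_config=BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_8bit=True, device_map="auto")
```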
Now, let's run our example again and measure the memory usage.
@ -208,7 +208,7 @@ result
**Output**:
```text
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
Let's see what peak GPU memory consumption 4-bit quantization gives. Quantizing the model to 4-bit can be done with the same API as before - this time by passing `load_in_4bit=True` instead of `load_in_8bit=True`.
Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument
@ -271,7 +271,7 @@ Just 9.5GB! That's really not a lot for a >15 billion parameter model.
While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out.
Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to $\text{quantize}$ and $\text{dequantize}$ taking longer during inference.
```python
del model
@ -300,279 +300,117 @@ Next, let's look into how we can improve computational and memory efficiency by
Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers.
Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens.
However, the peak GPU memory consumption for self-attention layers grows *quadratically* both in compute and memory complexity with number of input tokens (also called *sequence length*) that we denote in the following by $N$ .
While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens).
Let's take a closer look. The formula to compute the output $\mathbf{O}$ of a self-attention layer for an input $\mathbf{X}$ of length $N$ is:
$\mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N})$ is thereby the input sequence to the attention layer. The projections $\mathbf{Q}$ and $\mathbf{K}$ will each consist of $N$ vectors resulting in the $\mathbf{QK}^T$ being of size $N^2$ .
LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel.
Assuming the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the $\mathbf{QK^T}$ matrices to be $40 * 2 * N^2$ bytes. For $N=1000$ only around 50 MB of VRAM are needed, however, for $N=16000$ we would need 19 GB of VRAM, and for $N=100,000$ we would need almost 1TB just to store the $\mathbf{QK}^T$ matrices.
Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts.
As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths.
How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the $\mathbf{QK}^T$ matrix. [Tri Dao et al.](https://huggingface.co/papers/2205.14135) developed exactly such a new algorithm and called it **Flash Attention**.
In a nutshell, Flash Attention breaks the $\mathbf{V} \times \text{Softmax}(\mathbf{QK}^T)$ computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:
with $s^a_{ij}$ and $s^b_{ij}$ being some softmax normalization statistics that need to be recomputed for every $i$ and $j$ .
Please note that the whole Flash Attention is a bit more complex and is greatly simplified here as going in too much depth is out of scope for this guide. The reader is invited to take a look at the well-written [Flash Attention paper](https://huggingface.co/papers/2205.14135) for more details.
The main takeaway here is:
> By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives **numerically identical** outputs compared to the default self-attention layer at a memory cost that only increases linearly with $N$.
Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. Indeed Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see [paper](https://huggingface.co/papers/2205.14135) for more details if interested)
> However, Flash Attention is much faster in inference compared to default attention which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM).
Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast *on-chip* SRAM memory instead of having to access the slower VRAM memory to compute the output vector $\mathbf{O}$.
In practice, there is currently absolutely no reason to **not** use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.
Let's look at a practical example.
Our OctoCoder model now gets a significantly longer input prompt which includes a so-called *system prompt*. System prompts are used to steer the LLM into a better assistant that is tailored to the user's task.
In the following, we use a system prompt that will make OctoCoder a better coding assistant.
```python
system_prompt="""Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
That said, the assistant is practical, really does its best, and doesn't let caution get too much in the way of being useful.
The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
-----
Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.
Answer: Sure. Here is a function that does that.
def alternating(list1, list2):
    results = []
    for i in range(len(list1)):
        results.append(list1[i])
        results.append(list2[i])
    return results
Question: Can you write some test cases for this function?
Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
Answer: Here is the modified function.
def alternating(list1, list2):
    results = []
    for i in range(min(len(list1), len(list2))):
        results.append(list1[i])
        results.append(list2[i])
    if len(list1) > len(list2):
        results.extend(list1[i+1:])
    else:
        results.extend(list2[i+1:])
    return results
-----
"""
```
For demonstration purposes, we duplicate the system prompt ten times so that the input length is long enough to observe Flash Attention's memory savings.
We append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"`.
```python
long_prompt=10*system_prompt+prompt
```
We instantiate our model again in bfloat16 precision.
print(f"Generated in {time.time()-start_time} seconds.")
result
```
**Output**:
```
Generated in 10.96854019165039 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
```
We're getting the same output as before; however, this time the model repeats the answer until it hits the 60-token cut-off. This is not surprising as we've repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself.
**Note** that the system prompt should not be repeated ten times in real-world applications - one time is enough!
As we can see, the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. The generation also takes noticeably longer now.
We call `flush()` to free GPU memory for our next experiment.
```python
flush()
```
For comparison, let's run the same function, but enable Flash Attention instead.
To do so, we convert the model to [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview), which enables PyTorch's [SDPA self-attention](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) and, in turn, Flash Attention.
```python
model.to_bettertransformer()
```
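Optionally, you can also pin PyTorch's SDPA to the Flash Attention backend while generating. The snippet below is a sketch that assumes PyTorch 2.x (the context manager has since been superseded by `torch.nn.attention.sdpa_kernel` in newer releases) and reuses the `pipe` and `long_prompt` objects from the snippets above:

```python
import torch

# Force the Flash Attention backend of scaled_dot_product_attention during generation.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
```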
Now we run the exact same code snippet as before and under the hood Transformers will make use of Flash Attention.
print(f"Generated in {time.time()-start_time} seconds.")
result
```
**Output**:
```
Generated in 3.0211617946624756 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
```
We're getting the exact same result as before, but can observe a very significant speed-up thanks to Flash Attention.
Let's measure the memory consumption one last time.
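A minimal way to check the peak memory is sketched below with a locally defined helper (the guide's own measurement code is not shown here):

```python
import torch

def bytes_to_giga_bytes(num_bytes):
    return num_bytes / 1024 / 1024 / 1024

# Peak GPU memory allocated since the last call to torch.cuda.reset_peak_memory_stats()
print(f"{bytes_to_giga_bytes(torch.cuda.max_memory_allocated()):.2f} GB")
```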
And we're almost back to our original 29GB peak GPU memory from the beginning.
We can observe that we only use roughly 100MB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequence as done in the beginning.
```py
flush()
```
For more information on how to use Flash Attention, please have a look at [this doc page](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#flashattention-2).
## 3. Architectural Innovations
So far we have looked into improving computational and memory efficiency by:
- Casting the weights to a lower precision format
- Replacing the self-attention algorithm with a more memory- and compute efficient version
Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for tasks that require long text inputs, *e.g.*:
- Retrieval augmented Question Answering,
- Summarization,
- Chat
Note that *chat* not only requires the LLM to handle long text inputs, but it also necessitates that the LLM is able to efficiently handle the back-and-forth dialogue between user and assistant (such as ChatGPT).
Once trained, the fundamental LLM architecture is difficult to change, so it is important to make considerations about the LLM's tasks beforehand and accordingly optimize the model's architecture.
There are two important components of the model architecture that quickly become memory and/or performance bottlenecks for large input sequences.
- The positional embeddings
- The key-value cache
Let's go over each component in more detail.
### 3.1 Improving positional embeddings of LLMs
Self-attention puts each token in relation to all other tokens.
As an example, the $\text{Softmax}(\mathbf{QK}^T)$ matrix of the text input sequence *"Hello", "I", "love", "you"* could look as follows:
Each word token is given a probability mass with which it attends to all other word tokens and is therefore put into relation with all of them. E.g., the word *"love"* attends to the word *"Hello"* with 5%, to *"I"* with 30%, and to itself with 65%.
An LLM based on self-attention, but without position embeddings, would have great difficulties in understanding the positions of the text inputs relative to each other.
This is because the probability score computed by $\mathbf{QK}^T$ relates each word token to each other word token in $O(1)$ computations regardless of their relative positional distance to each other.
Therefore, for the LLM without position embeddings each token appears to have the same distance to all other tokens, *e.g.* differentiating between *"Hello I love you"* and *"You love I hello"* would be very challenging.
For the LLM to understand sentence order, an additional *cue* is needed and is usually applied in the form of *positional encodings* (also called *positional embeddings*).
Positional encodings encode the position of each token into a numerical representation that the LLM can leverage to better understand sentence order.
The authors of the [*Attention Is All You Need*](https://huggingface.co/papers/1706.03762) paper introduced sinusoidal positional embeddings $\mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N$, where each vector $\mathbf{p}_i$ is computed as a sinusoidal function of its position $i$.
The positional encodings are then simply added to the input sequence vectors $\mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N = \mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N$, thereby cueing the model to better learn sentence order.
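As a quick illustration, here is a sketch of how such sinusoidal embeddings could be computed (a hypothetical helper, not the exact implementation of any particular model):

```python
import math
import torch

def sinusoidal_positional_embeddings(seq_len: int, dim: int) -> torch.Tensor:
    # p_i[2k] = sin(i / 10000^(2k/dim)), p_i[2k+1] = cos(i / 10000^(2k/dim))
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(positions * freqs)
    pe[:, 1::2] = torch.cos(positions * freqs)
    return pe

# hidden_states = token_embeddings + sinusoidal_positional_embeddings(seq_len, hidden_dim)
```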
Instead of using fixed position embeddings, others (such as [Devlin et al.](https://huggingface.co/papers/1810.04805)) used learned positional encodings for which the positional embeddings
$\mathbf{P}$ are learned during training.
Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found:
1. Sinusoidal and learned position embeddings are both absolute positional embeddings, *i.e.* encoding a unique embedding for each position id: $0, \ldots, N$. As shown by [Huang et al.](https://huggingface.co/papers/2009.13658) and [Su et al.](https://huggingface.co/papers/2104.09864), absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.
2. When using learned position embeddings, the LLM has to be trained on a fixed input length $N$, which makes it difficult to extrapolate to an input length longer than what it was trained on.
Recently, relative positional embeddings that can tackle the above-mentioned problems have become more popular, most notably:
- [Rotary Position Embedding (RoPE)](https://huggingface.co/papers/2104.09864)
- [ALiBi](https://huggingface.co/papers/2108.12409)
Both *RoPE* and *ALiBi* argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the $\mathbf{QK}^T$ computation.
Without going into too many details, *RoPE* notes that positional information can be encoded into query-key pairs, *e.g.* $\mathbf{q}_i$ and $\mathbf{x}_j$, by rotating each vector by an angle $\theta * i$ and $\theta * j$, respectively, with $i, j$ describing each vector's sentence position:
$$ \mathbf{\hat{q}}_i^T \mathbf{\hat{x}}_j = \mathbf{{q}}_i^T \mathbf{R}_{\theta, i -j} \mathbf{{x}}_j. $$
$\mathbf{R}_{\theta, i - j}$ thereby represents a rotational matrix. $\theta$ is *not* learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.
> By doing so, the probability score between $\mathbf{q}_i$ and $\mathbf{x}_j$ is only affected if $i \ne j$ and solely depends on the relative distance $i - j$ regardless of each vector's specific positions $i$ and $j$.
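A toy sketch of this property for 2-dimensional query/key vectors (illustrative only; real RoPE rotates pairs of dimensions of high-dimensional heads with position-dependent frequencies):

```python
import torch

def rotate(vec: torch.Tensor, angle: float) -> torch.Tensor:
    cos, sin = torch.cos(torch.tensor(angle)), torch.sin(torch.tensor(angle))
    return torch.stack([torch.stack([cos, -sin]), torch.stack([sin, cos])]) @ vec

theta = 0.1
q, k = torch.tensor([1.0, 0.0]), torch.tensor([0.5, 0.5])
# Positions (5, 2) and (9, 6) have the same relative distance, so the scores match.
score_a = rotate(q, theta * 5) @ rotate(k, theta * 2)
score_b = rotate(q, theta * 9) @ rotate(k, theta * 6)
print(torch.allclose(score_a, score_b))  # True: only i - j matters
```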
*RoPE* is used in many of today's most important LLMs, such as:
As an alternative, *ALiBi* proposes a much simpler relative position encoding scheme. The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value `m` to each query-key entry of the $\mathbf{QK}^T$ matrix right before the softmax computation.

As shown in the [ALiBi](https://huggingface.co/papers/2108.12409) paper, this simple relative positional encoding allows the model to retain good performance even for very long text input sequences.
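A minimal sketch of the bias term for a single attention head (the slope value `m` is just an illustrative choice; in practice each head gets its own slope from a geometric sequence):

```python
import torch

seq_len, m = 5, 0.25                      # m: head-specific slope
i = torch.arange(seq_len).unsqueeze(1)    # query positions
j = torch.arange(seq_len).unsqueeze(0)    # key positions
alibi_bias = -m * (i - j).clamp(min=0).float()
# attention_scores = q @ k.T / sqrt(head_dim) + alibi_bias  (before the causal mask and softmax)
print(alibi_bias)
```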
*ALiBi* is used in many of today's most important LLMs, such as:
Both *RoPE* and *ALiBi* position encodings can extrapolate to input lengths not seen during training, though it has been shown that extrapolation works much better out-of-the-box for *ALiBi* than for *RoPE*.
For ALiBi, one simply increases the values of the lower triangular position matrix to match the length of the input sequence.
For *RoPE*, keeping the same $\theta$ that was used during training leads to poor results when passing text inputs much longer than those seen during training, *cf.* [Press et al.](https://huggingface.co/papers/2108.12409). However, the community has found a couple of effective tricks that adapt $\theta$, thereby allowing *RoPE* position embeddings to work well for extrapolated text input sequences (see [here](https://github.com/huggingface/transformers/pull/24653)).
> Both RoPE and ALiBi are relative positional embeddings that are *not* learned during training, but instead are based on the following intuitions:
- Positional cues about the text inputs should be given directly to the $\mathbf{QK}^T$ matrix of the self-attention layer.
- The LLM should be incentivized to learn a constant *relative* distance that positional encodings have to each other.
- The further text input tokens are from each other, the lower their query-key probability should be. Both RoPE and ALiBi lower the query-key probability of tokens far away from each other: RoPE by decreasing the vector product through an increased angle between the query-key vectors, ALiBi by adding large negative numbers to the vector product.
In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. Also note that even if an LLM with RoPE and ALiBi has been trained only on a fixed length of say $N_1 = 2048$ it can still be used in practice with text inputs much larger than $N_1$, like $N_2 = 8192 > N_1$ by extrapolating the positional embeddings.
### 3.2 The key-value cache
**Output**:
```text
shape of input_ids torch.Size([1, 21])
shape of input_ids torch.Size([1, 22])
shape of input_ids torch.Size([1, 23])
```

As we can see, the shape of `input_ids` grows by one at every decoding step because the just sampled token is appended to the text input.
With very few exceptions, LLMs are trained using the [causal language modeling objective](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling) and therefore mask the upper triangle of the attention score matrix - this is why in the two diagrams above the attention scores are left blank (*a.k.a.* have 0 probability). For a quick recap on causal language modeling you can refer to the [*Illustrated Self Attention blog*](https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention).
As a consequence, tokens *never* depend on later tokens; more specifically, the $\mathbf{q}_i$ vector is never put in relation with any key-value vectors $\mathbf{k}_j, \mathbf{v}_j$ if $j > i$. Instead, $\mathbf{q}_i$ only attends to the previous key-value vectors $\mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{, for } m \in \{0, \ldots, i - 1\}$. In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous time steps.
print("length of key-value cache",len(past_key_values[0][0]))# past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim]
print("length of key-value cache",past_key_values.get_seq_length())# past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim]
> Making use of the key-value cache means that the $\mathbf{QK}^T$ computation is essentially reduced to $\mathbf{q}_c\mathbf{K}^T$ with $\mathbf{q}_c$ being the query projection of the currently passed input token, which is *always* just a single vector.
> One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation). We have an entire guide dedicated to caches [here](./kv_cache).
Note that, despite our advice to use key-value caches, your LLM output may be slightly different when you use them.
The key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. Let's look at an example.
```text
User: How many people live in France?
Assistant: Roughly 75 million people live in France
User: And how many are in Germany?
Assistant: Germany has ca. 81 million inhabitants
```
In this chat, the LLM runs auto-regressive decoding twice:
1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"` and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
2. The second time the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the input prompt only consists of `"User: And how many in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The second Assistant's answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
Two things should be noted here:
1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g. for the example above the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
2. The key-value cache is extremely useful for chat as it allows us to continuously grow the encoded chat history instead of having to re-encode the chat history again from scratch (as e.g. would be the case when using an encoder-decoder architecture).
**Output**:
```text
is a modified version of the function that returns Mega bytes instead.

def bytes_to_megabytes(bytes):

Answer: The function takes a number of bytes as input and returns the number of
```
Great, no additional time is spent recomputing the same key and values for the attention layer! There is however one catch. While the required peak memory for the $\mathbf{QK}^T$ matrix is significantly reduced, holding the key-value cache in memory can become very memory expensive for long input sequences or multi-turn chat. Remember that the key-value cache needs to store the key-value vectors for all previous input vectors $\mathbf{x}_i \text{, for } i \in \{1, \ldots, c - 1\}$ for all self-attention layers and for all attention heads.
Let's compute the number of float values that need to be stored in the key-value cache for the LLM `bigcode/octocoder` that we used before.
The number of float values amounts to two times the sequence length times the number of attention heads times the attention head dimension and times the number of layers.
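A back-of-the-envelope computation for an input sequence of 16,000 tokens is sketched below, assuming the checkpoint's config exposes GPT-2-style attribute names (`n_layer`, `n_head`, `n_embd`):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigcode/octocoder")
head_dim = config.n_embd // config.n_head
# 2 (keys and values) * sequence length * number of layers * number of heads * head dimension
print(2 * 16_000 * config.n_layer * config.n_head * head_dim)
```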
**Output**:
```text
7864320000
```
Researchers have proposed two methods that significantly reduce the memory cost of storing the key-value cache, which are explored in the following subsections.
[Multi-Query-Attention](https://huggingface.co/papers/1911.02150) was proposed in Noam Shazeer's *Fast Transformer Decoding: One Write-Head is All You Need* paper. As the title says, Noam found out that instead of using `n_head` key-value projection weights, one can use a single key-value projection weight pair that is shared across all attention heads without the model's performance significantly degrading.
> By using a single key-value projection weight pair, the key-value vectors $\mathbf{k}_i, \mathbf{v}_i$ have to be identical across all attention heads, which in turn means that we only need to store 1 key-value projection pair in the cache instead of `n_head` ones.
As most LLMs use between 20 and 100 attention heads, MQA significantly reduces the memory consumption of the key-value cache. For the LLM used in this notebook we could therefore reduce the required memory consumption from 15 GB to less than 400 MB at an input sequence length of 16000.
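The rough arithmetic behind that claim, assuming the checkpoint dimensions sketched above (40 layers, 48 heads, head dimension 128, 2 bytes per bfloat16 value):

```python
n_layer, n_head, head_dim, bytes_per_value = 40, 48, 128, 2

full_cache = 2 * 16_000 * n_layer * n_head * head_dim * bytes_per_value
mqa_cache = full_cache // n_head  # a single shared key-value head
print(f"{full_cache / 1024**3:.1f} GB -> {mqa_cache / 1024**2:.0f} MB")
```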
In addition to memory savings, MQA also leads to improved computational efficiency as explained in the following.
In auto-regressive decoding, large key-value vectors need to be reloaded, concatenated with the current key-value vector pair to be then fed into the $\mathbf{q}_c\mathbf{K}^T$ computation at every step. For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look at [Noam's paper](https://huggingface.co/papers/1911.02150).
The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. The peak memory consumption of the model for a single forward pass without key-value cache stays unchanged as every attention head still has a unique query vector so that each attention head still has a different $\mathbf{QK}^T$ matrix.
MQA has seen wide adoption by the community and is now used by many of the most popular LLMs: