Mirror of https://github.com/huggingface/transformers.git, synced 2025-10-20 17:13:56 +08:00
Format empty lines and white space in markdown files. (#41100)
* Remove additional white space and empty lines from markdown files
* Add empty lines around code

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
@@ -38,7 +38,6 @@ In particular all "Please explain" questions or objectively very user-specific f

* "How to train T5 on De->En translation?"

## The GitHub Issues

Everything which hints at a bug should be opened as an [issue](https://github.com/huggingface/transformers/issues).
@@ -247,7 +246,6 @@ You are not required to read the following guidelines before opening an issue. H

Try not to use italics and bold text too much, as these often make the text more difficult to read.

12. If you are cross-referencing a specific comment in a given thread or another issue, always link to that specific comment, rather than using the issue link. If you do the latter, it can be very difficult to find which specific comment you're referring to.

To get the link to a specific comment, do not copy the URL from your browser's location bar; instead, click the `...` icon in the upper right corner of the comment and then select "Copy Link".
@@ -257,7 +255,6 @@ You are not required to read the following guidelines before opening an issue. H
1. https://github.com/huggingface/transformers/issues/9257
2. https://github.com/huggingface/transformers/issues/9257#issuecomment-749945162

13. If you are replying to the last comment, it's totally fine to post your reply with just your comment in it. Readers can follow the information flow here.

But if you're replying to a comment further up the thread, it's good practice to quote just the relevant lines you're replying to. The `>` character is used for quoting, or you can use the menu to do so. For example, your editor box will look like:
@@ -63,7 +63,6 @@ limitations under the License.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>

Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal models, for both inference and training.
@@ -194,7 +193,6 @@ pipeline("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.pn
<details>
<summary>Visual question answering</summary>

<h3 align="center">
<a><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg"></a>
</h3>
@@ -606,4 +606,3 @@ Keywords: BentoML, Framework, Deployment, AI Applications
[LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) offers a user-friendly fine-tuning framework that incorporates PEFT. The repository includes training (fine-tuning) and inference examples for LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, and other LLMs. A ChatGLM version is also available in [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning).

Keywords: PEFT, fine-tuning, LLaMA-2, ChatGLM, Qwen
@@ -69,7 +69,6 @@ CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.
To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):

```bash
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```
@@ -108,7 +107,6 @@ To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
```

You can also control the order of Intel XPUs with:

```bash
@@ -120,7 +118,5 @@ For more information about device enumeration and sorting on Intel XPU, please r
</hfoption>
</hfoptions>

> [!WARNING]
> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
@@ -145,7 +145,6 @@ Arguments can also be passed directly to `@auto_docstring` for more control. Use

The `Returns` and `Examples` parts of the docstring can also be manually specified.

```python
MODEL_COMMON_CUSTOM_ARGS = r"""
    common_arg_1 (`torch.Tensor`, *optional*, defaults to `default_value`):
@@ -202,7 +201,6 @@ There are some rules for documenting different types of arguments and they're li

If a standard argument behaves differently in your model, then you can override it locally in a `r""" """` block. This local definition has a higher priority. For example, the `labels` argument is often customized per model and typically requires overriding.

- New or custom arguments should be documented within an `r""" """` block after the signature if it is a function or in the `__init__` method's docstring if it is a class.

```py
@@ -62,8 +62,6 @@ Refer to the table below to compare how caching improves efficiency.
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` |
| attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |

## Cache class

A basic KV cache interface takes a key and value tensor for the current token and returns the updated `K` and `V` tensors. This is internally managed by a model's `forward` method.
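To make that interface concrete, here is a minimal sketch of updating a [`DynamicCache`] by hand; the tensor shapes are illustrative placeholders rather than values from any specific model.

```py
import torch
from transformers import DynamicCache

cache = DynamicCache()

# Dummy projections standing in for one attention layer's current step
# (illustrative shape: [batch, num_heads, new_tokens, head_dim])
key_states = torch.randn(1, 8, 1, 64)
value_states = torch.randn(1, 8, 1, 64)

# update() appends the new K/V to the cached ones for this layer and returns the full tensors
keys, values = cache.update(key_states, value_states, layer_idx=0)
print(keys.shape)  # the sequence dimension grows as more tokens are cached
```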
@@ -143,7 +141,6 @@ Cache position is used internally for two purposes:

The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache, infer_device
@@ -160,7 +157,6 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)

```

## Legacy cache format

Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].
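As a rough sketch of how the two formats relate, using the `from_legacy_cache`/`to_legacy_cache` helpers listed later in these docs (the tensors below are dummy placeholders):

```py
import torch
from transformers import DynamicCache

# Legacy format: one (key, value) tuple per layer, here a single toy layer
legacy_cache = ((torch.randn(1, 8, 4, 64), torch.randn(1, 8, 4, 64)),)

# Convert to the Cache API and back
cache = DynamicCache.from_legacy_cache(legacy_cache)
roundtrip = cache.to_legacy_cache()
print(type(cache), len(roundtrip))
```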
@@ -29,7 +29,6 @@ the arguments, argument types, and function docstring are parsed in order to gen
Although passing Python functions is very convenient, the parser can only handle [Google-style](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
docstrings. Refer to the examples below for how to format a tool-ready function.

```py
def get_current_temperature(location: str, unit: str):
    """
@@ -103,7 +102,6 @@ Hold the call in the `tool_calls` key of an `assistant` message. This is the rec
> [!WARNING]
> Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.

```py
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
@@ -131,7 +129,6 @@ The temperature in Paris, France right now is 22°C.<|im_end|>
> Although the key in the assistant message is called `tool_calls`, in most cases, models only emit a single tool call at a time. Some older models emit multiple tool calls at the same time, but this is a
> significantly more complex process, as you need to handle multiple tool responses at once and disambiguate them, often using tool call IDs. Please refer to the model card to see exactly what format a model expects for tool calls.

## JSON schemas

Another way to define tools is by passing a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
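For illustration, a JSON schema equivalent of the `get_current_temperature` tool used earlier might look roughly like this; it is a hand-written sketch, and the descriptions and enum values are assumptions rather than a schema generated by the library.

```py
get_current_temperature_schema = {
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Get the current temperature at a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The location, in the format 'City, Country'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    },
}

# Schemas like this can be passed in place of Python functions, e.g. tools=[get_current_temperature_schema]
```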
@@ -43,6 +43,7 @@ chat = [

tokenizer.apply_chat_template(chat, tokenize=False)
```

```md
<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]
```
@@ -62,6 +63,7 @@ chat = [

tokenizer.apply_chat_template(chat, tokenize=False)
```

```md
<|user|>\nHello, how are you?</s>\n<|assistant|>\nI'm doing great. How can I help you today?</s>\n<|user|>\nI'd like to show off how chat templating works!</s>\n
```
@@ -110,6 +112,7 @@ Pass the tokenized chat to [`~GenerationMixin.generate`] to generate a response.
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```

```md
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
@@ -135,6 +138,7 @@ Let's see an example to understand what `add_generation_prompt` is actually doin
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
tokenized_chat
```

```md
<|im_start|>user
Hi there!<|im_end|>
@@ -150,6 +154,7 @@ Now, let's format the same chat with `add_generation_prompt=True`:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
tokenized_chat
```

```md
<|im_start|>user
Hi there!<|im_end|>
@@ -186,7 +191,6 @@ model.generate(**formatted_chat)

[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don’t support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
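A minimal sketch of overriding that behavior; the checkpoint name and the prefill text are illustrative assumptions:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "user", "content": "Write a one-line haiku about the sea."},
    {"role": "assistant", "content": "Salt wind whispers:"},  # prefill to be continued
]
# Explicitly continue the final assistant message instead of starting a new one
out = pipe(messages, continue_final_message=True, max_new_tokens=30)
print(out[0]["generated_text"][-1]["content"])
```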
## Model training

Training a model with a chat template is a good way to ensure the template matches the tokens the model was trained on. Apply the chat template as a preprocessing step to your dataset. Set `add_generation_prompt=False` because the additional tokens to prompt an assistant response aren't helpful during training.
@@ -212,6 +216,7 @@ dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
```

```md
<|user|>
Which is bigger, the moon or the sun?</s>
@@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.

Multimodal chat models accept inputs like images, audio or video, in addition to text. The `content` key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models whose `content` key is a single string.

In the same way the [Tokenizer](./fast_tokenizer) class handles chat templates and tokenization for text-only models,
the [Processor](./processors) class handles preprocessing, tokenization and chat templates for multimodal models. Their [`~ProcessorMixin.apply_chat_template`] methods are almost identical.
@@ -57,7 +56,6 @@ out = pipe(text=messages, max_new_tokens=128)
print(out[0]['generated_text'][-1]['content'])
```

```
Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
```
@@ -69,7 +67,6 @@ Aside from the gradual descent from pirate-speak into modern American English (i
Like [text-only models](./chat_templating), use the [`~ProcessorMixin.apply_chat_template`] method to prepare the chat messages for multimodal models.
This method handles the tokenization and formatting of the chat messages, including images and other media types. The resulting inputs are passed to the model for generation.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
@@ -99,7 +96,6 @@ processed_chat = processor.apply_chat_template(messages, add_generation_prompt=T
print(list(processed_chat.keys()))
```

```
['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
```
@@ -113,7 +109,6 @@ print(processor.decode(out[0]))

The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.

## Video inputs

Some vision models also support video inputs. The message format is very similar to the format for [image inputs](#image-inputs).
@@ -148,6 +143,7 @@ messages = [
```

### Example: Passing decoded video objects

```python
import numpy as np
@@ -167,7 +163,9 @@ messages = [
    },
]
```

You can also use the existing `load_video()` function to load a video, edit it in memory, and pass it in the messages.

```python
# Make sure a video backend library (pyav, decord, or torchvision) is available.
@@ -200,7 +198,6 @@ Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input

The `num_frames` parameter controls how many frames to uniformly sample from the video. Each checkpoint has a maximum frame count it was pretrained with and exceeding this count can significantly lower generation quality. It's important to choose a frame count that fits both the model capacity and your hardware resources. If `num_frames` isn't specified, the entire video is loaded without any frame sampling.

```python
processed_chat = processor.apply_chat_template(
    messages,
@@ -265,4 +262,3 @@ print(processed_chat.keys())

</hfoption>
</hfoptions>
@@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.

A chat template is a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template stored in the tokenizer's [chat_template](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.chat_template) attribute. Jinja is a templating language that allows you to write Python-like code and syntax.

```jinja
{%- for message in messages %}
    {{- '<|' + message['role'] + '|>\n' }}
@@ -108,7 +107,6 @@ We strongly recommend using `-` to ensure only the intended content is printed.

### Special variables and callables

The only constants in a template are the `messages` variable and the `add_generation_prompt` boolean. However, you have
access to **any other keyword arguments that are passed** to the [`~PreTrainedTokenizerBase.apply_chat_template`] method.
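As a small sketch of this, any extra keyword argument shows up as a template variable under the same name; the `system_hint` name below is purely illustrative and only has an effect if the checkpoint's template actually reads it.

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [{"role": "user", "content": "Hi there!"}]

# `system_hint` becomes available inside the Jinja template as {{ system_hint }}
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    system_hint="Answer briefly.",
)
print(text)
```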
@@ -48,7 +48,6 @@ transformers chat -h

The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving.md#serve-cli)).

## TextGenerationPipeline

[`TextGenerationPipeline`] is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).
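A minimal sketch of chat mode, assuming a conversational checkpoint (the model name here is illustrative):

```py
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
messages = [{"role": "user", "content": "Tell me a fun fact about octopuses."}]

# Passing a list of chat messages (instead of a plain string) triggers chat mode
out = pipe(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])
```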
@@ -38,5 +38,3 @@ You are now ready to use your local model in Cursor! For instance, if you toggle
<h3 align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_cursor_chat.png"/>
</h3>
@@ -389,7 +389,6 @@ from .utils import some_function

Only relative imports from the same-level `custom_generate` folder are supported. Parent/sibling folder imports are not valid. The `custom_generate` argument also works locally with any directory that contains a `custom_generate` structure. This is the recommended workflow for developing your custom generation method.

#### requirements.txt

You can optionally specify additional Python requirements in a `requirements.txt` file inside the `custom_generate` folder. These are checked at runtime and an exception will be thrown if they're missing, nudging users to update their environment accordingly.
@@ -19,7 +19,6 @@ rendered properly in your Markdown viewer.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>

Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal models, for both inference and training.
@@ -20,7 +20,6 @@ This page lists all of Transformers general utility functions that are found in

Most of those are only useful if you are studying the general code in the library.

## Enums and namedtuples

[[autodoc]] utils.ExplicitEnum
@@ -65,7 +65,6 @@ values. Here, for instance, it has two keys that are `sequences` and `scores`.

We document all output types here.

[[autodoc]] generation.GenerateDecoderOnlyOutput

[[autodoc]] generation.GenerateEncoderDecoderOutput
@@ -74,13 +73,11 @@ We document here all output types.

[[autodoc]] generation.GenerateBeamEncoderDecoderOutput

## LogitsProcessor

A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for
generation.

[[autodoc]] AlternatingCodebooksLogitsProcessor
    - __call__
@@ -174,8 +171,6 @@ generation.
[[autodoc]] WatermarkLogitsProcessor
    - __call__

## StoppingCriteria

A [`StoppingCriteria`] can be used to change when to stop generation (other than EOS token). Please note that this is exclusively available to our PyTorch implementations.
@@ -300,7 +295,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
    - to_legacy_cache
    - from_legacy_cache

## Watermark Utils

[[autodoc]] WatermarkingConfig
@@ -21,10 +21,8 @@ provides for it.

Most of those are only useful if you are adding new models in the library.

## Model addition debuggers

### Model addition debugger - context manager for model adders

This context manager is a power user tool intended for model adders. It tracks all forward calls within a model forward
@@ -72,7 +70,6 @@ with model_addition_debugger_context(

```

### Reading results

The debugger generates two files from the forward call, both with the same base name, but ending either with
@@ -231,10 +228,8 @@ Once the forward passes of two models have been traced by the debugger, one can
below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly
identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong.



### Limitations and scope

This feature will only work for torch-based models. Models relying heavily on external kernel calls may work, but trace will
@@ -268,7 +263,6 @@ This utility:



### Usage

You can run the skipped test analyzer in two ways:
@@ -20,7 +20,6 @@ This page lists all the utility functions the library provides for pipelines.

Most of those are only useful if you are studying the code of the models in the library.

## Argument handling

[[autodoc]] pipelines.ArgumentHandler
@@ -24,6 +24,7 @@ In Transformers, the [`~GenerationMixin.generate`] API handles text generation,

> [!TIP]
> You can also chat with a model directly from the command line. ([reference](./conversations.md#transformers))
>
> ```shell
> transformers chat Qwen/Qwen2.5-0.5B-Instruct
> ```
@@ -35,6 +36,7 @@ Before you begin, it's helpful to install [bitsandbytes](https://hf.co/docs/bits
```bash
!pip install -U transformers bitsandbytes
```

Bitsandbytes supports multiple backends in addition to CUDA-based GPUs. Refer to the multi-backend installation [guide](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend) to learn more.

Load an LLM with [`~PreTrainedModel.from_pretrained`] and add the following two parameters to reduce the memory requirements.
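The diff cuts off before the example itself; as a rough sketch of the kind of call that sentence refers to (the checkpoint and the specific parameters shown are assumptions based on common bitsandbytes usage, not a quote of the original example):

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # illustrative checkpoint
    device_map="auto",  # spread weights across the available devices
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights via bitsandbytes
)
```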
@@ -154,7 +156,6 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
| `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. |
| `eos_token_id` | `list[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. |
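A short sketch of passing the parameters from the table above to [`~GenerationMixin.generate`]; the checkpoint, prompt, and values are illustrative:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    repetition_penalty=1.2,  # >1.0 discourages the model from repeating itself
    eos_token_id=tokenizer.eos_token_id,  # token(s) that stop generation
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```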
## Pitfalls

The section below covers some common issues you may encounter during text generation and how to solve them.
@@ -66,6 +66,7 @@ If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows
```bash
!pip install transformers accelerate bitsandbytes optimum
```

```python
from transformers import AutoModelForCausalLM
@@ -98,6 +99,7 @@ result
```

**Output**:

```
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
```
@@ -116,6 +118,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```

**Output**:

```bash
29.0260648727417
```
@@ -127,7 +130,6 @@ Note that if we had tried to run the model in full float32 precision, a whopping

If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., dtype=...)`, except when the original type is float32, in which case either `float16` or `bfloat16` can be used for inference.

Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.

```python
@@ -148,6 +150,7 @@ Let's call it now for the next experiment.
```python
flush()
```

From the Accelerate library, you can also use a device-agnostic utility method called [release_memory](https://github.com/huggingface/accelerate/blob/29be4788629b772a3b722076e433b5b3b5c85da3/src/accelerate/utils/memory.py#L63), which takes various hardware backends like XPU, MLU, NPU, MPS, and more into account.

```python
@@ -204,6 +207,7 @@ result
```

**Output**:

```
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
```
@@ -215,6 +219,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```

**Output**:

```
15.219234466552734
```
@@ -222,8 +227,8 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090.
We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference.

We delete the models and flush the memory again.

```python
del model
del pipe
@@ -245,6 +250,7 @@ result
```

**Output**:

```
Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument
```
@@ -256,6 +262,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```

**Output**:

```
9.543574333190918
```
@@ -270,6 +277,7 @@ Also note that inference here was again a bit slower compared to 8-bit quantizat
del model
del pipe
```

```python
flush()
```
@@ -384,6 +392,7 @@ def alternating(list1, list2):
-----
"""
```

For demonstration purposes, we duplicate the system prompt ten times so that the input length is long enough to observe Flash Attention's memory savings.
We append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"`
@@ -413,6 +422,7 @@ result
```

**Output**:

```
Generated in 10.96854019165039 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
@@ -429,6 +439,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```

**Output**:

```bash
37.668193340301514
```
@@ -460,6 +471,7 @@ result
```

**Output**:

```
Generated in 3.0211617946624756 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
@@ -474,6 +486,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```

**Output**:

```
32.617331981658936
```
@@ -604,6 +617,7 @@ generated_text
```

**Output**:

```
shape of input_ids torch.Size([1, 21])
shape of input_ids torch.Size([1, 22])
@@ -641,6 +655,7 @@ generated_text
```

**Output**:

```
shape of input_ids torch.Size([1, 1])
length of key-value cache 20
@@ -712,6 +727,7 @@ tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):]
```

**Output**:

```
is a modified version of the function that returns Mega bytes instead.
@@ -733,6 +749,7 @@ config = model.config
```

**Output**:

```
7864320000
```
@@ -773,7 +790,6 @@ The most notable application of GQA is [Llama-v2](https://huggingface.co/meta-ll

> In conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences, as is the case for chat, for example.

## Conclusion

The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. As an example, one such promising research direction is [speculative decoding](https://huggingface.co/papers/2211.17192) where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. Going into more detail is beyond the scope of this notebook, but you can read more in this [nice blog post](https://huggingface.co/blog/assisted-generation).
@@ -54,7 +54,6 @@ The main class that implements callbacks is [`TrainerCallback`]. It gets the
Trainer's internal state via [`TrainerState`], and can take some actions on the training loop via
[`TrainerControl`].

## Available Callbacks

Here is the list of the available [`TrainerCallback`] in the library:
@@ -24,7 +24,6 @@ Each derived config class implements model specific attributes. Common attribute
`hidden_size`, `num_attention_heads`, and `num_hidden_layers`. Text models further implement:
`vocab_size`.
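As a quick illustration of those common attributes (the checkpoint is an arbitrary example):

```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size, config.num_attention_heads, config.num_hidden_layers, config.vocab_size)
# 768 12 12 30522
```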
## PretrainedConfig

[[autodoc]] PretrainedConfig
@@ -25,7 +25,6 @@ on the formed batch.

Examples of use can be found in the [example scripts](../examples) or [example notebooks](../notebooks).

## Default data collator

[[autodoc]] data.data_collator.default_data_collator
@@ -15,14 +15,12 @@ rendered properly in your Markdown viewer.

-->

# ExecuTorch

[`ExecuTorch`](https://github.com/pytorch/executorch) is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.

ExecuTorch introduces well-defined entry points to perform model, device, and/or use-case specific optimizations such as backend delegation, user-defined compiler transformations, memory planning, and more. The first step in preparing a PyTorch model for execution on an edge device using ExecuTorch is to export the model. This is achieved through the use of a PyTorch API called [`torch.export`](https://pytorch.org/docs/stable/export.html).
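To show just the `torch.export` entry point in isolation, here is a toy sketch that exports a plain `nn.Module` rather than a full Transformers model; exporting real checkpoints typically needs additional configuration.

```py
import torch

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

# torch.export traces the module into an ExportedProgram that downstream
# tools (such as ExecuTorch's lowering pipeline) can consume
exported = torch.export.export(TinyBlock(), (torch.randn(2, 16),))
print(exported)
```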
## ExecuTorch Integration

An integration point is being developed to ensure that 🤗 Transformers can be exported using `torch.export`. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in `ExecuTorch`, particularly for mobile and edge use cases.
@@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.

A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy and PyTorch tensors.

## FeatureExtractionMixin

[[autodoc]] feature_extraction_utils.FeatureExtractionMixin
@ -26,6 +26,7 @@ from transformers import AutoImageProcessor
|
|||||||
|
|
||||||
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
|
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
|
||||||
```
|
```
|
||||||
|
|
||||||
Note that `use_fast` will be set to `True` by default in a future release.
|
Note that `use_fast` will be set to `True` by default in a future release.
|
||||||
|
|
||||||
When using a fast image processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.
|
When using a fast image processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.
|
||||||
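
A short sketch of the `device` argument described above; the checkpoint matches the snippet in this hunk, while the image URL and the availability of a CUDA device are assumptions.

```python
# Sketch: run fast image processing on the GPU via the `device` argument (assumes CUDA is available).
import requests
from PIL import Image
from transformers import AutoImageProcessor

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

inputs = processor(images=image, return_tensors="pt", device="cuda")
print(inputs["pixel_values"].device)  # cuda:0 — processing happened on the GPU
```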
@@ -57,7 +58,6 @@ Here are some speed comparisons between the base and fast image processors for t
These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon.com/ec2/instance-types/g5/), utilizing an NVIDIA A10G Tensor Core GPU.

## ImageProcessingMixin

[[autodoc]] image_processing_utils.ImageProcessingMixin

@@ -72,7 +72,6 @@ These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon
[[autodoc]] image_processing_utils.BaseImageProcessor

## BaseImageProcessorFast

[[autodoc]] image_processing_utils_fast.BaseImageProcessorFast

@@ -55,7 +55,6 @@ logger.info("INFO")
logger.warning("WARN")
```

All the methods of this logging module are documented below, the main ones are
[`logging.get_verbosity`] to get the current level of verbosity in the logger and
[`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least
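
A small sketch of the two functions named above, using the same import style as the rest of this page:

```python
# Sketch: read and change the library-wide verbosity level.
from transformers.utils import logging

print(logging.get_verbosity())  # integer level, e.g. 30 for WARNING
logging.set_verbosity_info()    # equivalent to logging.set_verbosity(logging.INFO)

logger = logging.get_logger("transformers")
logger.info("INFO messages are now visible")
```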
@@ -26,7 +26,6 @@ file or directory, or from a pretrained model configuration provided by the libr
The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`] and [`~generation.GenerationMixin`].

## PreTrainedModel

[[autodoc]] PreTrainedModel

@@ -51,4 +51,3 @@ to export models for different types of topologies or tasks.
### FeaturesManager

[[autodoc]] onnx.features.FeaturesManager

@@ -22,7 +22,6 @@ The `.optimization` module provides:
- several schedules in the form of schedule objects that inherit from `_LRSchedule`:
- a gradient accumulation class to accumulate the gradients of multiple batches

## AdaFactor

[[autodoc]] Adafactor
@@ -47,7 +47,6 @@ However, this is not always the case. Some models apply normalization or subsequ
</Tip>

You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you
will get `None`. Here for instance `outputs.loss` is the loss computed by the model, and `outputs.attentions` is
`None`.
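
A brief sketch of the behaviour described above; the checkpoint is illustrative, and any sequence-classification model shows the same pattern:

```python
# Attributes the model did not return come back as None on the output object.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

outputs = model(**tokenizer("Hello, my dog is cute", return_tensors="pt"))

print(outputs.loss)        # None: no labels were passed, so no loss was computed
print(outputs.attentions)  # None: output_attentions was not requested
print(outputs.logits.shape)
```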
@@ -81,7 +81,6 @@ for out in tqdm(pipe(KeyDataset(dataset, "file"))):
For ease of use, a generator is also possible:

```python
from transformers import pipeline

@@ -196,7 +195,6 @@ This is a occasional very long sentence compared to the other. In that case, the
tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on
bigger batches, the program simply crashes.

```
------------------------------
Streaming no batching
@@ -245,7 +243,6 @@ multiple forward pass of a model. Under normal circumstances, this would yield i
In order to circumvent this issue, both of these pipelines are a bit specific: they are `ChunkPipeline` instead of a
regular `Pipeline`. In short:

```python
preprocessed = pipe.preprocess(inputs)
model_outputs = pipe.forward(preprocessed)
@@ -254,7 +251,6 @@ outputs = pipe.postprocess(model_outputs)
Now becomes:

```python
all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
@@ -282,7 +278,6 @@ If you want to override a specific pipeline.
Don't hesitate to create an issue for your task at hand; the goal of the pipeline is to be easy to use and support most
cases, so `transformers` may be able to support your use case.

If you simply want to try it out, you can:

- Subclass your pipeline of choice
@@ -302,7 +297,6 @@ my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline)
That should enable you to do all the custom code you want.

## Implementing a pipeline

[Implementing a new pipeline](../add_new_pipeline)
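
A sketch of the subclassing route shown in the hunk above. The pipeline class, checkpoint, and the extra field are illustrative, and the exact shape of `postprocess` outputs depends on the pipeline you subclass:

```python
# Hypothetical example: tweak postprocessing by subclassing an existing pipeline
# and plugging it in through the `pipeline_class` argument shown above.
from transformers import TextClassificationPipeline, pipeline

class MyPipeline(TextClassificationPipeline):
    def postprocess(self, model_outputs, **kwargs):
        outputs = super().postprocess(model_outputs, **kwargs)
        # attach an extra, custom field to the regular pipeline output
        if isinstance(outputs, dict):
            outputs["custom_note"] = "overridden postprocess"
        return outputs

my_pipeline = pipeline(
    model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative checkpoint
    pipeline_class=MyPipeline,
)
print(my_pipeline("I love this!"))
```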
@@ -329,7 +323,6 @@ Pipelines available for audio tasks include the following.
    - __call__
    - all

### ZeroShotAudioClassificationPipeline

[[autodoc]] ZeroShotAudioClassificationPipeline

@@ -71,7 +71,6 @@ Additionally, the following method can be used to load values from a data file a
[[autodoc]] data.processors.glue.glue_convert_examples_to_features

## XNLI

[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the

@@ -88,7 +87,6 @@ Please note that since the gold labels are available on the test set, evaluation
An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.

## SQuAD

[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that
@@ -115,11 +113,9 @@ Additionally, the following method can be used to convert SQuAD examples into
[[autodoc]] data.processors.squad.squad_convert_examples_to_features

These processors as well as the aforementioned method can be used with files containing the data as well as with the
*tensorflow_datasets* package. Examples are given below.

### Example usage

Here is an example using the processors as well as the conversion method using data files:
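
The full example follows in the original page; a condensed sketch of the call it describes, assuming SQuAD v2 data files on disk, looks roughly like this:

```python
# Sketch: build features from local SQuAD v2 files with the processors described above.
# The data directory and tokenizer checkpoint are placeholders.
from transformers import AutoTokenizer
from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features

processor = SquadV2Processor()
examples = processor.get_dev_examples("path/to/squad", filename="dev-v2.0.json")

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=False,
)
```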
@@ -50,7 +50,6 @@ several advanced alignment methods which can be used to map between the original
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).

# Multimodal Tokenizer

Apart from that, each tokenizer can be a "multimodal" tokenizer, which means that the tokenizer will hold all relevant special tokens
@@ -22,7 +22,6 @@ The video processor extends the functionality of image processors by allowing Vi
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video-related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM; the processor will try to load video-related configurations from a file named `preprocessing_config.json`.

### Usage Example
Here's an example of how to load a video processor with the [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:
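
The page's own example follows in the full document; the loading step it refers to is simply:

```python
# Load the video processor for the checkpoint named above.
from transformers import AutoVideoProcessor

video_processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
print(type(video_processor).__name__)
```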
@@ -59,7 +58,6 @@ The video processor can also sample video frames using the technique best suited
</Tip>

```python
from transformers import AutoVideoProcessor

@@ -92,4 +90,3 @@ print(processed_video_inputs.pixel_values_videos.shape)
## BaseVideoProcessor

[[autodoc]] video_processing_utils.BaseVideoProcessor

@@ -25,7 +25,6 @@ The abstract from the paper is the following:
*We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.*

This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
The original code can be found [here](https://github.com/apple/ml-aim).

@@ -142,7 +142,6 @@ response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
```

## AriaImageProcessor

[[autodoc]] AriaImageProcessor

@@ -23,7 +23,6 @@ automatically retrieve the relevant model given the name/path to the pretrained
Instantiating one of [`AutoConfig`], [`AutoModel`], and
[`AutoTokenizer`] will directly create a class of the relevant architecture. For instance

```python
model = AutoModel.from_pretrained("google-bert/bert-base-cased")
```

@@ -86,7 +86,6 @@ Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-fe
pip install -U flash-attn --no-build-isolation
```

##### Usage

To load a model using Flash Attention 2, we can pass the `attn_implementation="flash_attention_2"` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:

@@ -97,7 +96,6 @@ model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_i
##### Performance comparison

The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase:

<div style="text-align: center">
@@ -108,7 +106,6 @@ To put this into perspective, on an NVIDIA A100 and when generating 400 semantic
At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.

#### Combining optimization techniques

You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once.
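
A sketch of combining these techniques, following the `from_pretrained` call shown earlier in this diff; it assumes Flash Attention 2 is installed, a CUDA device is available, and Accelerate is present for `enable_cpu_offload`:

```python
# Half precision + Flash Attention 2 + CPU offload, combined as described above.
import torch
from transformers import BarkModel

model = BarkModel.from_pretrained(
    "suno/bark-small",
    dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")

model.enable_cpu_offload()  # offloads idle sub-models to CPU between generation steps
```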
@@ -165,7 +162,6 @@ Bark can generate highly realistic, **multilingual** speech as well as other aud
The model can also produce **nonverbal communications** like laughing, sighing and crying.

```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

@@ -235,4 +231,3 @@ To save the audio, simply take the sample rate from the model config and some sc
[[autodoc]] BarkSemanticConfig
    - all

@@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">

@@ -46,6 +45,7 @@ pipeline = pipeline(
pipeline("Plants create <mask> through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

@@ -31,7 +31,6 @@ You can find all of the original BARThez checkpoints under the [BARThez](https:/
> This model was contributed by [moussakam](https://huggingface.co/moussakam).
> Refer to the [BART](./bart) docs for more usage examples.

The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">

@@ -33,12 +33,9 @@ You can find all the original checkpoints under the [VinAI](https://huggingface.
The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```python
import torch
from transformers import pipeline

@@ -98,8 +95,6 @@ transformers run --task summarization --model vinai/bartpho-word --device 0
</hfoption>
</hfoptions>

## Notes

- BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes.

@@ -81,7 +81,6 @@ API reference information.
</Tip>

## BertJapaneseTokenizer

[[autodoc]] BertJapaneseTokenizer

@@ -26,7 +26,6 @@ rendered properly in your Markdown viewer.
[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it’s pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.

You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.

> [!TIP]

@@ -49,6 +48,7 @@ pipeline = pipeline(
)
pipeline("Plants create <mask> through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

@@ -47,6 +47,7 @@ pipeline = pipeline(
)
pipeline("Plants create [MASK] through a process known as photosynthesis.")
```

</hfoption>
<hfoption id="AutoModel">

@@ -81,6 +82,7 @@ print(f"The predicted token is: {predicted_token}")
```bash
!echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model google/bigbird-roberta-base --device 0
```

</hfoption>
</hfoptions>

@@ -52,6 +52,7 @@ Through photosynthesis, plants capture energy from sunlight using a green pigmen
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle.""")
```

</hfoption>
<hfoption id="AutoModel">

@@ -77,6 +78,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

</hfoption>
<hfoption id="transformers">

@@ -135,31 +135,26 @@ print(output)
[[autodoc]] BioGptConfig

## BioGptTokenizer

[[autodoc]] BioGptTokenizer
    - save_vocabulary

## BioGptModel

[[autodoc]] BioGptModel
    - forward

## BioGptForCausalLM

[[autodoc]] BioGptForCausalLM
    - forward

## BioGptForTokenClassification

[[autodoc]] BioGptForTokenClassification
    - forward

## BioGptForSequenceClassification

[[autodoc]] BioGptForSequenceClassification
@@ -35,10 +35,8 @@ Several versions of the model weights are available on Hugging Face:
* [**`microsoft/bitnet-b1.58-2B-4T-gguf`**](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): Contains the model weights in GGUF format, compatible with the `bitnet.cpp` library for CPU inference.

### Model Details

* **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
    * Uses Rotary Position Embeddings (RoPE).
    * Uses squared ReLU (ReLU²) activation in FFN layers.

@@ -58,10 +56,8 @@ Several versions of the model weights are available on Hugging Face:
3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
* **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).

## Usage tips

**VERY IMPORTANT NOTE ON EFFICIENCY**

> Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library.

@@ -106,7 +102,6 @@ response = tokenizer.decode(chat_outputs[0][chat_input.shape[-1]:], skip_special
print("\nAssistant Response:", response)
```

## BitNetConfig

[[autodoc]] BitNetConfig
@@ -55,7 +55,6 @@ found [here](https://github.com/facebookresearch/ParlAI).
Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.

## Resources

- [Causal language modeling task guide](../tasks/language_modeling)

@@ -71,7 +71,6 @@ An example:
`facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with
[BlenderbotSmall](blenderbot-small).

## Resources

- [Causal language modeling task guide](../tasks/language_modeling)

@@ -25,7 +25,6 @@ rendered properly in your Markdown viewer.
[BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for *both* understanding and generation tasks. Most existing pretrained models are only good at one or the other. It uses a captioner to generate captions and a filter to remove the noisy captions. This increases training data quality and more effectively uses the messy web data.

You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.

> [!TIP]

@@ -48,7 +48,6 @@ See also:
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)

⚡️ Inference
- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).

@@ -83,7 +83,6 @@ print(tokenizer.decode(generated_ids[0]))
This model was contributed by [itazap](https://huggingface.co/<itazap>).
The original code can be found [here](<https://github.com/facebookresearch/blt>).

## BltConfig

[[autodoc]] BltConfig

@@ -54,6 +54,7 @@ The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImagePr
encode the text and prepare the images respectively.

The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`].

```python
>>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
>>> import requests

@@ -76,6 +77,7 @@ The following example shows how to run contrastive learning using [`BridgeTowerP
```

The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].

```python
>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
>>> import requests
@@ -130,7 +132,6 @@ Tips:
- Please refer to [Table 5](https://huggingface.co/papers/2206.08657) for BridgeTower's performance on Image Retrieval and other downstream tasks.
- The PyTorch version of this model is only available in torch 1.10 and higher.

## BridgeTowerConfig

[[autodoc]] BridgeTowerConfig
@@ -177,4 +178,3 @@ Tips:
[[autodoc]] BridgeTowerForImageAndTextRetrieval
    - forward
@@ -57,7 +57,6 @@ def expand_and_normalize_bbox(bboxes, doc_width, doc_height):
- [`~transformers.BrosForTokenClassification.forward`, `~transformers.BrosSpadeEEForTokenClassification.forward`, `~transformers.BrosSpadeELForTokenClassification.forward`] require not only `input_ids` and `bbox` but also `box_first_token_mask` for loss calculation. It is a mask to filter out non-first tokens of each box. You can obtain this mask by saving start token indices of bounding boxes when creating `input_ids` from words. You can make `box_first_token_mask` with the following code:

```python
def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
@@ -102,7 +101,6 @@ def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
[[autodoc]] BrosModel
    - forward

## BrosForTokenClassification

[[autodoc]] BrosForTokenClassification

@@ -50,6 +50,7 @@ from transformers import pipeline
pipeline = pipeline("fill-mask", model="camembert-base", dtype=torch.float16, device=0)
pipeline("Le camembert est un délicieux fromage <mask>.")
```

</hfoption>

<hfoption id="AutoModel">

@@ -72,6 +73,7 @@ predicted_token = tokenizer.decode(predicted_token_id)
print(f"The predicted token is: {predicted_token}")
```

</hfoption>

<hfoption id="transformers CLI">
@@ -84,7 +86,6 @@ echo -e "Le camembert est un délicieux fromage <mask>." | transformers run --ta
</hfoptions>

Quantization reduces the memory burden of large models by representing weights in lower precision. Refer to the [Quantization](../quantization/overview) overview for available options.

The example below uses [bitsandbytes](../quantization/bitsandbytes) quantization to quantize the weights to 8-bits.
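
The page's own example follows in the full document; a minimal sketch of 8-bit loading with bitsandbytes, using an illustrative CamemBERT checkpoint, looks like this:

```python
# Sketch: load a masked-language model in 8-bit with bitsandbytes (requires the bitsandbytes package).
from transformers import AutoModelForMaskedLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForMaskedLM.from_pretrained(
    "almanach/camembert-large",          # illustrative checkpoint
    quantization_config=quantization_config,
    device_map="auto",
)
```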
@@ -86,6 +86,7 @@ echo -e "Plant create energy through a process known as photosynthesis." | trans
inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
```

- CANINE is primarily designed to be fine-tuned on a downstream task. The pretrained model can be used for either masked language modeling or next sentence prediction.

## CanineConfig
@@ -28,7 +28,6 @@ rendered properly in your Markdown viewer.
The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models
](https://huggingface.co/papers/2405.09818) by the META AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates textual responses. The image generation module is not released yet.

The abstract from the paper is the following:

*We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training
@@ -43,7 +42,6 @@ including Gemini Pro and GPT-4V, according to human judgments on a new long-form
generation evaluation, where either the prompt or outputs contain mixed sequences of both images and
text. Chameleon marks a significant step forward in unified modeling of full multimodal documents*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/chameleon_arch.png"
alt="drawing" width="600"/>

@@ -52,7 +50,6 @@ alt="drawing" width="600"/>
This model was contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/facebookresearch/chameleon).

## Usage tips

- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating.

@@ -29,11 +29,9 @@ The abstract from the paper is the following:
*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.*

This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/neonbjb/tortoise-tts).

## Usage tips

1. CLVP is an integral part of the Tortoise TTS model.
@@ -41,7 +39,6 @@ The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for tortoise usage.
4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz, contrary to other audio models which expect 16 kHz.

## Brief Explanation:

- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.

@@ -51,7 +48,6 @@ The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
- At the end, we compare each speech vector with the text vector to see which speech vector is most similar to the text vector.
- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method.

Example:

```python
@@ -74,7 +70,6 @@ Example :
>>> generated_output = model.generate(**processor_output)
```

## ClvpConfig

[[autodoc]] ClvpConfig

@@ -128,4 +123,3 @@ Example :
## ClvpDecoder

[[autodoc]] ClvpDecoder

@@ -143,6 +143,7 @@ visualizer("""def func(a, b):
- Infilling is only available in the 7B and 13B base models, and not in the Python, Instruct, 34B, or 70B models.
- Use the `<FILL_ME>` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself.

```py
from transformers import LlamaForCausalLM, CodeLlamaTokenizer

@@ -158,6 +159,7 @@ visualizer("""def func(a, b):
filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0]
print(PROMPT.replace("<FILL_ME>", filling))
```

- Use `bfloat16` for further training or fine-tuning and `float16` for inference.
- The `BOS` character is not used for infilling when encoding the prefix or suffix, but only at the beginning of each prompt.
- The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of the word (for example, “Banana”), the tokenizer doesn’t prepend the prefix space to the string.

@@ -22,14 +22,12 @@ rendered properly in your Markdown viewer.
</div>
</div>

# Cohere

Cohere [Command-R](https://cohere.com/blog/command-r) is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens.

You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.

> [!TIP]
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
@@ -123,7 +121,6 @@ visualizer("Plants create energy through a process known as")
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/cohere-attn-mask.png"/>
</div>

## Notes
- Don’t use the dtype parameter in [`~AutoModel.from_pretrained`] if you’re using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
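
A sketch of the `torch.autocast` route mentioned in that note; the checkpoint and prompt are illustrative, and FlashAttention-2 plus a CUDA device are assumed:

```python
# Run generation under autocast instead of passing a dtype to from_pretrained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-v01"  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, attn_implementation="flash_attention_2", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Plants create energy through", return_tensors="pt").to(model.device)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```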
@@ -145,7 +142,6 @@ visualizer("Plants create energy through a process known as")
[[autodoc]] CohereModel
    - forward

## CohereForCausalLM

[[autodoc]] CohereForCausalLM
@@ -22,7 +22,6 @@ rendered properly in your Markdown viewer.
</div>
</div>

# Cohere 2

[Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k. The model features three layers with sliding window attention and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
@ -31,7 +30,6 @@ This model is optimized for speed, cost-performance, and compute resources.
|
|||||||
|
|
||||||
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
|
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
|
||||||
|
|
||||||
|
|
||||||
> [!TIP]
|
> [!TIP]
|
||||||
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
|
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
|
||||||
|
|
||||||
@ -136,7 +134,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
|
|||||||
[[autodoc]] Cohere2Model
|
[[autodoc]] Cohere2Model
|
||||||
- forward
|
- forward
|
||||||
|
|
||||||
|
|
||||||
## Cohere2ForCausalLM
|
## Cohere2ForCausalLM
|
||||||
|
|
||||||
[[autodoc]] Cohere2ForCausalLM
|
[[autodoc]] Cohere2ForCausalLM
|
||||||
|
@ -113,6 +113,7 @@ outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
|
|||||||
|
|
||||||
print(outputs)
|
print(outputs)
|
||||||
```
|
```
|
||||||
|
|
||||||
</hfoption>
|
</hfoption>
|
||||||
</hfoptions>
|
</hfoptions>
|
||||||
|
|
||||||
|
@ -42,7 +42,6 @@ NLP tasks in the settings of few-shot (even zero-shot) learning.*
|
|||||||
This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
|
This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
|
||||||
here: https://github.com/TsinghuaAI/CPM-Generate
|
here: https://github.com/TsinghuaAI/CPM-Generate
|
||||||
|
|
||||||
|
|
||||||
<Tip>
|
<Tip>
|
||||||
|
|
||||||
CPM's architecture is the same as GPT-2, except for the tokenization method. Refer to the [GPT-2 documentation](gpt2) for
@ -50,7 +49,6 @@ API reference information.

</Tip>

## CpmTokenizer

[[autodoc]] CpmTokenizer

@ -346,7 +346,6 @@ out.loss.backward()
This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
The original code can be found [here](https://github.com/SesameAILabs/csm).

## CsmConfig

[[autodoc]] CsmConfig

@ -55,7 +55,6 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
pre-computed values in the context of text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward)
method for more information on the usage of this argument.

## Resources

- [Text classification task guide](../tasks/sequence_classification)

@ -77,7 +77,9 @@ for result in results:
box = [round(i, 2) for i in box.tolist()]
print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```

This should output

```
cat: 0.87 [14.7, 49.39, 320.52, 469.28]
remote: 0.86 [41.08, 72.37, 173.39, 117.2]
@ -89,6 +91,7 @@ couch: 0.59 [-0.04, 1.34, 639.9, 477.09]
There are three other ways to instantiate a DAB-DETR model (depending on what you prefer):

Option 1: Instantiate DAB-DETR with pre-trained weights for entire model

```py
>>> from transformers import DabDetrForObjectDetection

@ -96,19 +99,21 @@ Option 1: Instantiate DAB-DETR with pre-trained weights for entire model
```

Option 2: Instantiate DAB-DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone

```py
>>> from transformers import DabDetrConfig, DabDetrForObjectDetection

>>> config = DabDetrConfig()
>>> model = DabDetrForObjectDetection(config)
```

Option 3: Instantiate DAB-DETR with randomly initialized weights for backbone + Transformer

```py
>>> config = DabDetrConfig(use_pretrained_backbone=False)
>>> model = DabDetrForObjectDetection(config)
```

## DabDetrConfig

[[autodoc]] DabDetrConfig

@ -23,7 +23,6 @@ rendered properly in your Markdown viewer.

## Overview

The DAC model was proposed in [Descript Audio Codec: High-Fidelity Audio Compression with Improved RVQGAN](https://huggingface.co/papers/2306.06546) by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar.

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 KHz audio into tokens at just 8kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
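
As a rough illustration of the codec workflow described above, the sketch below compresses an audio array and reconstructs it with the Transformers DAC classes; the `descript/dac_16khz` checkpoint name and the encode/decode output attribute names are assumptions, so check the DAC API reference before relying on them.

```py
# Hedged sketch of DAC compression/reconstruction; checkpoint and output attribute names are assumed.
import numpy as np
from transformers import AutoProcessor, DacModel

processor = AutoProcessor.from_pretrained("descript/dac_16khz")  # assumed checkpoint
model = DacModel.from_pretrained("descript/dac_16khz")

audio = np.random.randn(16000).astype(np.float32)  # one second of dummy audio at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

encoded = model.encode(inputs["input_values"])                 # discrete codes + quantized latents (assumed API)
reconstructed = model.decode(encoded.quantized_representation)  # reconstructed waveform (assumed API)
```
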
@ -35,7 +34,6 @@ The abstract from the paper is the following:
This model was contributed by [Kamil Akesbi](https://huggingface.co/kamilakesbi).
The original code can be found [here](https://github.com/descriptinc/descript-audio-codec/tree/main?tab=readme-ov-file).

## Model structure

The Descript Audio Codec (DAC) model is structured into three distinct stages:

@ -35,7 +35,6 @@ We estimate that this data is at least 2x better token-for-token than the data w
This new dataset was developed using the full suite of Databricks tools, including Apache Spark™ and Databricks notebooks for data processing, and Unity Catalog for data management and governance.
We used curriculum learning for pretraining, changing the data mix during training in ways we found to substantially improve model quality.

More detailed information about DBRX Instruct and DBRX Base can be found in our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).

This model was contributed by [eitan-turok](https://huggingface.co/eitanturok) and [abhi-db](https://huggingface.co/abhi-db). The original code can be found [here](https://github.com/databricks/dbrx-instruct), though this may not be up to date.
@ -65,6 +64,7 @@ print(tokenizer.decode(outputs[0]))
```

If you have flash-attention installed (`pip install flash-attn`), it is possible to generate faster. (The HuggingFace documentation for flash-attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2).)

```python
from transformers import DbrxForCausalLM, AutoTokenizer
import torch
@ -87,6 +87,7 @@ print(tokenizer.decode(outputs[0]))
```

You can also generate faster using the PyTorch scaled dot product attention. (The HuggingFace documentation for scaled dot product attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).)

```python
from transformers import DbrxForCausalLM, AutoTokenizer
import torch
@ -112,15 +113,12 @@ print(tokenizer.decode(outputs[0]))

[[autodoc]] DbrxConfig

## DbrxModel

[[autodoc]] DbrxModel
- forward

## DbrxForCausalLM

[[autodoc]] DbrxForCausalLM
- forward

@ -21,14 +21,12 @@ rendered properly in your Markdown viewer.
</div>
</div>

# DeBERTa-v2

[DeBERTa-v2](https://huggingface.co/papers/2006.03654) improves on the original [DeBERTa](./deberta) architecture by using a SentencePiece-based tokenizer and a new vocabulary size of 128K. It also adds an additional convolutional layer within the first transformer layer to better learn local dependencies of input tokens. Finally, the position projection and content projection matrices are shared in the attention layer to reduce the number of parameters.

You can find all the original [DeBERTa-v2] checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta-v2) organization.

> [!TIP]
> This model was contributed by [Pengcheng He](https://huggingface.co/DeBERTa).
>
@ -86,6 +84,7 @@ print(f"Predicted label: {predicted_label}")
```bash
echo -e "DeBERTa-v2 is great at understanding context!" | transformers run --task fill-mask --model microsoft/deberta-v2-xlarge-mnli --device 0
```

</hfoption>
</hfoptions>

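Beyond the CLI snippet above, the same MNLI-tuned checkpoint can also be used for zero-shot classification through the `pipeline` API. This is a small illustrative sketch with arbitrary candidate labels, not the documented recipe.

```py
# Hedged sketch: zero-shot classification with an MNLI-tuned DeBERTa-v2 checkpoint.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="microsoft/deberta-v2-xlarge-mnli")
result = classifier(
    "DeBERTa-v2 is great at understanding context!",
    candidate_labels=["technology", "sports", "cooking"],
)
print(result["labels"][0], result["scores"][0])
```
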
@ -119,7 +118,6 @@ print(f"Predicted label: {predicted_label}")

```

## DebertaV2Config

[[autodoc]] DebertaV2Config

@ -31,7 +31,6 @@ Even with less training data than RoBERTa, DeBERTa manages to outperform it on s

You can find all the original DeBERTa checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta) organization.

> [!TIP]
> Click on the DeBERTa models in the right sidebar for more examples of how to apply DeBERTa to different language tasks.

@ -46,7 +46,6 @@ This model was contributed by [edbeeching](https://huggingface.co/edbeeching). T

[[autodoc]] DecisionTransformerConfig

## DecisionTransformerGPT2Model

[[autodoc]] DecisionTransformerGPT2Model

@ -61,6 +61,7 @@ outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
print(time.time()-start)
```

This generated:

``````
@ -157,18 +158,20 @@ Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI
``````

Use the following to run it

```bash
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
```

If you have:

```bash
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found
```

error, it means NCCL was probably not loaded.

## DeepseekV3Config

@ -63,6 +63,7 @@ messages = [

pipe(text=messages, max_new_tokens=20, return_full_text=False)
```

</hfoption>

<hfoption id="AutoModel">
@ -115,6 +116,7 @@ output_text = processor.batch_decode(

print(output_text)
```

</hfoption>
</hfoptions>

@ -138,9 +140,11 @@ model = DeepseekVLForConditionalGeneration.from_pretrained(
quantization_config=quantization_config
)
```

### Notes

- Do inference with multiple images in a single conversation.

```py
import torch
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor

@ -62,6 +62,7 @@ messages = [

pipe(text=messages, max_new_tokens=20, return_full_text=False)
```

</hfoption>

<hfoption id="AutoModel">
@ -114,6 +115,7 @@ output_text = processor.batch_decode(

print(output_text)
```

</hfoption>
</hfoptions>

@ -137,9 +139,11 @@ model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
quantization_config=quantization_config
)
```

### Notes

- Do inference with multiple images in a single conversation.

```py
import torch
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor

@ -38,7 +38,6 @@ Currently one checkpoint is available for DePlot:

- `google/deplot`: DePlot fine-tuned on ChartQA dataset

```python
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
import requests
@ -57,6 +56,7 @@ print(processor.decode(predictions[0], skip_special_tokens=True))
## Fine-tuning

To fine-tune DePlot, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found that fine-tuning with the Adafactor optimizer and a cosine learning rate scheduler leads to faster convergence:

```python
from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup
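
# The original snippet is truncated by the diff hunk below; the following continuation is an
# illustrative sketch rather than the documented code. It assumes `model` is the
# Pix2StructForConditionalGeneration instance from the earlier snippet, and the learning rate,
# warmup and training-step counts are arbitrary.
optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)
```
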
@ -102,12 +102,14 @@ The network is supplemented with a focal length estimation head. A small convolu
The `use_fov_model` parameter in `DepthProConfig` controls whether **FOV prediction** is enabled. By default, it is set to `False` to conserve memory and computation. When enabled, the **FOV encoder** is instantiated based on the `fov_model_config` parameter, which defaults to a `Dinov2Model`. The `use_fov_model` parameter can also be passed when initializing the `DepthProForDepthEstimation` model.

The pretrained model at checkpoint `apple/DepthPro-hf` uses the FOV encoder. To use the pretrained model without the FOV encoder, set `use_fov_model=False` when loading the model, which saves computation.

```py
>>> from transformers import DepthProForDepthEstimation
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)
```

To instantiate a new model with FOV encoder, set `use_fov_model=True` in the config.

```py
>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig(use_fov_model=True)
@ -115,6 +117,7 @@ To instantiate a new model with FOV encoder, set `use_fov_model=True` in the con
```

Or set `use_fov_model=True` when initializing the model, which overrides the value in config.

```py
>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig()

@ -113,6 +113,7 @@ DETR can be naturally extended to perform panoptic segmentation (which unifies s
There are three other ways to instantiate a DETR model (depending on what you prefer):

- Option 1: Instantiate DETR with pre-trained weights for entire model

```python
from transformers import DetrForObjectDetection

@ -120,6 +121,7 @@ model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
```

- Option 2: Instantiate DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone

```python
from transformers import DetrConfig, DetrForObjectDetection

@ -128,6 +130,7 @@ model = DetrForObjectDetection(config)
```

- Option 3: Instantiate DETR with randomly initialized weights for backbone + Transformer

```python
config = DetrConfig(use_pretrained_backbone=False)
model = DetrForObjectDetection(config)

@ -117,11 +117,9 @@ out = model(**inputs)
out.loss.backward()
```

This model was contributed by [Jaeyong Sung](https://huggingface.co/buttercrab), [Arthur Zucker](https://huggingface.co/ArthurZ),
and [Anton Vlasjuk](https://huggingface.co/AntonV). The original code can be found [here](https://github.com/nari-labs/dia/).

## DiaConfig

[[autodoc]] DiaConfig

@ -35,7 +35,6 @@ The abstract from the paper is the following:
### Usage tips
The hyperparameters of this model are the same as those of the Llama model.

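Since the configuration mirrors Llama, a quick way to see this is to build a small randomly initialized model from a config. This is an illustrative sketch; the `DiffLlamaForCausalLM` class name and the chosen sizes are assumptions based on the Llama-style API rather than text from the diff.

```py
# Hedged sketch: DiffLlama accepts the familiar Llama-style hyperparameters.
from transformers import DiffLlamaConfig, DiffLlamaForCausalLM

config = DiffLlamaConfig(
    hidden_size=512,          # Llama-style width
    num_hidden_layers=4,      # Llama-style depth
    num_attention_heads=8,
    intermediate_size=1024,
)
model = DiffLlamaForCausalLM(config)  # randomly initialized small model
print(model.num_parameters())
```
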
## DiffLlamaConfig

[[autodoc]] DiffLlamaConfig

@ -19,7 +19,6 @@ specific language governing permissions and limitations under the License.
</div>
</div>

# DINOv2

[DINOv2](https://huggingface.co/papers/2304.07193) is a vision foundation model that uses [ViT](./vit) as a feature extractor for multiple downstream tasks like image classification and depth estimation. It focuses on stabilizing and accelerating training through techniques like a faster memory-efficient attention, sequence packing, improved stochastic depth, Fully Sharded Data Parallel (FSDP), and model distillation.
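
As a quick illustration of the feature-extractor use case described above, the sketch below pulls patch features from a DINOv2 backbone; the `facebook/dinov2-base` checkpoint name is an assumption chosen for illustration.

```py
# Hedged sketch: extracting image features with a DINOv2 backbone.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")  # assumed checkpoint
model = AutoModel.from_pretrained("facebook/dinov2-base")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

patch_features = outputs.last_hidden_state  # (batch, tokens, hidden) features for downstream tasks
print(patch_features.shape)
```
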
@ -45,7 +45,6 @@ Tips:
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/facebookresearch/dinov2).

## Dinov2WithRegistersConfig

[[autodoc]] Dinov2WithRegistersConfig

@ -19,7 +19,6 @@ specific language governing permissions and limitations under the License.
</div>
</div>

# DINOv3

[DINOv3](https://huggingface.co/papers/2508.10104) is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.

@ -85,6 +85,7 @@ print(f"The predicted class label is: {predicted_class_label}")
## Notes

- The pretrained DiT weights can be loaded in a [BEiT] model with a modeling head to predict visual tokens.

```py
from transformers import BeitForMaskedImageModeling

@ -17,7 +17,6 @@ rendered properly in your Markdown viewer.

# Doge

## Overview

Doge is a series of small language models based on the [Doge](https://github.com/SmallDoges/small-doge) architecture. It aims to combine the advantages of state-space and self-attention algorithms, computing dynamic masks from cached value states with the zero-order hold method to address the tendency of mainstream language models to get lost in long contexts. The models are pre-trained on the `smollm-corpus` with the `wsd_scheduler` scheduler, and can continue training on new datasets or add sparse-activation feedforward networks from stable-stage checkpoints.

@ -28,7 +27,6 @@ As shown in the figure below, the sequence transformation part of the Doge archi

Check out all Doge model checkpoints [here](https://huggingface.co/collections/SmallDoge/doge-slm-679cc991f027c4a3abbded4a).

## Usage

<details>
@ -44,6 +42,7 @@ inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs))
```

</details>

<details>
@ -82,6 +81,7 @@ outputs = model.generate(
streamer=steamer
)
```

</details>

## DogeConfig

@ -25,7 +25,6 @@ The abstract from the report is the following:

*Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.*

## Dots1Config

[[autodoc]] Dots1Config

@ -45,6 +45,7 @@ results = keypoint_matcher([url_0, url_1], threshold=0.9)
print(results[0])
# {'keypoint_image_0': {'x': ..., 'y': ...}, 'keypoint_image_1': {'x': ..., 'y': ...}, 'score': ...}
```

</hfoption>
<hfoption id="AutoModel">

@ -167,4 +168,3 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
[[autodoc]] EfficientLoFTRForKeypointMatching

- forward

@ -34,7 +34,6 @@ To go even further, we use neural architecture search to design a new baseline n
This model was contributed by [adirik](https://huggingface.co/adirik).
The original code can be found [here](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).

## EfficientNetConfig

[[autodoc]] EfficientNetConfig
@ -58,4 +57,3 @@ The original code can be found [here](https://github.com/tensorflow/tpu/tree/mas

[[autodoc]] EfficientNetForImageClassification
- forward

@ -29,7 +29,6 @@ The Emu3 model was proposed in [Emu3: Next-Token Prediction is All You Need](htt

Emu3 is a multimodal LLM that uses vector quantization to tokenize images into discrete tokens. Discretized image tokens are later fused with text token ids for image and text generation. The model can additionally generate images by predicting image token ids.

The abstract from the paper is the following:

*While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.*
@ -45,11 +44,9 @@ Tips:
> [!TIP]
> Emu3 implementation in Transformers uses a special image token to indicate where to merge image embeddings. The special image token isn't new and uses one of the reserved tokens: `<|extra_0|>`. You have to add `<image>` to your prompt in the place where the image should be embedded for correct generation.

This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/baaivision/Emu3).

## Usage example

### Text generation inference
@ -143,7 +140,6 @@ for i, image in enumerate(images['pixel_values']):

```

## Emu3Config

[[autodoc]] Emu3Config

@ -39,7 +39,6 @@ Architecturally, EoMT introduces a small set of **learned queries** and a lightw
alt="drawing" width="500"/>
</div>

The model supports semantic, instance, and panoptic segmentation using a unified architecture and task-specific post-processing.

## Usage Examples

@ -38,7 +38,6 @@ Other models from the family can be found at [Ernie 4.5 Moe](./ernie4_5_moe).
<img src="https://ernie.baidu.com/blog/posts/ernie4.5/overview.png"/>
</div>

## Usage Tips

### Generate text
@ -84,7 +83,6 @@ generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
This model was contributed by [Anton Vlasjuk](https://huggingface.co/AntonV).
The original code can be found [here](https://github.com/PaddlePaddle/ERNIE).

## Ernie4_5Config

[[autodoc]] Ernie4_5Config

@ -40,7 +40,6 @@ Other models from the family can be found at [Ernie 4.5](./ernie4_5).
<img src="https://ernie.baidu.com/blog/posts/ernie4.5/overview.png"/>
</div>

## Usage Tips

### Generate text
@ -167,7 +166,6 @@ generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
This model was contributed by [Anton Vlasjuk](https://huggingface.co/AntonV).
The original code can be found [here](https://github.com/PaddlePaddle/ERNIE).

## Ernie4_5_MoeConfig

[[autodoc]] Ernie4_5_MoeConfig

@ -40,7 +40,6 @@ The abstract from the paper is the following:
*Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for lowresource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.*
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie_m).

## Usage tips

- Ernie-M is a BERT-like model so it is a stacked Transformer Encoder.
@ -59,7 +58,6 @@ This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). Th

[[autodoc]] ErnieMConfig

## ErnieMTokenizer

[[autodoc]] ErnieMTokenizer
@ -68,7 +66,6 @@ This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). Th
- create_token_type_ids_from_sequences
- save_vocabulary

## ErnieMModel

[[autodoc]] ErnieMModel
@ -79,19 +76,16 @@ This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). Th
[[autodoc]] ErnieMForSequenceClassification
- forward

## ErnieMForMultipleChoice

[[autodoc]] ErnieMForMultipleChoice
- forward

## ErnieMForTokenClassification

[[autodoc]] ErnieMForTokenClassification
- forward

## ErnieMForQuestionAnswering

[[autodoc]] ErnieMForQuestionAnswering

@ -44,12 +44,10 @@ sequence alignment (MSA) step at inference time, which means that ESMFold checkp
they do not require a database of known protein sequences and structures with associated external query tools
to make predictions, and are much faster as a result.

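To make the single-sequence workflow concrete, here is a hedged folding sketch; the `facebook/esmfold_v1` checkpoint name and the `positions` output attribute are assumptions about the ESMFold integration, so verify them against the ESM API reference.

```py
# Hedged sketch: single-sequence structure prediction with ESMFold (no MSA or external databases).
import torch
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")  # assumed checkpoint
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer([sequence], return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.positions.shape)  # predicted atom coordinates (attribute name assumed)
```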

The abstract from
"Biological structure and function emerge from scaling unsupervised learning to 250
million protein sequences" is

*In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised
learning has led to major advances in representation learning and statistical generation. In the life sciences, the
anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling
@ -63,7 +61,6 @@ can be identified by linear projections. Representation learning produces featur
applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and
improving state-of-the-art features for long-range contact prediction.*

The abstract from
"Language models of protein sequences at the scale of evolution enable accurate structure prediction" is

@ -75,7 +75,6 @@ Tips:
- This model was contributed by [Xibin Bayes Zhou](https://huggingface.co/XibinBayesZhou).
- The original code can be found [here](https://github.com/westlake-repl/Evolla).

## EvollaConfig

[[autodoc]] EvollaConfig

@ -33,7 +33,6 @@ For more details, please refer to our [technical report](https://huggingface.co/

All model weights including quantized versions are available at [Huggingface Collections](https://huggingface.co/collections/LGAI-EXAONE/exaone-40-686b2e0069800c835ed48375).

## Model Details

### Model Specifications
@ -57,7 +56,6 @@ All model weights including quantized versions are available at [Huggingface Col
| Tied word embedding | False | True |
| Knowledge cut-off | Nov. 2024 | Nov. 2024 |

## Usage tips

### Non-reasoning mode

@ -21,7 +21,6 @@ The [FalconH1](https://huggingface.co/blog/tiiuae/falcon-h1) model was developed
This model was contributed by [DhiyaEddine](https://huggingface.co/DhiyaEddine), [ybelkada](https://huggingface.co/ybelkada), [JingweiZuo](https://huggingface.co/JingweiZuo), [IlyasChahed](https://huggingface.co/IChahed), and [MaksimVelikanov](https://huggingface.co/yellowvm).
The original code can be found [here](https://github.com/tiiuae/Falcon-H1).

## FalconH1Config

| Model | Depth | Dim | Attn Heads | KV | Mamba Heads | d_head | d_state | Ctx Len |
@ -33,8 +32,6 @@ The original code can be found [here](https://github.com/tiiuae/Falcon-H1).
| H1 7B | 44 | 3072 | 12 | 2 | 24 | 128 / 128 | 256 | 256K |
| H1 34B | 72 | 5120 | 20 | 4 | 32 | 128 / 128 | 256 | 256K |

[[autodoc]] FalconH1Config

<!---

@ -27,7 +27,6 @@ The abstract from the original FastSpeech2 paper is the following:

This model was contributed by [Connor Henderson](https://huggingface.co/connor-henderson). The original code can be found [here](https://github.com/espnet/espnet/blob/master/espnet2/tts/fastspeech2/fastspeech2.py).

## 🤗 Model Architecture
FastSpeech2's general structure with a Mel-spectrogram decoder was implemented, and the traditional transformer blocks were replaced with conformer blocks as done in the ESPnet library.

@ -90,6 +89,7 @@ sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```

4. Run inference with a pipeline and specify which vocoder to use

```python
from transformers import pipeline, FastSpeech2ConformerHifiGan
import soundfile as sf
@ -102,7 +102,6 @@ speech = synthesiser("Hello, my dog is cooler than you!")
sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"])
```

## FastSpeech2ConformerConfig

[[autodoc]] FastSpeech2ConformerConfig

@ -35,7 +35,6 @@ Google has released the following variants:

The original checkpoints can be found [here](https://github.com/google-research/google-research/tree/master/ul2).

## Running on low resource devices

The model is pretty heavy (~40GB in half precision) so if you just want to run the model, make sure you load your model in 8bit, and use `device_map="auto"` to make sure you don't have any OOM issue!
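
A minimal sketch of that advice, assuming the `google/flan-ul2` checkpoint and the bitsandbytes backend (`pip install bitsandbytes accelerate`); it is meant as an illustration rather than the documented recipe.

```py
# Hedged sketch: load FLAN-UL2 in 8-bit with automatic device placement to avoid OOM.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig

model_id = "google/flan-ul2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

inputs = tokenizer("Translate to German: How old are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
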
@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.

-->
@ -90,6 +89,7 @@ echo -e "Plants create energy through a process known as" | transformers run --t
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to only quantize the weights to 4-bits.

```py

#pip install torchao
@ -119,7 +119,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))

```

## FlexOlmoConfig

[[autodoc]] FlexOlmoConfig
Some files were not shown because too many files have changed in this diff.