Format empty lines and white space in markdown files. (#41100)

* Remove additional white space and empty lines from markdown files

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Add empty lines around code

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
This commit is contained in:
Yuanyuan Chen
2025-09-24 07:20:01 +08:00
committed by GitHub
parent 99b0995138
commit f64354e89a
344 changed files with 673 additions and 1092 deletions

View File

@ -38,7 +38,6 @@ In particular all "Please explain" questions or objectively very user-specific f
* "How to train T5 on De->En translation?"
## The GitHub Issues
Everything which hints at a bug should be opened as an [issue](https://github.com/huggingface/transformers/issues).
@ -247,7 +246,6 @@ You are not required to read the following guidelines before opening an issue. H
Try not to use italics and bold text too much, as these often make the text more difficult to read.
12. If you are cross-referencing a specific comment in a given thread or another issue, always link to that specific comment rather than using the issue link. If you do the latter, it can be very difficult to find which specific comment you're referring to.
To get the link to a specific comment, do not copy the URL from your browser's location bar; instead, click the `...` icon in the upper right corner of the comment and then select "Copy Link".
@ -257,7 +255,6 @@ You are not required to read the following guidelines before opening an issue. H
1. https://github.com/huggingface/transformers/issues/9257
2. https://github.com/huggingface/transformers/issues/9257#issuecomment-749945162
13. If you are replying to the last comment, it's totally fine to make your reply with just your comment in it. The readers can follow the information flow here.
But if you're replying to a comment from further back, it's always good practice to quote just the relevant lines you're replying to. The `>` is used for quoting, or you can always use the menu to do so. For example, your editor box will look like:

View File

@ -63,12 +63,11 @@ limitations under the License.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal models, for both inference and training.
It centralizes the model definition so that this definition is agreed upon across the ecosystem. `transformers` is the
pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training
frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch-Lightning, ...), inference engines (vLLM, SGLang, TGI, ...),
and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from `transformers`.
@ -194,7 +193,6 @@ pipeline("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.pn
<details>
<summary>Visual question answering</summary>
<h3 align="center">
<a><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg"></a>
</h3>

View File

@ -6,7 +6,7 @@ developers, researchers, students, professors, engineers, and anyone else to bui
In this list, we showcase incredibly impactful and novel projects that have pushed the field forward. We celebrate
100 of these projects as we reach the milestone of 100k stars as a community; but we're very open to pull requests
adding other projects to the list. If you believe a project should be here and it's not, then please, open a PR
to add it.
## [gpt4all](https://github.com/nomic-ai/gpt4all)
@ -49,7 +49,7 @@ Keywords: LLMs, Large Language Models, Agents, Chains
[LlamaIndex](https://github.com/run-llama/llama_index) is a project that provides a central interface to connect your LLMs with external data. It provides various kinds of indices and retrieval mechanisms to perform different LLM tasks and obtain knowledge-augmented results.
Keywords: LLMs, Large Language Models, Data Retrieval, Indices, Knowledge Augmentation
## [ParlAI](https://github.com/facebookresearch/ParlAI)
@ -257,7 +257,7 @@ Stable-Dreamfusion is a pytorch implementation of the text-to-3D model Dreamfusi
Keywords: Text-to-3D, Stable Diffusion
## [txtai](https://github.com/neuml/txtai)
[txtai](https://github.com/neuml/txtai) is an open-source platform for semantic search and workflows powered by language models. txtai builds embeddings databases, which are a union of vector indexes and relational databases enabling similarity search with SQL. Semantic workflows connect language models together into unified applications.
Keywords: Semantic search, LLM
@ -309,8 +309,8 @@ Keywords: OCR, LaTeX, Math formula
OpenCLIP is an open source implementation of OpenAI's CLIP.
The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift.
The starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset.
Specifically, a ResNet-50 model trained with this codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet.
@ -596,7 +596,7 @@ Keywords: Data-Centric AI, Data Quality, Noisy Labels, Outlier Detection, Active
## [BentoML](https://github.com/bentoml/BentoML)
[BentoML](https://github.com/bentoml) is the unified framework for building, shipping, and scaling production-ready AI applications incorporating traditional ML, pre-trained AI models, Generative and Large Language Models.
All Hugging Face models and pipelines can be seamlessly integrated into BentoML applications, enabling the running of models on the most suitable hardware and independent scaling based on usage.
Keywords: BentoML, Framework, Deployment, AI Applications
@ -606,4 +606,3 @@ Keywords: BentoML, Framework, Deployment, AI Applications
[LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) offers a user-friendly fine-tuning framework that incorporates PEFT. The repository includes training (fine-tuning) and inference examples for LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, and other LLMs. A ChatGLM version is also available in [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning).
Keywords: PEFT, fine-tuning, LLaMA-2, ChatGLM, Qwen

View File

@ -69,7 +69,6 @@ CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.
To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
```bash
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```
@ -108,7 +107,6 @@ To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
```
You can also control the order of Intel XPUs with:
```bash
@ -120,7 +118,5 @@ For more information about device enumeration and sorting on Intel XPU, please r
</hfoption>
</hfoptions>
> [!WARNING]
> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.

View File

@ -145,7 +145,6 @@ Arguments can also be passed directly to `@auto_docstring` for more control. Use
The `Returns` and `Examples` parts of the docstring can also be manually specified.
```python
MODEL_COMMON_CUSTOM_ARGS = r"""
common_arg_1 (`torch.Tensor`, *optional*, defaults to `default_value`):
@ -202,7 +201,6 @@ There are some rules for documenting different types of arguments and they're li
If a standard argument behaves differently in your model, then you can override it locally in an `r""" """` block. This local definition has a higher priority. For example, the `labels` argument is often customized per model and typically requires overriding.
- New or custom arguments should be documented within an `r""" """` block after the signature if it is a function or in the `__init__` method's docstring if it is a class.
```py

View File

@ -59,11 +59,9 @@ Refer to the table below to compare how caching improves efficiency.
| without caching | with caching |
|---|---|
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` |
| attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |
## Cache class
A basic KV cache interface takes a key and value tensor for the current token and returns the updated `K` and `V` tensors. This is internally managed by a model's `forward` method.
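As a rough illustration (the tensor shapes below are made up), this is what that interface looks like when a [`DynamicCache`] is driven by hand:

```py
import torch
from transformers import DynamicCache

cache = DynamicCache()

# Hypothetical key/value states for one decoding step: (batch, num_heads, seq_len, head_dim)
key = torch.randn(1, 8, 1, 64)
value = torch.randn(1, 8, 1, 64)

# Each attention layer appends its current K/V and receives the full K/V tensors accumulated so far
full_keys, full_values = cache.update(key, value, layer_idx=0)
```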
@ -143,7 +141,6 @@ Cache position is used internally for two purposes:
The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache, infer_device
@ -160,7 +157,6 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)
```
## Legacy cache format
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].

View File

@ -29,7 +29,6 @@ the arguments, argument types, and function docstring are parsed in order to gen
Although passing Python functions is very convenient, the parser can only handle [Google-style](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
docstrings. Refer to the examples below for how to format a tool-ready function.
```py
def get_current_temperature(location: str, unit: str):
"""
@ -103,7 +102,6 @@ Hold the call in the `tool_calls` key of an `assistant` message. This is the rec
> [!WARNING]
> Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
```py
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
@ -131,7 +129,6 @@ The temperature in Paris, France right now is 22°C.<|im_end|>
> Although the key in the assistant message is called `tool_calls`, in most cases, models only emit a single tool call at a time. Some older models emit multiple tool calls at the same time, but this is a
> significantly more complex process, as you need to handle multiple tool responses at once and disambiguate them, often using tool call IDs. Please refer to the model card to see exactly what format a model expects for tool calls.
## JSON schemas
Another way to define tools is by passing a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step).
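For example, a schema equivalent to the `get_current_temperature` function above might look something like the sketch below (the description strings are illustrative):

```py
get_current_temperature_schema = {
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Get the current temperature at a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The location to get the temperature for, in the format 'City, Country'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in"},
            },
            "required": ["location", "unit"],
        },
    },
}
```

Schemas like this can be passed to `apply_chat_template` through the `tools` argument in place of the Python function itself.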

View File

@ -16,13 +16,13 @@ rendered properly in your Markdown viewer.
# Chat templates
The [chat basics](./conversations) guide covers how to store chat histories and generate text from chat models using [`TextGenerationPipeline`].
This guide is intended for more advanced users, and covers the underlying classes and methods, as well as the key concepts for understanding what's actually going on when you chat with a model.
The critical insight needed to understand chat models is this: All causal LMs, whether chat-trained or not, continue a sequence of tokens. When causal LMs are trained, the training usually begins with "pre-training" on a huge corpus of text, which creates a "base" model.
These base models are then often "fine-tuned" for chat, which means training them on data that is formatted as a sequence of messages. The chat is still just a sequence of tokens, though! The list of `role` and `content` dictionaries that you pass
to a chat model gets converted to a token sequence, often with control tokens like `<|user|>` or `<|assistant|>` or `<|end_of_message|>`, which allow the model to see the chat structure.
There are many possible chat formats, and different models may use different formats or control tokens, even if they were fine-tuned from the same base model!
Don't panic, though - you don't need to memorize every possible chat format in order to use chat models. Chat models come with **chat templates**, which indicate how they expect chats to be formatted.
@ -43,6 +43,7 @@ chat = [
tokenizer.apply_chat_template(chat, tokenize=False)
```
```md
<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]
```
@ -62,6 +63,7 @@ chat = [
tokenizer.apply_chat_template(chat, tokenize=False)
```
```md
<|user|>\nHello, how are you?</s>\n<|assistant|>\nI'm doing great. How can I help you today?</s>\n<|user|>\nI'd like to show off how chat templating works!</s>\n
```
@ -110,6 +112,7 @@ Pass the tokenized chat to [`~GenerationMixin.generate`] to generate a response.
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
```md
<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
@ -125,9 +128,9 @@ Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopte
### add_generation_prompt
You may have noticed the [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) argument in the above examples.
This argument adds tokens to the end of the chat that indicate the start of an `assistant` response. Remember: Beneath all the chat abstractions, chat models are still just language models that continue a sequence of tokens!
If you include tokens that tell it that it's now in an `assistant` response, it will correctly write a response, but if you don't include these tokens, the model may get confused and do something strange, like **continuing** the user's message instead of replying to it!
Let's see an example to understand what `add_generation_prompt` is actually doing. First, let's format a chat without `add_generation_prompt`:
@ -135,6 +138,7 @@ Let's see an example to understand what `add_generation_prompt` is actually doin
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
tokenized_chat
```
```md
<|im_start|>user
Hi there!<|im_end|>
@ -150,6 +154,7 @@ Now, let's format the same chat with `add_generation_prompt=True`:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
tokenized_chat
```
```md
<|im_start|>user
Hi there!<|im_end|>
@ -186,7 +191,6 @@ model.generate(**formatted_chat)
[`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline.
## Model training
Training a model with a chat template is a good way to ensure the template matches the tokens the model was trained on. Apply the chat template as a preprocessing step to your dataset. Set `add_generation_prompt=False` because the additional tokens to prompt an assistant response aren't helpful during training.
@ -212,6 +216,7 @@ dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])
```
```md
<|user|>
Which is bigger, the moon or the sun?</s>

View File

@ -18,8 +18,7 @@ rendered properly in your Markdown viewer.
Multimodal chat models accept inputs like images, audio or video, in addition to text. The `content` key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models whose `content` key is a single string.
In the same way the [Tokenizer](./fast_tokenizer) class handles chat templates and tokenization for text-only models,
the [Processor](./processors) class handles preprocessing, tokenization and chat templates for multimodal models. Their [`~ProcessorMixin.apply_chat_template`] methods are almost identical.
This guide will show you how to chat with multimodal models with the high-level [`ImageTextToTextPipeline`] and at a lower level using the [`~ProcessorMixin.apply_chat_template`] and [`~GenerationMixin.generate`] methods.
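For reference, a multimodal chat message with a list-valued `content` key looks roughly like this (the image URL is just an example):

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]
```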
@ -57,7 +56,6 @@ out = pipe(text=messages, max_new_tokens=128)
print(out[0]['generated_text'][-1]['content'])
```
```
Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
```
@ -66,10 +64,9 @@ Aside from the gradual descent from pirate-speak into modern American English (i
## Using `apply_chat_template`
Like [text-only models](./chat_templating), use the [`~ProcessorMixin.apply_chat_template`] method to prepare the chat messages for multimodal models.
This method handles the tokenization and formatting of the chat messages, including images and other media types. The resulting inputs are passed to the model for generation.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
@ -99,7 +96,6 @@ processed_chat = processor.apply_chat_template(messages, add_generation_prompt=T
print(list(processed_chat.keys()))
```
```
['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
```
@ -113,7 +109,6 @@ print(processor.decode(out[0]))
The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.
## Video inputs
Some vision models also support video inputs. The message format is very similar to the format for [image inputs](#image-inputs).
@ -148,6 +143,7 @@ messages = [
```
### Example: Passing decoded video objects
```python
import numpy as np
@ -167,7 +163,9 @@ messages = [
},
]
```
You can also use the existing `load_video()` function to load a video, edit it in memory, and pass it in the messages.
```python
# Make sure a video backend library (pyav, decord, or torchvision) is available.
@ -200,7 +198,6 @@ Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input
The `num_frames` parameter controls how many frames to uniformly sample from the video. Each checkpoint has a maximum frame count it was pretrained with and exceeding this count can significantly lower generation quality. It's important to choose a frame count that fits both the model capacity and your hardware resources. If `num_frames` isn't specified, the entire video is loaded without any frame sampling.
```python
processed_chat = processor.apply_chat_template(
messages,
@ -265,4 +262,3 @@ print(processed_chat.keys())
</hfoption>
</hfoptions>

View File

@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.
A chat template is a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template stored in the tokenizer's [chat_template](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.chat_template) attribute. Jinja is a templating language that allows you to write Python-like code and syntax.
```jinja
{%- for message in messages %}
{{- '<|' + message['role'] + '|>\n' }}
@ -108,7 +107,6 @@ We strongly recommend using `-` to ensure only the intended content is printed.
### Special variables and callables
The only constants in a template are the `messages` variable and the `add_generation_prompt` boolean. However, you have
access to **any other keyword arguments that are passed** to the [`~PreTrainedTokenizerBase.apply_chat_template`] method.
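For instance, any extra keyword argument becomes a variable of the same name inside the template. A minimal sketch is shown below; the `documents` argument only has an effect if the template actually references it:

```python
# Inside the template, a `documents` variable is now available alongside `messages`
formatted = tokenizer.apply_chat_template(
    messages,
    documents=[{"title": "Weather report", "text": "It is sunny in Paris today."}],
    add_generation_prompt=True,
    tokenize=False,
)
```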

View File

@ -48,7 +48,6 @@ transformers chat -h
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving.md#serve-cli)).
## TextGenerationPipeline
[`TextGenerationPipeline`] is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).
@ -109,7 +108,7 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})
```
In general, model size and performance are directly correlated. Larger models are slower in addition to requiring more memory because each active parameter must be read from memory for every generated token.
This is a bottleneck for LLM text generation and the main options for improving generation speed are to either quantize a model or use hardware with higher memory bandwidth. Adding more compute power doesn't meaningfully help.
You can also try techniques like [speculative decoding](./generation_strategies#speculative-decoding), where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token at a time. This significantly alleviates the bandwidth bottleneck and improves generation speed.
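A minimal sketch of what this looks like with [`~GenerationMixin.generate`] is shown below. The checkpoints are examples only, and the draft model must use the same tokenizer as the main model:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto")
# A smaller model from the same family proposes candidate tokens
assistant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```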

View File

@ -38,5 +38,3 @@ You are now ready to use your local model in Cursor! For instance, if you toggle
<h3 align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_serve_cursor_chat.png"/>
</h3>

View File

@ -389,7 +389,6 @@ from .utils import some_function
Only relative imports from the same-level `custom_generate` folder are supported. Parent/sibling folder imports are not valid. The `custom_generate` argument also works locally with any directory that contains a `custom_generate` structure. This is the recommended workflow for developing your custom generation method.
#### requirements.txt
You can optionally specify additional Python requirements in a `requirements.txt` file inside the `custom_generate` folder. These are checked at runtime and an exception will be thrown if they're missing, nudging users to update their environment accordingly.

View File

@ -19,7 +19,6 @@ rendered properly in your Markdown viewer.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers_as_a_model_definition.png"/>
</h3>
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal models, for both inference and training.

View File

@ -20,7 +20,6 @@ This page lists all of Transformers general utility functions that are found in
Most of those are only useful if you are studying the general code in the library.
## Enums and namedtuples
[[autodoc]] utils.ExplicitEnum

View File

@ -65,7 +65,6 @@ values. Here, for instance, it has two keys that are `sequences` and `scores`.
We document here all output types.
[[autodoc]] generation.GenerateDecoderOnlyOutput
[[autodoc]] generation.GenerateEncoderDecoderOutput
@ -74,13 +73,11 @@ We document here all output types.
[[autodoc]] generation.GenerateBeamEncoderDecoderOutput
## LogitsProcessor
A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for
generation.
[[autodoc]] AlternatingCodebooksLogitsProcessor
- __call__
@ -174,8 +171,6 @@ generation.
[[autodoc]] WatermarkLogitsProcessor
- __call__
## StoppingCriteria
A [`StoppingCriteria`] can be used to change when to stop generation (other than EOS token). Please note that this is exclusively available to our PyTorch implementations.
@ -300,7 +295,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
- to_legacy_cache
- from_legacy_cache
## Watermark Utils
[[autodoc]] WatermarkingConfig

View File

@ -22,8 +22,8 @@ worked around. We don't want for all users of `transformers` to have to install
we therefore mark those as soft dependencies rather than hard dependencies.
The transformers toolkit is not made to error-out on import of a model that has a specific dependency; instead, an
object for which you are lacking a dependency will error-out when calling any method on it. As an example, if
`torchvision` isn't installed, the fast image processors will not be available.
This object is still importable:
@ -55,7 +55,7 @@ All objects under a given filename have an automatic dependency to the tool link
**Tokenizers**: All files starting with `tokenization_` and ending with `_fast` have an automatic `tokenizers` dependency
**Vision**: All files starting with `image_processing_` have an automatic dependency to the `vision` dependency group;
at the time of writing, this only contains the `pillow` dependency.
**Vision + Torch + Torchvision**: All files starting with `image_processing_` and ending with `_fast` have an automatic
@ -66,7 +66,7 @@ All of these automatic dependencies are added on top of the explicit dependencie
### Explicit Object Dependencies
We add a method called `requires` that is used to explicitly specify the dependencies of a given object. As an
example, the `Trainer` class has two hard dependencies: `torch` and `accelerate`. Here is how we specify these
required dependencies:
```python

View File

@ -21,10 +21,8 @@ provides for it.
Most of those are only useful if you are adding new models in the library.
## Model addition debuggers
### Model addition debugger - context manager for model adders
This context manager is a power user tool intended for model adders. It tracks all forward calls within a model forward
@ -72,7 +70,6 @@ with model_addition_debugger_context(
```
### Reading results
The debugger generates two files from the forward call, both with the same base name, but ending either with
@ -231,10 +228,8 @@ Once the forward passes of two models have been traced by the debugger, one can
below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly
identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong.
![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/files_difference_debugging.png)
### Limitations and scope
This feature will only work for torch-based models. Models relying heavily on external kernel calls may work, but the trace will
@ -253,7 +248,7 @@ layers.
This small util is a power user tool intended for model adders and maintainers. It lists all test methods
existing in `test_modeling_common.py`, inherited by all model tester classes, and scans the repository to measure
how many tests are being skipped and for which models.
### Rationale
@ -268,8 +263,7 @@ This utility:
![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/f7f671f69b88ce4967e19179172c248958d35742/transformers/tests_skipped_visualisation.png)
### Usage
You can run the skipped test analyzer in two ways:

View File

@ -20,7 +20,6 @@ This page lists all the utility functions the library provides for pipelines.
Most of those are only useful if you are studying the code of the models in the library.
## Argument handling
[[autodoc]] pipelines.ArgumentHandler

View File

@ -67,7 +67,7 @@ out = model.generate(**inputs, do_sample=False, max_new_tokens=20, past_key_valu
## Fixed-size cache
The default [`DynamicCache`] prevents you from taking advantage of most just-in-time (JIT) optimizations because the cache size isn't fixed. JIT optimizations let you minimize latency at the expense of memory usage. All of the following cache types are compatible with JIT optimizations like [torch.compile](./llm_optims#static-kv-cache-and-torchcompile) to accelerate generation.
A fixed-size cache ([`StaticCache`]) pre-allocates a specific maximum cache size for the kv pairs. You can generate up to the maximum cache size without needing to modify it. However, having a fixed (usually large) size for the key/value states means that while generating, many tokens will actually be masked because they should not take part in the attention. So this trick makes it easy to `compile` the decoding stage, but it wastes computation on those masked positions in the attention. As with all things, it's a trade-off: it works very well if you generate several sequences of more or less the same length, but may be sub-optimal if you have, for example, one very long sequence followed by only short sequences (since the fixed cache size would be large, a lot of it would be wasted on the short sequences). Make sure you understand the impact if you use it!
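A minimal sketch of opting into a fixed-size cache is shown below; the checkpoint is an example, and [`~GenerationMixin.generate`] also accepts a pre-built [`StaticCache`] via `past_key_values`:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", dtype=torch.float16, device_map="auto")

inputs = tokenizer("The theory of relativity states", return_tensors="pt").to(model.device)
# Pre-allocate a fixed-size cache instead of the default DynamicCache
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="static")
```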

View File

@ -24,6 +24,7 @@ In Transformers, the [`~GenerationMixin.generate`] API handles text generation,
> [!TIP]
> You can also chat with a model directly from the command line. ([reference](./conversations.md#transformers))
>
> ```shell
> transformers chat Qwen/Qwen2.5-0.5B-Instruct
> ```
@ -35,6 +36,7 @@ Before you begin, it's helpful to install [bitsandbytes](https://hf.co/docs/bits
```bash
!pip install -U transformers bitsandbytes
```
Bitsandbytes supports multiple backends in addition to CUDA-based GPUs. Refer to the multi-backend installation [guide](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend) to learn more.
Load a LLM with [`~PreTrainedModel.from_pretrained`] and add the following two parameters to reduce the memory requirements.
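As a sketch, assuming the two parameters referred to here are `device_map` and `quantization_config` (the checkpoint name is an example):

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",                                           # automatically place weights on the available devices
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # load the weights in 4-bit to cut memory usage
)
```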
@ -154,7 +156,6 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
| `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. |
| `eos_token_id` | `list[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. |
## Pitfalls
The section below covers some common issues you may encounter during text generation and how to solve them.

View File

@ -66,6 +66,7 @@ If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows
```bash
!pip install transformers accelerate bitsandbytes optimum
```
```python
from transformers import AutoModelForCausalLM
@ -98,6 +99,7 @@ result
```
**Output**:
```
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
```
@ -116,6 +118,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```
**Output**:
```bash
29.0260648727417
```
@ -127,7 +130,6 @@ Note that if we had tried to run the model in full float32 precision, a whopping
If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., dtype=...)`, except when the original type is float32, in which case one can use either `float16` or `bfloat16` for inference.
Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.
```python
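# A minimal sketch of such a helper (assumes a CUDA device); the exact body in the original guide may differ slightly
import gc
import torch

def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
```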
@ -148,6 +150,7 @@ Let's call it now for the next experiment.
```python
flush()
```
From the Accelerate library, you can also use a device-agnostic utility method called [release_memory](https://github.com/huggingface/accelerate/blob/29be4788629b772a3b722076e433b5b3b5c85da3/src/accelerate/utils/memory.py#L63), which takes various hardware backends like XPU, MLU, NPU, MPS, and more into account.
```python
@ -204,6 +207,7 @@ result
```
**Output**:
```
Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single
```
@ -215,6 +219,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```
**Output**:
```
15.219234466552734
```
@ -222,8 +227,8 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090.
We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference.
We delete the models and flush the memory again.
```python
del model
del pipe
@ -245,6 +250,7 @@ result
```
**Output**:
```
Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument
```
@ -256,6 +262,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```
**Output**:
```
9.543574333190918
```
@ -270,6 +277,7 @@ Also note that inference here was again a bit slower compared to 8-bit quantizat
del model
del pipe
```
```python
flush()
```
@ -384,6 +392,7 @@ def alternating(list1, list2):
-----
"""
```
For demonstration purposes, we duplicate the system prompt by ten so that the input length is long enough to observe Flash Attention's memory savings.
We append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"`
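In code, assuming the system prompt and the text prompt above are held in variables named `system_prompt` and `prompt` (names assumed here), that amounts to:

```python
# Repeat the system prompt ten times and append the actual question
long_prompt = 10 * system_prompt + prompt
```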
@ -413,6 +422,7 @@ result
```
**Output**:
```
Generated in 10.96854019165039 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
@ -429,6 +439,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```
**Output**:
```bash
37.668193340301514
```
@ -460,6 +471,7 @@ result
```
**Output**:
```
Generated in 3.0211617946624756 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
@ -474,6 +486,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
```
**Output**:
```
32.617331981658936
```
@ -604,6 +617,7 @@ generated_text
```
**Output**:
```
shape of input_ids torch.Size([1, 21])
shape of input_ids torch.Size([1, 22])
@ -641,6 +655,7 @@ generated_text
```
**Output**:
```
shape of input_ids torch.Size([1, 1])
length of key-value cache 20
@ -712,6 +727,7 @@ tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):]
```
**Output**:
```
is a modified version of the function that returns Mega bytes instead.
@ -733,6 +749,7 @@ config = model.config
```
**Output**:
```
7864320000
```
@ -773,7 +790,6 @@ The most notable application of GQA is [Llama-v2](https://huggingface.co/meta-ll
> As a conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences as is the case for example for chat.
## Conclusion
The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. As an example, one such promising research direction is [speculative decoding](https://huggingface.co/papers/2211.17192) where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. Going into more detail is out of the scope of this notebook, but you can read more about it in this [nice blog post](https://huggingface.co/blog/assisted-generation).

View File

@ -54,7 +54,6 @@ The main class that implements callbacks is [`TrainerCallback`]. It gets the
Trainer's internal state via [`TrainerState`], and can take some actions on the training loop via
[`TrainerControl`].
## Available Callbacks
Here is the list of the available [`TrainerCallback`] in the library:

View File

@ -24,7 +24,6 @@ Each derived config class implements model specific attributes. Common attribute
`hidden_size`, `num_attention_heads`, and `num_hidden_layers`. Text models further implement:
`vocab_size`.
## PretrainedConfig
[[autodoc]] PretrainedConfig

View File

@ -25,7 +25,6 @@ on the formed batch.
Examples of use can be found in the [example scripts](../examples) or [example notebooks](../notebooks).
## Default data collator
[[autodoc]] data.data_collator.default_data_collator

View File

@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# DeepSpeed
[DeepSpeed](https://github.com/deepspeedai/DeepSpeed), powered by Zero Redundancy Optimizer (ZeRO), is an optimization library for training and fitting very large models onto a GPU. It is available in several ZeRO stages, where each stage progressively saves more GPU memory by partitioning the optimizer state, gradients, parameters, and enabling offloading to a CPU or NVMe. DeepSpeed is integrated with the [`Trainer`] class and most of the setup is automatically taken care of for you.
However, if you want to use DeepSpeed without the [`Trainer`], Transformers provides a [`HfDeepSpeedConfig`] class.

View File

@ -15,14 +15,12 @@ rendered properly in your Markdown viewer.
-->
# ExecuTorch
[`ExecuTorch`](https://github.com/pytorch/executorch) is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.
ExecuTorch introduces well defined entry points to perform model, device, and/or use-case specific optimizations such as backend delegation, user-defined compiler transformations, memory planning, and more. The first step in preparing a PyTorch model for execution on an edge device using ExecuTorch is to export the model. This is achieved through the use of a PyTorch API called [`torch.export`](https://pytorch.org/docs/stable/export.html).
## ExecuTorch Integration
An integration point is being developed to ensure that 🤗 Transformers can be exported using `torch.export`. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in `ExecuTorch`, particularly for mobile and edge use cases.

View File

@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.
A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy and PyTorch tensors.
## FeatureExtractionMixin
[[autodoc]] feature_extraction_utils.FeatureExtractionMixin

View File

@ -26,6 +26,7 @@ from transformers import AutoImageProcessor
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
```
Note that `use_fast` will be set to `True` by default in a future release.
When using a fast image processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.
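A small sketch of what that looks like, assuming a CUDA device is available (the image URL is just an example):

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
# Run the preprocessing on the GPU rather than the CPU
inputs = processor(images=image, device="cuda", return_tensors="pt")
```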
@ -57,7 +58,6 @@ Here are some speed comparisons between the base and fast image processors for t
These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon.com/ec2/instance-types/g5/), utilizing an NVIDIA A10G Tensor Core GPU.
## ImageProcessingMixin
[[autodoc]] image_processing_utils.ImageProcessingMixin
@ -72,7 +72,6 @@ These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon
[[autodoc]] image_processing_utils.BaseImageProcessor
## BaseImageProcessorFast
[[autodoc]] image_processing_utils_fast.BaseImageProcessorFast

View File

@ -55,7 +55,6 @@ logger.info("INFO")
logger.warning("WARN")
```
All the methods of this logging module are documented below; the main ones are
[`logging.get_verbosity`] to get the current level of verbosity in the logger and
[`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least

View File

@ -26,7 +26,6 @@ file or directory, or from a pretrained model configuration provided by the libr
The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`] and [`~generation.GenerationMixin`].
## PreTrainedModel
[[autodoc]] PreTrainedModel

View File

@ -51,4 +51,3 @@ to export models for different types of topologies or tasks.
### FeaturesManager
[[autodoc]] onnx.features.FeaturesManager

View File

@ -22,7 +22,6 @@ The `.optimization` module provides:
- several schedules in the form of schedule objects that inherit from `_LRSchedule`:
- a gradient accumulation class to accumulate the gradients of multiple batches
## AdaFactor
[[autodoc]] Adafactor

View File

@ -47,7 +47,6 @@ However, this is not always the case. Some models apply normalization or subsequ
</Tip>
You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you
will get `None`. Here for instance `outputs.loss` is the loss computed by the model, and `outputs.attentions` is
`None`.

View File

@ -81,7 +81,6 @@ for out in tqdm(pipe(KeyDataset(dataset, "file"))):
For ease of use, a generator is also possible:
```python
from transformers import pipeline
@ -196,7 +195,6 @@ This is a occasional very long sentence compared to the other. In that case, the
tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on
bigger batches, the program simply crashes.
```
------------------------------
Streaming no batching
@ -245,7 +243,6 @@ multiple forward pass of a model. Under normal circumstances, this would yield i
In order to circumvent this issue, both of these pipelines are a bit special: they are `ChunkPipeline` instead of
regular `Pipeline`. In short:
```python
preprocessed = pipe.preprocess(inputs)
model_outputs = pipe.forward(preprocessed)
@ -254,7 +251,6 @@ outputs = pipe.postprocess(model_outputs)
Now becomes:
```python
all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
@ -282,7 +278,6 @@ If you want to override a specific pipeline.
Don't hesitate to create an issue for your task at hand; the goal of pipelines is to be easy to use and support most
cases, so `transformers` might be able to support your use case.
If you simply want to try it out, you can:
- Subclass your pipeline of choice
@ -302,7 +297,6 @@ my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline)
That should enable you to do all the custom code you want.
## Implementing a pipeline
[Implementing a new pipeline](../add_new_pipeline)
@ -329,7 +323,6 @@ Pipelines available for audio tasks include the following.
- __call__
- all
### ZeroShotAudioClassificationPipeline
[[autodoc]] ZeroShotAudioClassificationPipeline

View File

@ -71,7 +71,6 @@ Additionally, the following method can be used to load values from a data file a
[[autodoc]] data.processors.glue.glue_convert_examples_to_features
## XNLI
[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the
@ -88,7 +87,6 @@ Please note that since the gold labels are available on the test set, evaluation
An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script.
## SQuAD
[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that
@ -115,11 +113,9 @@ Additionally, the following method can be used to convert SQuAD examples into
[[autodoc]] data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementioned method can be used with files containing the data as well as with the
*tensorflow_datasets* package. Examples are given below.
### Example usage
Here is an example using the processors as well as the conversion method using data files:
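Below is a hedged sketch of what such an example might look like; the tokenizer checkpoint and the data directory are placeholders:

```python
from transformers import AutoTokenizer
from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# `path/to/squad` is assumed to contain the SQuAD v2.0 JSON files
processor = SquadV2Processor()
examples = processor.get_dev_examples("path/to/squad")

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=False,
)
```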

View File

@ -22,7 +22,7 @@ Rust library [🤗 Tokenizers](https://github.com/huggingface/tokenizers). The "
1. a significant speed-up in particular when doing batched tokenization and
2. additional methods to map between the original string (character and words) and the token space (e.g. getting the
index of the token comprising a given character or the span of characters corresponding to a given token).
The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`]
implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and
@ -50,12 +50,11 @@ several advanced alignment methods which can be used to map between the original
token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding
to a given token).
# Multimodal Tokenizer
Apart from that, each tokenizer can be a "multimodal" tokenizer, which means that the tokenizer will hold all relevant special tokens
as part of its attributes for easier access. For example, if the tokenizer is loaded from a vision-language model like LLaVA, you will
be able to access `tokenizer.image_token_id` to obtain the special image token used as a placeholder.
To enable extra special tokens for any type of tokenizer, you have to add the following lines and save the tokenizer. Extra special tokens do not
have to be modality-related and can be anything that the model often needs access to. In the below code, the tokenizer at `output_dir` will have direct access
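A rough sketch of what those lines might look like; the `extra_special_tokens` argument and the token value below are assumptions for illustration, not the exact snippet:

```python
from transformers import AutoTokenizer

# Hypothetical example: register an extra special token under an easy-to-reach attribute name
tokenizer = AutoTokenizer.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    extra_special_tokens={"image_token": "<image>"},
)
tokenizer.save_pretrained("output_dir")
# After reloading from `output_dir`, `tokenizer.image_token` and `tokenizer.image_token_id` are directly accessible
```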

View File

@ -22,7 +22,6 @@ The video processor extends the functionality of image processors by allowing Vi
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video-related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM; the processor will try to load video-related configurations from a file named `preprocessing_config.json`.
### Usage Example
Here's an example of how to load a video processor with [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:
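A minimal version of that example:

```python
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
```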
@ -59,7 +58,6 @@ The video processor can also sample video frames using the technique best suited
</Tip>
```python
from transformers import AutoVideoProcessor
@ -92,4 +90,3 @@ print(processed_video_inputs.pixel_values_videos.shape)
## BaseVideoProcessor
[[autodoc]] video_processing_utils.BaseVideoProcessor

View File

@ -25,7 +25,6 @@ The abstract from the paper is the following:
*We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.*
This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali).
The original code can be found [here](https://github.com/apple/ml-aim).

View File

@ -98,7 +98,7 @@ print(response)
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4 and the [rhymes-ai/Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) checkpoint. This checkpoint replaces grouped GEMM with `torch.nn.Linear` layers for easier quantization.
```py
@ -142,7 +142,6 @@ response = processor.decode(output_ids, skip_special_tokens=True)
print(response)
```
## AriaImageProcessor
[[autodoc]] AriaImageProcessor

View File

@ -52,13 +52,13 @@ the authors compute the stats for a downstream dataset.
### Using Scaled Dot Product Attention (SDPA)
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```

View File

@ -23,7 +23,6 @@ automatically retrieve the relevant model given the name/path to the pretrained
Instantiating one of [`AutoConfig`], [`AutoModel`], and
[`AutoTokenizer`] will directly create a class of the relevant architecture. For instance
```python
model = AutoModel.from_pretrained("google-bert/bert-base-cased")
```

View File

@ -29,7 +29,7 @@ You can find all the original Aya Vision checkpoints under the [Aya Vision](http
> [!TIP]
> This model was contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan).
>
> Click on the Aya Vision models in the right sidebar for more examples of how to apply Aya Vision to different image-to-text tasks.
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

View File

@ -76,7 +76,7 @@ Note that 🤗 Optimum must be installed before using this feature. [Here's how
Flash Attention 2 is an even faster, optimized version of the previous optimization.
##### Installation
First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [above](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer).
@ -86,7 +86,6 @@ Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-fe
pip install -U flash-attn --no-build-isolation
```
##### Usage
To load a model using Flash Attention 2, we can pass the `attn_implementation="flash_attention_2"` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
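A minimal sketch of that call, assuming a CUDA device is available:

```py
# Half-precision Bark with Flash Attention 2 enabled.
import torch
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained(
    "suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")
```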
@ -97,7 +96,6 @@ model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_i
##### Performance comparison
The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase:
<div style="text-align: center">
@ -108,7 +106,6 @@ To put this into perspective, on an NVIDIA A100 and when generating 400 semantic
At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.
#### Combining optimization techniques
You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once.
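A minimal sketch of combining the techniques (assuming 🤗 Accelerate is installed for the CPU offload helper):

```py
# Half-precision + Flash Attention 2 + Bark's CPU offload helper.
import torch
from transformers import BarkModel

model = BarkModel.from_pretrained(
    "suno/bark-small", dtype=torch.float16, attn_implementation="flash_attention_2"
)
model.enable_cpu_offload()  # sub-models are moved to the GPU only when they are used
```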
@ -147,7 +144,7 @@ These presets are also uploaded in the hub [here](https://huggingface.co/suno/ba
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects.
```python
>>> # Multilingual speech - simplified Chinese
@ -165,7 +162,6 @@ Bark can generate highly realistic, **multilingual** speech as well as other aud
The model can also produce **nonverbal communications** like laughing, sighing and crying.
```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
@ -235,4 +231,3 @@ To save the audio, simply take the sample rate from the model config and some sc
[[autodoc]] BarkSemanticConfig
- all


@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
@ -46,6 +45,7 @@ pipeline = pipeline(
pipeline("Plants create <mask> through a process known as photosynthesis.")
```
</hfoption>
<hfoption id="AutoModel">


@ -31,7 +31,6 @@ You can find all of the original BARThez checkpoints under the [BARThez](https:/
> This model was contributed by [moussakam](https://huggingface.co/moussakam).
> Refer to the [BART](./bart) docs for more usage examples.
The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line.
<hfoptions id="usage">


@ -33,12 +33,9 @@ You can find all the original checkpoints under the [VinAI](https://huggingface.
The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">
<hfoption id="Pipeline">
```python
import torch
from transformers import pipeline
@ -98,8 +95,6 @@ transformers run --task summarization --model vinai/bartpho-word --device 0
</hfoption>
</hfoptions>
## Notes
- BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes.
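A minimal sketch of that note, assuming the `vinai/bartpho-word` checkpoint loads cleanly into the mBART seq2seq class:

```py
# Use the mBART-specific classes with BARTpho.
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-word")
model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-word")
```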


@ -81,7 +81,6 @@ API reference information.
</Tip>
## BertJapaneseTokenizer
[[autodoc]] BertJapaneseTokenizer


@ -26,7 +26,6 @@ rendered properly in your Markdown viewer.
[BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it's pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification.
You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization.
> [!TIP]
@ -49,6 +48,7 @@ pipeline = pipeline(
)
pipeline("Plants create <mask> through a process known as photosynthesis.")
```
</hfoption>
<hfoption id="AutoModel">


@ -47,6 +47,7 @@ pipeline = pipeline(
)
pipeline("Plants create [MASK] through a process known as photosynthesis.")
```
</hfoption>
<hfoption id="AutoModel">
@ -81,6 +82,7 @@ print(f"The predicted token is: {predicted_token}")
```bash
!echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers run --task fill-mask --model google/bigbird-roberta-base --device 0
```
</hfoption>
</hfoptions>


@ -52,6 +52,7 @@ Through photosynthesis, plants capture energy from sunlight using a green pigmen
These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure.
This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle.""")
```
</hfoption>
<hfoption id="AutoModel">
@ -77,6 +78,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
</hfoption>
<hfoption id="transformers">


@ -135,31 +135,26 @@ print(output)
[[autodoc]] BioGptConfig
## BioGptTokenizer
[[autodoc]] BioGptTokenizer
- save_vocabulary
## BioGptModel
[[autodoc]] BioGptModel
- forward
## BioGptForCausalLM
[[autodoc]] BioGptForCausalLM
- forward
## BioGptForTokenClassification
[[autodoc]] BioGptForTokenClassification
- forward
## BioGptForSequenceClassification
[[autodoc]] BioGptForSequenceClassification


@ -35,10 +35,8 @@ Several versions of the model weights are available on Hugging Face:
* [**`microsoft/bitnet-b1.58-2B-4T-gguf`**](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): Contains the model weights in GGUF format, compatible with the `bitnet.cpp` library for CPU inference.
### Model Details
* **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
* Uses Rotary Position Embeddings (RoPE).
* Uses squared ReLU (ReLU²) activation in FFN layers.
@ -58,10 +56,8 @@ Several versions of the model weights are available on Hugging Face:
3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
* **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).
## Usage tips
**VERY IMPORTANT NOTE ON EFFICIENCY**
> Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library.
@ -106,7 +102,6 @@ response = tokenizer.decode(chat_outputs[0][chat_input.shape[-1]:], skip_special
print("\nAssistant Response:", response)
```
## BitNetConfig
[[autodoc]] BitNetConfig


@ -55,7 +55,6 @@ found [here](https://github.com/facebookresearch/ParlAI).
Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
the left.
## Resources
- [Causal language modeling task guide](../tasks/language_modeling)


@ -71,7 +71,6 @@ An example:
`facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with
[BlenderbotSmall](blenderbot-small).
## Resources
- [Causal language modeling task guide](../tasks/language_modeling)


@ -26,14 +26,14 @@ rendered properly in your Markdown viewer.
The BLIP-2 model was proposed in [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://huggingface.co/papers/2301.12597) by
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer
encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon [Flamingo](https://huggingface.co/papers/2204.14198), an 80 billion parameter model, by 8.7%
on zero-shot VQAv2 with 54x fewer trainable parameters.
The abstract from the paper is the following:
*The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/blip2_architecture.jpg"
alt="drawing" width="600"/>
<small> BLIP-2 architecture. Taken from the <a href="https://huggingface.co/papers/2301.12597">original paper.</a> </small>


@ -25,7 +25,6 @@ rendered properly in your Markdown viewer.
[BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for *both* understanding and generation tasks. Most existing pretrained models are only good at one or the other. It uses a captioner to generate captions and a filter to remove the noisy captions. This increases training data quality and more effectively uses the messy web data.
You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection.
> [!TIP]


@ -48,7 +48,6 @@ See also:
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
⚡️ Inference
- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).


@ -83,7 +83,6 @@ print(tokenizer.decode(generated_ids[0]))
This model was contributed by [itazap](https://huggingface.co/itazap).
The original code can be found [here](https://github.com/facebookresearch/blt).
## BltConfig
[[autodoc]] BltConfig


@ -26,7 +26,7 @@ rendered properly in your Markdown viewer.
The BridgeTower model was proposed in [BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning](https://huggingface.co/papers/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. The goal of this model is to build a
bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder thus achieving remarkable performance on various downstream tasks with almost negligible additional performance and computational costs.
This paper has been accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference.
The abstract from the paper is the following:
@ -54,6 +54,7 @@ The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImagePr
encode the text and prepare the images respectively.
The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`].
```python
>>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
>>> import requests
@ -76,6 +77,7 @@ The following example shows how to run contrastive learning using [`BridgeTowerP
```
The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].
```python
>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
>>> import requests
@ -130,7 +132,6 @@ Tips:
- Please refer to [Table 5](https://huggingface.co/papers/2206.08657) for BridgeTower's performance on Image Retrieval and other downstream tasks.
- The PyTorch version of this model is only available in torch 1.10 and higher.
## BridgeTowerConfig
[[autodoc]] BridgeTowerConfig
@ -177,4 +178,3 @@ Tips:
[[autodoc]] BridgeTowerForImageAndTextRetrieval
- forward


@ -57,7 +57,6 @@ def expand_and_normalize_bbox(bboxes, doc_width, doc_height):
- [`~transformers.BrosForTokenClassification.forward`, `~transformers.BrosSpadeEEForTokenClassification.forward`, `~transformers.BrosSpadeEEForTokenClassification.forward`] require not only `input_ids` and `bbox` but also `box_first_token_mask` for loss calculation. It is a mask to filter out non-first tokens of each box. You can obtain this mask by saving start token indices of bounding boxes when creating `input_ids` from words. You can make `box_first_token_mask` with following code,
```python
def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
@ -102,7 +101,6 @@ def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512):
[[autodoc]] BrosModel
- forward
## BrosForTokenClassification
[[autodoc]] BrosForTokenClassification


@ -50,6 +50,7 @@ from transformers import pipeline
pipeline = pipeline("fill-mask", model="camembert-base", dtype=torch.float16, device=0)
pipeline("Le camembert est un délicieux fromage <mask>.")
```
</hfoption>
<hfoption id="AutoModel">
@ -72,6 +73,7 @@ predicted_token = tokenizer.decode(predicted_token_id)
print(f"The predicted token is: {predicted_token}")
```
</hfoption>
<hfoption id="transformers CLI">
@ -84,7 +86,6 @@ echo -e "Le camembert est un délicieux fromage <mask>." | transformers run --ta
</hfoptions>
Quantization reduces the memory burden of large models by representing weights in lower precision. Refer to the [Quantization](../quantization/overview) overview for available options.
The example below uses [bitsandbytes](../quantization/bitsandbytes) quantization to quantize the weights to 8-bits.
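A minimal sketch of that setup, assuming a CUDA device and the bitsandbytes package:

```py
# Load camembert-base with its weights quantized to 8-bit.
from transformers import AutoModelForMaskedLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForMaskedLM.from_pretrained(
    "camembert-base",
    device_map="auto",
    quantization_config=quantization_config,
)
```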


@ -86,6 +86,7 @@ echo -e "Plant create energy through a process known as photosynthesis." | trans
inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
```
- CANINE is primarily designed to be fine-tuned on a downstream task. The pretrained model can be used for either masked language modeling or next sentence prediction.
## CanineConfig


@ -28,7 +28,6 @@ rendered properly in your Markdown viewer.
The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models
](https://huggingface.co/papers/2405.09818) by META AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including an interleaved format, and generates textual responses. The image generation module is not released yet.
The abstract from the paper is the following:
*We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training
@ -43,7 +42,6 @@ including Gemini Pro and GPT-4V, according to human judgments on a new long-form
generation evaluation, where either the prompt or outputs contain mixed sequences of both images and
text. Chameleon marks a significant step forward in unified modeling of full multimodal documents*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/chameleon_arch.png"
alt="drawing" width="600"/>
@ -52,7 +50,6 @@ alt="drawing" width="600"/>
This model was contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/facebookresearch/chameleon).
## Usage tips
- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating.
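A minimal sketch of that tip (the checkpoint name below is an assumption):

```py
# Pad on the left before running batched generation.
from transformers import ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
processor.tokenizer.padding_side = "left"
```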


@ -47,7 +47,7 @@ can be formulated. Finally, we find our system to adapt well
to generalized queries involving affordances or properties*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/clipseg_architecture.png"
alt="drawing" width="600"/>
<small> CLIPSeg overview. Taken from the <a href="https://huggingface.co/papers/2112.10003">original paper.</a> </small>


@ -29,29 +29,25 @@ The abstract from the paper is the following:
*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.*
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
## Usage tips
1. CLVP is an integral part of the Tortoise TTS model.
2. CLVP can be used to compare different generated speech candidates with the provided text, and the best speech tokens are forwarded to the diffusion model.
3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for tortoise usage.
4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz, unlike other audio models, which expect 16 kHz.
## Brief Explanation:
- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.
- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio.
- The [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates.
- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]) which converts them into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space.
- At the end, we compare each speech vector with the text vector to see which speech vector is most similar to the text vector.
- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method.
Example :
```python
@ -74,7 +70,6 @@ Example :
>>> generated_output = model.generate(**processor_output)
```
## ClvpConfig
[[autodoc]] ClvpConfig
@ -128,4 +123,3 @@ Example :
## ClvpDecoder
[[autodoc]] ClvpDecoder


@ -143,6 +143,7 @@ visualizer("""def func(a, b):
- Infilling is only available in the 7B and 13B base models, and not in the Python, Instruct, 34B, or 70B models.
- Use the `<FILL_ME>` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself.
```py
from transformers import LlamaForCausalLM, CodeLlamaTokenizer
@ -158,6 +159,7 @@ visualizer("""def func(a, b):
filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0]
print(PROMPT.replace("<FILL_ME>", filling))
```
- Use `bfloat16` for further training or fine-tuning and `float16` for inference.
- The `BOS` character is not used for infilling when encoding the prefix or suffix, but only at the beginning of each prompt.
- The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of the word (for example, “Banana”), the tokenizer doesn't prepend the prefix space to the string.


@ -29,7 +29,7 @@ CodeGen is an autoregressive language model for program synthesis trained sequen
The abstract from the paper is the following:
*Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI's Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: [this https URL](https://github.com/salesforce/codegen).*
This model was contributed by [Hiroaki Hayashi](https://huggingface.co/rooa).
The original code can be found [here](https://github.com/salesforce/codegen).
@ -39,7 +39,7 @@ The original code can be found [here](https://github.com/salesforce/codegen).
* CodeGen model [checkpoints](https://huggingface.co/models?other=codegen) are available on different pre-training data with variable sizes.
* The format is: `Salesforce/codegen-{size}-{data}`, where
* `size`: `350M`, `2B`, `6B`, `16B`
* `data`:
* `nl`: Pre-trained on the Pile
* `multi`: Initialized with `nl`, then further pre-trained on multiple programming languages data
* `mono`: Initialized with `multi`, then further pre-trained on Python data
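As a quick illustration of this naming scheme, a minimal sketch that loads the 350M checkpoint further pre-trained on Python (`mono`):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
```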


@ -22,14 +22,12 @@ rendered properly in your Markdown viewer.
</div>
</div>
# Cohere
Cohere [Command-R](https://cohere.com/blog/command-r) is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens.
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
> [!TIP]
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
@ -123,7 +121,6 @@ visualizer("Plants create energy through a process known as")
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/cohere-attn-mask.png"/>
</div>
## Notes
- Don't use the dtype parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
@ -145,7 +142,6 @@ visualizer("Plants create energy through a process known as")
[[autodoc]] CohereModel
- forward
## CohereForCausalLM
[[autodoc]] CohereForCausalLM


@ -22,7 +22,6 @@ rendered properly in your Markdown viewer.
</div>
</div>
# Cohere 2
[Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k. The model features three layers with sliding window attention and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
@ -31,7 +30,6 @@ This model is optimized for speed, cost-performance, and compute resources.
You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
> [!TIP]
> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
@ -136,7 +134,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
[[autodoc]] Cohere2Model
- forward
## Cohere2ForCausalLM
[[autodoc]] Cohere2ForCausalLM


@ -113,6 +113,7 @@ outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
print(outputs)
```
</hfoption>
</hfoptions>


@ -42,7 +42,6 @@ NLP tasks in the settings of few-shot (even zero-shot) learning.*
This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
here: https://github.com/TsinghuaAI/CPM-Generate
<Tip>
CPM's architecture is the same as GPT-2, except for tokenization method. Refer to [GPT-2 documentation](gpt2) for
@ -50,7 +49,6 @@ API reference information.
</Tip>
## CpmTokenizer
[[autodoc]] CpmTokenizer


@ -45,7 +45,7 @@ This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The ori
[[autodoc]] CpmAntModel
- all
## CpmAntForCausalLM
[[autodoc]] CpmAntForCausalLM


@ -346,7 +346,6 @@ out.loss.backward()
This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
The original code can be found [here](https://github.com/SesameAILabs/csm).
## CsmConfig
[[autodoc]] CsmConfig


@ -55,7 +55,6 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis
pre-computed values in the context of text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward)
method for more information on the usage of this argument.
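A minimal sketch of generation with cached past key values (the checkpoint name and prompt are assumptions):

```py
# `use_cache=True` reuses the pre-computed key/value states between generation steps.
from transformers import AutoTokenizer, CTRLLMHeadModel

tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")
inputs = tokenizer("Opinion My dream job is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(outputs[0]))
```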
## Resources
- [Text classification task guide](../tasks/sequence_classification)


@ -24,13 +24,13 @@ Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu
The abstract from the paper is the following:
*We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.*
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
The original code can be found [here](https://github.com/Peterande/D-FINE).
## Usage tips
```python
>>> import torch


@ -77,7 +77,9 @@ for result in results:
box = [round(i, 2) for i in box.tolist()]
print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
This should output
```
cat: 0.87 [14.7, 49.39, 320.52, 469.28]
remote: 0.86 [41.08, 72.37, 173.39, 117.2]
@ -89,6 +91,7 @@ couch: 0.59 [-0.04, 1.34, 639.9, 477.09]
There are three other ways to instantiate a DAB-DETR model (depending on what you prefer):
Option 1: Instantiate DAB-DETR with pre-trained weights for entire model
```py
>>> from transformers import DabDetrForObjectDetection
@ -96,19 +99,21 @@ Option 1: Instantiate DAB-DETR with pre-trained weights for entire model
```
Option 2: Instantiate DAB-DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone
```py
>>> from transformers import DabDetrConfig, DabDetrForObjectDetection
>>> config = DabDetrConfig()
>>> model = DabDetrForObjectDetection(config)
```
Option 3: Instantiate DAB-DETR with randomly initialized weights for backbone + Transformer
```py
>>> config = DabDetrConfig(use_pretrained_backbone=False)
>>> model = DabDetrForObjectDetection(config)
```
## DabDetrConfig
[[autodoc]] DabDetrConfig


@ -23,7 +23,6 @@ rendered properly in your Markdown viewer.
## Overview
The DAC model was proposed in [Descript Audio Codec: High-Fidelity Audio Compression with Improved RVQGAN](https://huggingface.co/papers/2306.06546) by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar.
The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
@ -35,7 +34,6 @@ The abstract from the paper is the following:
This model was contributed by [Kamil Akesbi](https://huggingface.co/kamilakesbi).
The original code can be found [here](https://github.com/descriptinc/descript-audio-codec/tree/main?tab=readme-ov-file).
## Model structure
The Descript Audio Codec (DAC) model is structured into three distinct stages:
@ -44,11 +42,11 @@ The Descript Audio Codec (DAC) model is structured into three distinct stages:
2. Residual Vector Quantizer (RVQ) Model: Working in tandem with the encoder, this model quantizes the latent codes of the audio, refining the compression and ensuring high-quality reconstruction.
3. Decoder Model: This final stage reconstructs the audio from its compressed form, restoring it to a state that closely resembles the original input.
## Usage example
Here is a quick example of how to encode and decode an audio sample using this model:
```python
>>> from datasets import load_dataset, Audio
>>> from transformers import DacModel, AutoProcessor
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")


@ -35,7 +35,6 @@ We estimate that this data is at least 2x better token-for-token than the data w
This new dataset was developed using the full suite of Databricks tools, including Apache Spark™ and Databricks notebooks for data processing, and Unity Catalog for data management and governance.
We used curriculum learning for pretraining, changing the data mix during training in ways we found to substantially improve model quality.
More detailed information about DBRX Instruct and DBRX Base can be found in our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
This model was contributed by [eitan-turok](https://huggingface.co/eitanturok) and [abhi-db](https://huggingface.co/abhi-db). The original code can be found [here](https://github.com/databricks/dbrx-instruct), though this may not be up to date.
@ -65,6 +64,7 @@ print(tokenizer.decode(outputs[0]))
```
If you have flash-attention installed (`pip install flash-attn`), it is possible to generate faster. (The HuggingFace documentation for flash-attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2).)
```python
from transformers import DbrxForCausalLM, AutoTokenizer
import torch
@ -87,6 +87,7 @@ print(tokenizer.decode(outputs[0]))
```
You can also generate faster using the PyTorch scaled dot product attention. (The HuggingFace documentation for scaled dot product attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).)
```python
from transformers import DbrxForCausalLM, AutoTokenizer
import torch
@ -112,15 +113,12 @@ print(tokenizer.decode(outputs[0]))
[[autodoc]] DbrxConfig
## DbrxModel
[[autodoc]] DbrxModel
- forward
## DbrxForCausalLM
[[autodoc]] DbrxForCausalLM
- forward


@ -21,14 +21,12 @@ rendered properly in your Markdown viewer.
</div>
</div>
# DeBERTa-v2
[DeBERTa-v2](https://huggingface.co/papers/2006.03654) improves on the original [DeBERTa](./deberta) architecture by using a SentencePiece-based tokenizer and a new vocabulary size of 128K. It also adds an additional convolutional layer within the first transformer layer to better learn local dependencies of input tokens. Finally, the position projection and content projection matrices are shared in the attention layer to reduce the number of parameters.
You can find all the original DeBERTa-v2 checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta-v2) organization.
> [!TIP]
> This model was contributed by [Pengcheng He](https://huggingface.co/DeBERTa).
>
@ -86,6 +84,7 @@ print(f"Predicted label: {predicted_label}")
```bash
echo -e "DeBERTa-v2 is great at understanding context!" | transformers run --task fill-mask --model microsoft/deberta-v2-xlarge-mnli --device 0
```
</hfoption>
</hfoptions>
@ -119,7 +118,6 @@ print(f"Predicted label: {predicted_label}")
```
## DebertaV2Config
[[autodoc]] DebertaV2Config


@ -31,7 +31,6 @@ Even with less training data than RoBERTa, DeBERTa manages to outperform it on s
You can find all the original DeBERTa checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta) organization.
> [!TIP]
> Click on the DeBERTa models in the right sidebar for more examples of how to apply DeBERTa to different language tasks.


@ -28,14 +28,14 @@ by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael La
The abstract from the paper is the following:
*We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances
in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that
casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or
compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked
Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our
Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity,
Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on
Atari, OpenAI Gym, and Key-to-Door tasks.*
This version of the model is for tasks where the state is a vector.
@ -46,7 +46,6 @@ This model was contributed by [edbeeching](https://huggingface.co/edbeeching). T
[[autodoc]] DecisionTransformerConfig
## DecisionTransformerGPT2Model
[[autodoc]] DecisionTransformerGPT2Model


@ -26,17 +26,17 @@ We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 67
## Limitations and call for contribution!
We are super happy to make this code community-powered, and would love to see how you can best optimize the following:
- current implementation uses the "naive" attention computation (so not really MLA)
- current implementation loops through the experts. This should be replaced. Pointers to use `get_packed_weights` from `integrations/tensor_parallel`.
- current implementation uses the eleuther formula for ROPE, using the original one would be more efficient! (should still follow our API)
- static cache is not supported (this should be just a generation config issue / config shape issues)
### Usage tips
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
You can run the model in `FP8` automatically; 2 nodes of 8 H100s should be more than enough!
```python
# `run_deepseek_v1.py`
@ -61,7 +61,8 @@ outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
print(time.time()-start)
```
This generated:
``````
<Assistant><think>
@ -157,18 +158,20 @@ Want to dive deeper or see a specific frameworks implementation (e.g., OpenAI
``````
Use the following to run it
```bash
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
```
If you have:
```bash
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found
```
error, it means NCCL was probably not loaded.
## DeepseekV3Config


@ -63,6 +63,7 @@ messages = [
pipe(text=messages, max_new_tokens=20, return_full_text=False)
```
</hfoption>
<hfoption id="AutoModel">
@ -115,6 +116,7 @@ output_text = processor.batch_decode(
print(output_text)
```
</hfoption>
</hfoptions>
@ -138,9 +140,11 @@ model = DeepseekVLForConditionalGeneration.from_pretrained(
quantization_config=quantization_config
)
```
### Notes
- Do inference with multiple images in a single conversation.
```py
import torch
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor


@ -62,6 +62,7 @@ messages = [
pipe(text=messages, max_new_tokens=20, return_full_text=False)
```
</hfoption>
<hfoption id="AutoModel">
@ -114,6 +115,7 @@ output_text = processor.batch_decode(
print(output_text)
```
</hfoption>
</hfoptions>
@ -137,9 +139,11 @@ model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
quantization_config=quantization_config
)
```
### Notes
- Do inference with multiple images in a single conversation.
```py
import torch
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor


@ -21,7 +21,7 @@ rendered properly in your Markdown viewer.
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
## Overview
DePlot was proposed in the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://huggingface.co/papers/2212.10505) from Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun.
@ -36,8 +36,7 @@ DePlot is a Visual Question Answering subset of `Pix2Struct` architecture. It re
Currently one checkpoint is available for DePlot:
- `google/deplot`: DePlot fine-tuned on ChartQA dataset
```python
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
@ -57,6 +56,7 @@ print(processor.decode(predictions[0], skip_special_tokens=True))
## Fine-tuning
To fine-tune DePlot, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faster convergence:
```python
from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup


@ -102,12 +102,14 @@ The network is supplemented with a focal length estimation head. A small convolu
The `use_fov_model` parameter in `DepthProConfig` controls whether **FOV prediction** is enabled. By default, it is set to `False` to conserve memory and computation. When enabled, the **FOV encoder** is instantiated based on the `fov_model_config` parameter, which defaults to a `Dinov2Model`. The `use_fov_model` parameter can also be passed when initializing the `DepthProForDepthEstimation` model.
The pretrained model at checkpoint `apple/DepthPro-hf` uses the FOV encoder. To use the pretrained-model without FOV encoder, set `use_fov_model=False` when loading the model, which saves computation.
```py
>>> from transformers import DepthProForDepthEstimation
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)
```
To instantiate a new model with FOV encoder, set `use_fov_model=True` in the config.
```py
>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig(use_fov_model=True)
@ -115,6 +117,7 @@ To instantiate a new model with FOV encoder, set `use_fov_model=True` in the con
```
Or set `use_fov_model=True` when initializing the model, which overrides the value in config.
```py
>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig()
@ -123,13 +126,13 @@ Or set `use_fov_model=True` when initializing the model, which overrides the val
### Using Scaled Dot Product Attention (SDPA)
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
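A minimal sketch of requesting SDPA explicitly for DepthPro (mirroring the checkpoint used earlier on this page):

```py
import torch
from transformers import DepthProForDepthEstimation

model = DepthProForDepthEstimation.from_pretrained(
    "apple/DepthPro-hf", attn_implementation="sdpa", dtype=torch.float16
)
```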
```py


@ -113,6 +113,7 @@ DETR can be naturally extended to perform panoptic segmentation (which unifies s
There are three other ways to instantiate a DETR model (depending on what you prefer):
- Option 1: Instantiate DETR with pre-trained weights for entire model
```python
from transformers import DetrForObjectDetection
@ -120,6 +121,7 @@ model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
```
- Option 2: Instantiate DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone
```python
from transformers import DetrConfig, DetrForObjectDetection
@ -128,6 +130,7 @@ model = DetrForObjectDetection(config)
```
- Option 3: Instantiate DETR with randomly initialized weights for backbone + Transformer
```python
config = DetrConfig(use_pretrained_backbone=False)
model = DetrForObjectDetection(config)
@ -144,7 +147,7 @@ As a summary, consider the following table:
| **Postprocessing** (i.e. converting the output of the model to Pascal VOC format) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] |
| **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"`, `PanopticEvaluator` |
- In short, one should prepare the data either in COCO detection or COCO panoptic format, then use [`~transformers.DetrImageProcessor`] to create `pixel_values`, `pixel_mask` and optional `labels`, which can then be used to train (or fine-tune) a model.
- For evaluation, one should first convert the outputs of the model using one of the postprocessing methods of [`~transformers.DetrImageProcessor`]. These can be provided to either `CocoEvaluator` or `PanopticEvaluator`, which allow you to calculate metrics like mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the [original repository](https://github.com/facebookresearch/detr). See the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for more info regarding evaluation.
## Resources


@ -117,11 +117,9 @@ out = model(**inputs)
out.loss.backward()
```
This model was contributed by [Jaeyong Sung](https://huggingface.co/buttercrab), [Arthur Zucker](https://huggingface.co/ArthurZ),
and [Anton Vlasjuk](https://huggingface.co/AntonV). The original code can be found [here](https://github.com/nari-labs/dia/).
## DiaConfig
[[autodoc]] DiaConfig


@ -35,7 +35,6 @@ The abstract from the paper is the following:
### Usage tips
The hyperparameters of this model are the same as those of the Llama model.
## DiffLlamaConfig
[[autodoc]] DiffLlamaConfig


@ -19,7 +19,6 @@ specific language governing permissions and limitations under the License.
</div>
</div>
# DINOv2
[DINOv2](https://huggingface.co/papers/2304.07193) is a vision foundation model that uses [ViT](./vit) as a feature extractor for multiple downstream tasks like image classification and depth estimation. It focuses on stabilizing and accelerating training through techniques like a faster memory-efficient attention, sequence packing, improved stochastic depth, Fully Sharded Data Parallel (FSDP), and model distillation.


@ -45,7 +45,6 @@ Tips:
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/facebookresearch/dinov2).
## Dinov2WithRegistersConfig
[[autodoc]] Dinov2WithRegistersConfig


@ -19,7 +19,6 @@ specific language governing permissions and limitations under the License.
</div>
</div>
# DINOv3
[DINOv3](https://huggingface.co/papers/2508.10104) is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.


@ -85,6 +85,7 @@ print(f"The predicted class label is: {predicted_class_label}")
## Notes
- The pretrained DiT weights can be loaded in a [BEiT](./beit) model with a modeling head to predict visual tokens.
```py
from transformers import BeitForMaskedImageModeling


@ -17,7 +17,6 @@ rendered properly in your Markdown viewer.
# Doge
## Overview
Doge is a series of small language models based on the [Doge](https://github.com/SmallDoges/small-doge) architecture. It aims to combine the advantages of state-space and self-attention algorithms, computes dynamic masks from cached value states using the zero-order hold method, and addresses the problem of existing mainstream language models getting lost in context. It uses the `wsd_scheduler` scheduler to pre-train on the `smollm-corpus`, and can continue training on new datasets or add sparse activation feedforward networks from stable-stage checkpoints.
@ -28,7 +27,6 @@ As shown in the figure below, the sequence transformation part of the Doge archi
Check out all Doge model checkpoints [here](https://huggingface.co/collections/SmallDoges/doge-slm-679cc991f027c4a3abbded4a).
## Usage
<details>
@ -44,6 +42,7 @@ inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs))
```
</details>
<details>
@ -82,6 +81,7 @@ outputs = model.generate(
streamer=steamer
)
```
</details>
## DogeConfig


@ -22,7 +22,7 @@ specific language governing permissions and limitations under the License. -->
# Donut
[Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies, making it more accurate and adaptable to diverse languages and formats.
Donut features a vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart)). Swin converts document images into embeddings and BART processes them into meaningful text sequences.


@ -25,7 +25,6 @@ The abstract from the report is the following:
*Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.*
## Dots1Config
[[autodoc]] Dots1Config


@ -45,6 +45,7 @@ results = keypoint_matcher([url_0, url_1], threshold=0.9)
print(results[0])
# {'keypoint_image_0': {'x': ..., 'y': ...}, 'keypoint_image_1': {'x': ..., 'y': ...}, 'score': ...}
```
</hfoption>
<hfoption id="AutoModel">
@ -167,4 +168,3 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
[[autodoc]] EfficientLoFTRForKeypointMatching
- forward


@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
## Overview
The EfficientNet model was proposed in [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://huggingface.co/papers/1905.11946)
by Mingxing Tan and Quoc V. Le. EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.
The abstract from the paper is the following:
@ -34,7 +34,6 @@ To go even further, we use neural architecture search to design a new baseline n
This model was contributed by [adirik](https://huggingface.co/adirik).
The original code can be found [here](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).
## EfficientNetConfig
[[autodoc]] EfficientNetConfig
@ -58,4 +57,3 @@ The original code can be found [here](https://github.com/tensorflow/tpu/tree/mas
[[autodoc]] EfficientNetForImageClassification
- forward


@ -27,8 +27,7 @@ rendered properly in your Markdown viewer.
The Emu3 model was proposed in [Emu3: Next-Token Prediction is All You Need](https://huggingface.co/papers/2409.18869) by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 is a multimodal LLM that uses vector quantization to tokenize images into discrete tokens. Discretized image tokens are later fused with text token ids for image and text generation. The model can additionally generate images by predicting image token ids.
The abstract from the paper is the following:
@ -45,11 +44,9 @@ Tips:
> [!TIP]
> Emu3 implementation in Transformers uses a special image token to indicate where to merge image embeddings. The special image token isn't new and uses one of the reserved tokens: `<|extra_0|>`. You have to add `<image>` to your prompt in the place where the image should be embedded for correct generation.
This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/baaivision/Emu3).
## Usage example
### Text generation inference
@ -143,7 +140,6 @@ for i, image in enumerate(images['pixel_values']):
```
## Emu3Config
[[autodoc]] Emu3Config

Some files were not shown because too many files have changed in this diff Show More