
diff --git a/docs/source/en/model_doc/pop2piano.md b/docs/source/en/model_doc/pop2piano.md
index 5f68b180500..90e0cd3f063 100644
--- a/docs/source/en/model_doc/pop2piano.md
+++ b/docs/source/en/model_doc/pop2piano.md
@@ -21,14 +21,14 @@ specific language governing permissions and limitations under the License.
The Pop2Piano model was proposed in [Pop2Piano : Pop Audio-based Piano Cover Generation](https://huggingface.co/papers/2211.00895) by Jongho Choi and Kyogu Lee.
-Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great
-expertise with playing piano as well as knowing different characteristics and melodies of a song. With Pop2Piano you
-can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover
-from pop audio without melody and chord extraction modules.
+Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great
+expertise with playing piano as well as knowing different characteristics and melodies of a song. With Pop2Piano you
+can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover
+from pop audio without melody and chord extraction modules.
-Pop2Piano is an encoder-decoder Transformer model based on [T5](https://huggingface.co/papers/1910.10683). The input audio
-is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder
-uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four
+Pop2Piano is an encoder-decoder Transformer model based on [T5](https://huggingface.co/papers/1910.10683). The input audio
+is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder
+uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four
different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file.
The abstract from the paper is the following:
@@ -53,9 +53,11 @@ The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
## Usage tips
* To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules:
+
```bash
pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
```
+
Please note that you may need to restart your runtime after installation.
* Pop2Piano is an Encoder-Decoder based model like T5.
* Pop2Piano can be used to generate midi-audio files for a given audio sequence.
@@ -131,7 +133,6 @@ Please note that you may need to restart your runtime after installation.
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
```
-
- Example of processing multiple audio files in batch (Using `Pop2PianoFeatureExtractor` and `Pop2PianoTokenizer`):
```python
@@ -166,7 +167,6 @@ Please note that you may need to restart your runtime after installation.
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
```
-
## Pop2PianoConfig
[[autodoc]] Pop2PianoConfig
diff --git a/docs/source/en/model_doc/prompt_depth_anything.md b/docs/source/en/model_doc/prompt_depth_anything.md
index 5af13c5d630..0ac26609b4d 100644
--- a/docs/source/en/model_doc/prompt_depth_anything.md
+++ b/docs/source/en/model_doc/prompt_depth_anything.md
@@ -19,8 +19,7 @@ rendered properly in your Markdown viewer.
## Overview
-The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation](https://huggingface.co/papers/2412.14015) by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang.
-
+The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation](https://huggingface.co/papers/2412.14015) by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang.
The abstract from the paper is as follows:
diff --git a/docs/source/en/model_doc/pvt.md b/docs/source/en/model_doc/pvt.md
index e7902affe5f..38858db5552 100644
--- a/docs/source/en/model_doc/pvt.md
+++ b/docs/source/en/model_doc/pvt.md
@@ -29,23 +29,22 @@ is used to further reduce the resource consumption when learning high-resolution
The abstract from the paper is the following:
-*Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a
-simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision
-Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer
-(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several
-merits compared to current state of the arts. Different from ViT that typically yields low resolution outputs and
-incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high
-output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the
-computations of large feature maps. PVT inherits the advantages of both CNN and Transformer, making it a unified
-backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones.
+*Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a
+simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision
+Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer
+(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several
+merits compared to current state of the arts. Different from ViT that typically yields low resolution outputs and
+incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high
+output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the
+computations of large feature maps. PVT inherits the advantages of both CNN and Transformer, making it a unified
+backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones.
We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including
-object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet
-achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope
+object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet
+achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope
that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.*
This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT).
-
- PVTv1 on ImageNet-1K
| **Model variant** |**Size** |**Acc@1**|**Params (M)**|
@@ -55,7 +54,6 @@ This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The origi
| PVT-Medium | 224 | 81.2 | 44.2 |
| PVT-Large | 224 | 81.7 | 61.4 |
-
## PvtConfig
[[autodoc]] PvtConfig
diff --git a/docs/source/en/model_doc/pvt_v2.md b/docs/source/en/model_doc/pvt_v2.md
index 0d0ee3cca75..5be8998f4cc 100644
--- a/docs/source/en/model_doc/pvt_v2.md
+++ b/docs/source/en/model_doc/pvt_v2.md
@@ -26,7 +26,7 @@ The PVTv2 encoder structure has been successfully deployed to achieve state-of-t
PVTv2 belongs to a family of models called [hierarchical transformers](https://natecibik.medium.com/the-rise-of-vision-transformers-f623c980419f), which adapt transformer layers to generate multi-scale feature maps. Unlike the columnar structure of Vision Transformer ([ViT](https://huggingface.co/papers/2010.11929)), which loses fine-grained detail, multi-scale feature maps are known to preserve this detail and aid performance in dense prediction tasks. In the case of PVTv2, this is achieved by generating image patch tokens using 2D convolution with overlapping kernels in each encoder layer.
-The multi-scale features of hierarchical transformers allow them to be easily swapped in for traditional workhorse computer vision backbone models like ResNet in larger architectures. Both Segformer and Panoptic Segformer demonstrated that configurations using PVTv2 for a backbone consistently outperformed those with similarly sized ResNet backbones.
+The multi-scale features of hierarchical transformers allow them to be easily swapped in for traditional workhorse computer vision backbone models like ResNet in larger architectures. Both Segformer and Panoptic Segformer demonstrated that configurations using PVTv2 as a backbone consistently outperformed those with similarly sized ResNet backbones.
Another powerful feature of the PVTv2 is the complexity reduction in the self-attention layers called Spatial Reduction Attention (SRA), which uses 2D convolution layers to project hidden states to a smaller resolution before attending to them with the queries, improving the $O(n^2)$ complexity of self-attention to $O(n^2/R)$, with $R$ being the spatial reduction ratio (`sr_ratio`, aka kernel size and stride in the 2D convolution).
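+
+To make the mechanism concrete, here is a minimal sketch of spatial reduction attention in plain PyTorch. It only illustrates the idea, not the actual `PvtV2` modules, and all shapes and the `sr_ratio` value are made up for the example:
+
+```python
+import torch
+import torch.nn as nn
+
+batch, height, width, dim, sr_ratio = 1, 32, 32, 96, 4
+hidden_states = torch.randn(batch, height * width, dim)  # 1024 tokens
+
+# Reshape tokens back into an image grid and shrink it with a strided 2D convolution.
+spatial_reduction = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
+grid = hidden_states.transpose(1, 2).reshape(batch, dim, height, width)
+reduced = spatial_reduction(grid).flatten(2).transpose(1, 2)  # 64 tokens
+
+# Queries keep the full resolution; keys and values come from the reduced sequence,
+# so the attention matrix is 1024 x 64 instead of 1024 x 1024.
+attention = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
+output, _ = attention(hidden_states, reduced, reduced)
+print(output.shape)  # torch.Size([1, 1024, 96])
+```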
@@ -48,6 +48,7 @@ This model was contributed by [FoamoftheSea](https://huggingface.co/FoamoftheSea
- ImageNet pretrained weights for all model sizes can be found on the [hub](https://huggingface.co/models?other=pvt_v2).
The best way to get started with the PVTv2 is to load the pretrained checkpoint with the size of your choosing using `AutoModelForImageClassification`:
+
```python
import requests
import torch
@@ -99,7 +100,6 @@ outputs = model(torch.tensor(processed["pixel_values"]))
| PVT-V2-B4 | 224 | 83.6 | 62.6 |
| PVT-V2-B5 | 224 | 83.8 | 82.0 |
-
## PvtV2Config
[[autodoc]] PvtV2Config
diff --git a/docs/source/en/model_doc/qwen2.md b/docs/source/en/model_doc/qwen2.md
index 3f872302cc2..feeb69959b2 100644
--- a/docs/source/en/model_doc/qwen2.md
+++ b/docs/source/en/model_doc/qwen2.md
@@ -142,7 +142,6 @@ outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
-
## Notes
- Ensure your Transformers library version is up-to-date. Qwen2 requires Transformers>=4.37.0 for full support.
diff --git a/docs/source/en/model_doc/qwen2_5_omni.md b/docs/source/en/model_doc/qwen2_5_omni.md
index e124f7cdb42..7a0836592d4 100644
--- a/docs/source/en/model_doc/qwen2_5_omni.md
+++ b/docs/source/en/model_doc/qwen2_5_omni.md
@@ -31,8 +31,6 @@ The abstract from the technical report is the following:
*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*
-
-
## Notes
- Use [`Qwen2_5OmniForConditionalGeneration`] to generate audio and text output. To generate only one output type, use [`Qwen2_5OmniThinkerForConditionalGeneration`] for text-only and [`Qwen2_5OmniTalkersForConditionalGeneration`] for audio-only outputs.
@@ -40,7 +38,6 @@ The abstract from the technical report is the following:
- In case of out-of-memory errors when working with video input, decrease `processor.max_pixels`. By default the maximum is set to a very large value and high-resolution visuals will not be resized unless their resolution exceeds `processor.max_pixels`.
- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.
-
## Usage example
`Qwen2.5-Omni` can be found on the [Huggingface Hub](https://huggingface.co/Qwen).
@@ -275,6 +272,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min
#### Prompt for audio output
If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
+
```
{
"role": "system",
@@ -285,6 +283,7 @@ If users need audio output, the system prompt must be set as "You are Qwen, a vi
#### Use audio output or not
The model supports both text and audio outputs. If users do not need audio output, they can set `enable_audio_output=False` in the `from_pretrained` function. This option saves about ~2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.
+
```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
@@ -341,8 +340,6 @@ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
)
```
-
-
## Qwen2_5OmniConfig
[[autodoc]] Qwen2_5OmniConfig
diff --git a/docs/source/en/model_doc/qwen2_5_vl.md b/docs/source/en/model_doc/qwen2_5_vl.md
index 62527ea4963..7f682bf8020 100644
--- a/docs/source/en/model_doc/qwen2_5_vl.md
+++ b/docs/source/en/model_doc/qwen2_5_vl.md
@@ -26,7 +26,6 @@ rendered properly in your Markdown viewer.
[Qwen2.5-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model, available in 3B, 7B, and 72B parameters, pretrained on 4.1T tokens. The model introduces window attention in the ViT encoder to accelerate training and inference, dynamic FPS sampling on the spatial and temporal dimensions for better video understanding across different sampling rates, and an upgraded MRoPE (multi-resolutional rotary positional encoding) mechanism to better capture and learn temporal dynamics.
-
You can find all the original Qwen2.5-VL checkpoints under the [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) collection.
> [!TIP]
@@ -61,6 +60,7 @@ messages = [
pipe(text=messages,max_new_tokens=20, return_full_text=False)
```
+
@@ -110,6 +110,7 @@ output_text = processor.batch_decode(
)
print(output_text)
```
+
@@ -130,9 +131,11 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
)
```
+
### Notes
- Use Qwen2.5-VL for video inputs by setting `"type": "video"` as shown below.
+
```python
conversation = [
{
@@ -159,8 +162,10 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```
+
- Use Qwen2.5-VL for a mixed batch of inputs (images, videos, text). Add labels when handling multiple images or videos for better reference, as shown below.
+
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
@@ -221,14 +226,15 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
max_pixels = 2048*2048
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
```
-
+
Higher resolution can require more compute whereas reducing the resolution can save memory as follows:
-
+
```python
min_pixels = 256*28*28
max_pixels = 1024*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
```
+
## Qwen2_5_VLConfig
[[autodoc]] Qwen2_5_VLConfig
diff --git a/docs/source/en/model_doc/qwen2_audio.md b/docs/source/en/model_doc/qwen2_audio.md
index 7cdcd52119c..9b9dd43a919 100644
--- a/docs/source/en/model_doc/qwen2_audio.md
+++ b/docs/source/en/model_doc/qwen2_audio.md
@@ -36,7 +36,6 @@ The abstract from the paper is the following:
*We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users could provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community. *
-
## Usage tips
`Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` can be found on the [Huggingface Hub](https://huggingface.co/Qwen)
@@ -79,6 +78,7 @@ In the following, we demonstrate how to use `Qwen2-Audio-7B-Instruct` for the in
### Voice Chat Inference
In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input:
+
```python
from io import BytesIO
from urllib.request import urlopen
@@ -119,6 +119,7 @@ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_
### Audio Analysis Inference
In the audio analysis, users could provide both audio and text instructions for analysis:
+
```python
from io import BytesIO
from urllib.request import urlopen
@@ -167,6 +168,7 @@ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_
### Batch Inference
We also support batch inference:
+
```python
from io import BytesIO
from urllib.request import urlopen
diff --git a/docs/source/en/model_doc/qwen2_moe.md b/docs/source/en/model_doc/qwen2_moe.md
index b8a3fe65d31..9d55de63e16 100644
--- a/docs/source/en/model_doc/qwen2_moe.md
+++ b/docs/source/en/model_doc/qwen2_moe.md
@@ -24,7 +24,6 @@ rendered properly in your Markdown viewer.
# Qwen2MoE
-
[Qwen2MoE](https://huggingface.co/papers/2407.10671) is a Mixture-of-Experts (MoE) variant of [Qwen2](./qwen2), available as a base model and an aligned chat model. It uses SwiGLU activation, group query attention and a mixture of sliding window attention and full attention. The tokenizer can also be adapted to multiple languages and codes.
The MoE architecture uses models upcycled from dense language models. For example, Qwen1.5-MoE-A2.7B is upcycled from Qwen-1.8B. It has 14.3B parameters, but only 2.7B of them are activated during runtime.
@@ -57,6 +56,7 @@ messages = [
outputs = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"][-1]['content'])
```
+
@@ -100,14 +100,14 @@ generated_ids = [
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
-
+
+
```bash
transformers chat Qwen/Qwen1.5-MoE-A2.7B-Chat --dtype auto --attn_implementation flash_attention_2
```
-
-
+
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
diff --git a/docs/source/en/model_doc/qwen2_vl.md b/docs/source/en/model_doc/qwen2_vl.md
index 8ff09ca5723..59dc25b5e08 100644
--- a/docs/source/en/model_doc/qwen2_vl.md
+++ b/docs/source/en/model_doc/qwen2_vl.md
@@ -25,7 +25,7 @@ rendered properly in your Markdown viewer.
## Overview
-The [Qwen2-VL](https://huggingface.co/papers/2409.12191) ([blog post](https://qwenlm.github.io/blog/qwen2-vl/)) model is a major update to [Qwen-VL](https://huggingface.co/papers/2308.12966) from the Qwen team at Alibaba Research.
+The [Qwen2-VL](https://huggingface.co/papers/2409.12191) ([blog post](https://qwenlm.github.io/blog/qwen2-vl/)) model is a major update to [Qwen-VL](https://huggingface.co/papers/2308.12966) from the Qwen team at Alibaba Research.
The abstract from the blog is the following:
@@ -203,8 +203,8 @@ min_pixels = 256*28*28
max_pixels = 1024*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
```
-This ensures each image gets encoded using a number between 256-1024 tokens. The 28 comes from the fact that the model uses a patch size of 14 and a temporal patch size of 2 (14 x 2 = 28).
+This ensures each image gets encoded using between 256 and 1024 tokens. The 28 comes from the fact that the model uses a patch size of 14 and merges 2 x 2 patches into a single token (14 x 2 = 28).
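+
+As a quick back-of-the-envelope check (assuming each image token covers a 28 x 28 pixel area, as described above), the pixel budget translates directly into a token budget:
+
+```python
+min_pixels = 256 * 28 * 28   # 200,704 pixels
+max_pixels = 1024 * 28 * 28  # 802,816 pixels
+
+tokens = lambda pixels: pixels // (28 * 28)
+print(tokens(min_pixels), tokens(max_pixels))  # 256 1024
+```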
#### Multiple Image Inputs
@@ -307,7 +307,7 @@ model = Qwen2VLForConditionalGeneration.from_pretrained(
[[autodoc]] Qwen2VLTextModel
- forward
-
+
## Qwen2VLModel
[[autodoc]] Qwen2VLModel
diff --git a/docs/source/en/model_doc/qwen3.md b/docs/source/en/model_doc/qwen3.md
index 87e6ba500f9..0141388fb97 100644
--- a/docs/source/en/model_doc/qwen3.md
+++ b/docs/source/en/model_doc/qwen3.md
@@ -25,7 +25,6 @@ rendered properly in your Markdown viewer.
To be released with the official model launch.
-
## Usage tips
To be released with the official model launch.
diff --git a/docs/source/en/model_doc/qwen3_omni_moe.md b/docs/source/en/model_doc/qwen3_omni_moe.md
index 04d77534f64..cd5506802d5 100644
--- a/docs/source/en/model_doc/qwen3_omni_moe.md
+++ b/docs/source/en/model_doc/qwen3_omni_moe.md
@@ -31,8 +31,6 @@ The abstract from the technical report is the following:
*We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.*
-
-
## Notes
- Use [`Qwen2_5OmniForConditionalGeneration`] to generate audio and text output. To generate only one output type, use [`Qwen2_5OmniThinkerForConditionalGeneration`] for text-only and [`Qwen2_5OmniTalkersForConditionalGeneration`] for audio-only outputs.
@@ -40,7 +38,6 @@ The abstract from the technical report is the following:
- In case of out-of-memory errors when working with video input, decrease `processor.max_pixels`. By default the maximum is set to a very large value and high-resolution visuals will not be resized unless their resolution exceeds `processor.max_pixels`.
- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.
-
## Usage example
`Qwen2.5-Omni` can be found on the [Huggingface Hub](https://huggingface.co/Qwen).
@@ -275,6 +272,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min
#### Prompt for audio output
If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
+
```
{
"role": "system",
@@ -285,6 +283,7 @@ If users need audio output, the system prompt must be set as "You are Qwen, a vi
#### Use audio output or not
The model supports both text and audio outputs. If users do not need audio output, they can set `enable_audio_output=False` in the `from_pretrained` function. This option saves about ~2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.
+
```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
@@ -341,8 +340,6 @@ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
)
```
-
-
## Qwen3OmniMoeConfig
[[autodoc]] Qwen3OmniMoeConfig
@@ -410,5 +407,3 @@ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
## Qwen3OmniMoeTalkerCodePredictorModelForConditionalGeneration
[[autodoc]] Qwen3OmniMoeTalkerCodePredictorModelForConditionalGeneration
-
-
diff --git a/docs/source/en/model_doc/qwen3_vl.md b/docs/source/en/model_doc/qwen3_vl.md
index c939d5da3cd..dc9ecafeb44 100644
--- a/docs/source/en/model_doc/qwen3_vl.md
+++ b/docs/source/en/model_doc/qwen3_vl.md
@@ -77,6 +77,7 @@ output_text = processor.batch_decode(
)
print(output_text)
```
+
diff --git a/docs/source/en/model_doc/qwen3_vl_moe.md b/docs/source/en/model_doc/qwen3_vl_moe.md
index 6e27adf915d..e36336d90a4 100644
--- a/docs/source/en/model_doc/qwen3_vl_moe.md
+++ b/docs/source/en/model_doc/qwen3_vl_moe.md
@@ -77,6 +77,7 @@ output_text = processor.batch_decode(
)
print(output_text)
```
+
diff --git a/docs/source/en/model_doc/recurrent_gemma.md b/docs/source/en/model_doc/recurrent_gemma.md
index 1cd4e784a5b..2d7c940e00a 100644
--- a/docs/source/en/model_doc/recurrent_gemma.md
+++ b/docs/source/en/model_doc/recurrent_gemma.md
@@ -31,16 +31,14 @@ The abstract from the paper is the following:
Tips:
-- The original checkpoints can be converted using the conversion script [`src/transformers/models/recurrent_gemma/convert_recurrent_gemma_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/recurrent_gemma/convert_recurrent_gemma_to_hf.py).
+- The original checkpoints can be converted using the conversion script [`src/transformers/models/recurrent_gemma/convert_recurrent_gemma_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/recurrent_gemma/convert_recurrent_gemma_to_hf.py).
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/google-deepmind/recurrentgemma).
-
## RecurrentGemmaConfig
[[autodoc]] RecurrentGemmaConfig
-
## RecurrentGemmaModel
[[autodoc]] RecurrentGemmaModel
@@ -50,4 +48,3 @@ This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). T
[[autodoc]] RecurrentGemmaForCausalLM
- forward
-
diff --git a/docs/source/en/model_doc/reformer.md b/docs/source/en/model_doc/reformer.md
index f94134609d2..c48de93d47d 100644
--- a/docs/source/en/model_doc/reformer.md
+++ b/docs/source/en/model_doc/reformer.md
@@ -89,7 +89,6 @@ equal to `config.hidden_size` and `config.axial_pos_shape` is set to a tuple \\(
product has to be equal to `config.max_embedding_size`, which during training has to be equal to the *sequence
length* of the `input_ids`.
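+
+To see these constraints with concrete numbers, the snippet below simply prints the relevant [`ReformerConfig`] defaults (library defaults at the time of writing; it is an illustration of the relationship, not a training recipe):
+
+```python
+from transformers import ReformerConfig
+
+config = ReformerConfig()
+print(config.axial_pos_shape, config.axial_pos_embds_dim, config.hidden_size)
+# [64, 64]  -> 64 * 64 = 4096 positions, i.e. the maximum sequence length
+# [64, 192] -> 64 + 192 = 256 = config.hidden_size
+```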
-
### LSH Self Attention
In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
@@ -122,7 +121,6 @@ Using LSH self attention, the memory and time complexity of the query-key matmul
\\(\mathcal{O}(n_s \times n_s)\\) to \\(\mathcal{O}(n_s \times \log(n_s))\\), which usually represents the memory
and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length.
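+
+The bucketing idea can be illustrated with a toy snippet. This is only a sketch of the angular LSH scheme (random rotations assigning positions to buckets), not the actual Reformer implementation, and all sizes are arbitrary:
+
+```python
+import torch
+
+seq_len, head_dim, n_buckets = 512, 64, 32
+qk = torch.randn(seq_len, head_dim)  # shared query/key vectors (tied projections)
+
+# Random rotations: each position is assigned to the bucket its vector aligns with most.
+rotations = torch.randn(head_dim, n_buckets // 2)
+rotated = qk @ rotations
+buckets = torch.argmax(torch.cat([rotated, -rotated], dim=-1), dim=-1)  # (seq_len,)
+
+# Attention scores are then only computed inside (chunks of) each bucket,
+# e.g. for bucket 0, instead of over the full seq_len x seq_len matrix.
+members = (buckets == 0).nonzero(as_tuple=True)[0]
+scores = qk[members] @ qk[members].T / head_dim**0.5
+```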
-
### Local Self Attention
Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is
@@ -134,7 +132,6 @@ Using Local self attention, the memory and time complexity of the query-key matm
\\(\mathcal{O}(n_s \times n_s)\\) to \\(\mathcal{O}(n_s \times \log(n_s))\\), which usually represents the memory
and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length.
-
### Training
During training, we must ensure that the sequence length is set to a value that can be divided by the least common
diff --git a/docs/source/en/model_doc/retribert.md b/docs/source/en/model_doc/retribert.md
index 871bdc6e8c8..829fed24215 100644
--- a/docs/source/en/model_doc/retribert.md
+++ b/docs/source/en/model_doc/retribert.md
@@ -39,7 +39,6 @@ pair of BERT encoders with lower-dimension projection for dense semantic indexin
This model was contributed by [yjernite](https://huggingface.co/yjernite). Code to train and use the model can be
found [here](https://github.com/huggingface/transformers/tree/main/examples/research-projects/distillation).
-
## RetriBertConfig
[[autodoc]] RetriBertConfig
diff --git a/docs/source/en/model_doc/roberta.md b/docs/source/en/model_doc/roberta.md
index 580ff09e72c..896156520c5 100644
--- a/docs/source/en/model_doc/roberta.md
+++ b/docs/source/en/model_doc/roberta.md
@@ -28,7 +28,6 @@ rendered properly in your Markdown viewer.
You can find all the original RoBERTa checkpoints under the [Facebook AI](https://huggingface.co/FacebookAI) organization.
-
> [!TIP]
> Click on the RoBERTa models in the right sidebar for more examples of how to apply RoBERTa to different language tasks.
diff --git a/docs/source/en/model_doc/rt_detr.md b/docs/source/en/model_doc/rt_detr.md
index 02accfd6d9f..d4c85f63fc3 100644
--- a/docs/source/en/model_doc/rt_detr.md
+++ b/docs/source/en/model_doc/rt_detr.md
@@ -23,7 +23,6 @@ rendered properly in your Markdown viewer.
## Overview
-
The RT-DETR model was proposed in [DETRs Beat YOLOs on Real-time Object Detection](https://huggingface.co/papers/2304.08069) by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu.
RT-DETR, short for "Real-Time DEtection Transformer", is an object detection model designed to achieve real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them.
@@ -39,7 +38,6 @@ alt="drawing" width="600"/>
The model version was contributed by [rafaelpadilla](https://huggingface.co/rafaelpadilla) and [sangbumchoi](https://github.com/SangbumChoi). The original code can be found [here](https://github.com/lyuwenyu/RT-DETR/).
-
## Usage tips
Initially, an image is processed using a pre-trained convolutional neural network, specifically a ResNet-D variant as referenced in the original code. This network extracts features from the final three layers of the architecture. Following this, a hybrid encoder is employed to convert the multi-scale features into a sequential array of image features. Then, a decoder, equipped with auxiliary prediction heads, is used to refine the object queries. This process facilitates the direct generation of bounding boxes, eliminating the need for any additional post-processing to acquire the logits and coordinates for the bounding boxes.
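+
+For instance, end-to-end inference can look like the short sketch below. The checkpoint name and the `threshold` value are assumptions chosen for illustration:
+
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import RTDetrImageProcessor, RTDetrForObjectDetection
+
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
+model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")
+
+inputs = processor(images=image, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# Convert raw logits and boxes into labeled detections for the original image size.
+results = processor.post_process_object_detection(
+    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
+)
+print(results[0]["labels"], results[0]["scores"])
+```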
diff --git a/docs/source/en/model_doc/rt_detr_v2.md b/docs/source/en/model_doc/rt_detr_v2.md
index f5eb54625c8..3f814ce0d64 100644
--- a/docs/source/en/model_doc/rt_detr_v2.md
+++ b/docs/source/en/model_doc/rt_detr_v2.md
@@ -34,9 +34,9 @@ The abstract from the paper is the following:
This model was contributed by [jadechoghari](https://huggingface.co/jadechoghari).
The original code can be found [here](https://github.com/lyuwenyu/RT-DETR).
-## Usage tips
+## Usage tips
-This second version of RT-DETR improves how the decoder finds objects in an image.
+This second version of RT-DETR improves how the decoder finds objects in an image.
- **better sampling** – adjusts offsets so the model looks at the right areas
- **flexible attention** – can use smooth (bilinear) or fixed (discrete) sampling
@@ -85,17 +85,15 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- See also: [Object detection task guide](../tasks/object_detection).
- Notebooks for [inference](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_inference.ipynb) and [fine-tuning](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_finetune_on_a_custom_dataset.ipynb) RT-DETRv2 on a custom dataset (🌎).
-
## RTDetrV2Config
[[autodoc]] RTDetrV2Config
-
## RTDetrV2Model
[[autodoc]] RTDetrV2Model
- forward
-
+
## RTDetrV2ForObjectDetection
[[autodoc]] RTDetrV2ForObjectDetection
diff --git a/docs/source/en/model_doc/rwkv.md b/docs/source/en/model_doc/rwkv.md
index 4d9d6bbb886..c0bd1273f61 100644
--- a/docs/source/en/model_doc/rwkv.md
+++ b/docs/source/en/model_doc/rwkv.md
@@ -58,7 +58,7 @@ torch.allclose(torch.cat([output_one, output_two], dim=1), output_whole, atol=1e
If you want to make sure the model stops generating when `'\n\n'` is detected, we recommend using the following stopping criteria:
-```python
+```python
from transformers import StoppingCriteria
class RwkvStoppingCriteria(StoppingCriteria):
diff --git a/docs/source/en/model_doc/sam.md b/docs/source/en/model_doc/sam.md
index 49a58254630..65286eb8428 100644
--- a/docs/source/en/model_doc/sam.md
+++ b/docs/source/en/model_doc/sam.md
@@ -41,7 +41,6 @@ Tips:
- Fine-tuning the model is not supported yet
- According to the paper, textual input should also be supported. However, at the time of writing, this does not seem to be supported according to [the official repository](https://github.com/facebookresearch/segment-anything/issues/4#issuecomment-1497626844).
-
This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/facebookresearch/segment-anything).
@@ -98,6 +97,7 @@ masks = processor.image_processor.post_process_masks(
)
scores = outputs.iou_scores
```
+
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SAM.
diff --git a/docs/source/en/model_doc/sam_hq.md b/docs/source/en/model_doc/sam_hq.md
index 2bd14229c37..9dea1de7a77 100644
--- a/docs/source/en/model_doc/sam_hq.md
+++ b/docs/source/en/model_doc/sam_hq.md
@@ -25,7 +25,6 @@ The model is an enhancement to the original SAM model that produces significantl

-
SAM-HQ introduces several key improvements over the original SAM model:
1. High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction
@@ -105,7 +104,6 @@ masks = processor.image_processor.post_process_masks(
scores = outputs.iou_scores
```
-
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SAM-HQ:
@@ -137,7 +135,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] SamHQVisionModel
-
## SamHQModel
[[autodoc]] SamHQModel
diff --git a/docs/source/en/model_doc/seamless_m4t.md b/docs/source/en/model_doc/seamless_m4t.md
index c6f3a56f9ba..e7fc00d047c 100644
--- a/docs/source/en/model_doc/seamless_m4t.md
+++ b/docs/source/en/model_doc/seamless_m4t.md
@@ -67,7 +67,6 @@ Here is how to use the processor to process text and audio:
>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```
-
### Speech
[`SeamlessM4TModel`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation:
@@ -84,7 +83,7 @@ With basically the same code, I've translated English text and Arabic speech to
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`].
This time, let's translate to French.
-```python
+```python
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
@@ -96,11 +95,10 @@ This time, let's translate to French.
### Tips
-
#### 1. Use dedicated models
[`SeamlessM4TModel`] is the top-level Transformers model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
-For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code:
+For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code:
```python
>>> from transformers import SeamlessM4TForSpeechToSpeech
@@ -130,7 +128,6 @@ Use `return_intermediate_token_ids=True` with [`SeamlessM4TModel`] to return bot
## Model architecture
-
SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.
Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://huggingface.co/papers/2010.05646) architecture is placed on top of the second seq2seq model.
@@ -142,7 +139,6 @@ Here's how the generation process works:
- If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
-
This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication).
## SeamlessM4TModel
@@ -150,19 +146,16 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
[[autodoc]] SeamlessM4TModel
- generate
-
## SeamlessM4TForTextToSpeech
[[autodoc]] SeamlessM4TForTextToSpeech
- generate
-
## SeamlessM4TForSpeechToSpeech
[[autodoc]] SeamlessM4TForSpeechToSpeech
- generate
-
## SeamlessM4TForTextToText
[[autodoc]] transformers.SeamlessM4TForTextToText
@@ -179,7 +172,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
[[autodoc]] SeamlessM4TConfig
-
## SeamlessM4TTokenizer
[[autodoc]] SeamlessM4TTokenizer
@@ -189,7 +181,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
- create_token_type_ids_from_sequences
- save_vocabulary
-
## SeamlessM4TTokenizerFast
[[autodoc]] SeamlessM4TTokenizerFast
@@ -209,7 +200,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
[[autodoc]] SeamlessM4TCodeHifiGan
-
## SeamlessM4THifiGan
[[autodoc]] SeamlessM4THifiGan
@@ -221,5 +211,3 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
## SeamlessM4TTextToUnitForConditionalGeneration
[[autodoc]] SeamlessM4TTextToUnitForConditionalGeneration
-
-
diff --git a/docs/source/en/model_doc/seamless_m4t_v2.md b/docs/source/en/model_doc/seamless_m4t_v2.md
index 8a4ab82d2e9..716718072a4 100644
--- a/docs/source/en/model_doc/seamless_m4t_v2.md
+++ b/docs/source/en/model_doc/seamless_m4t_v2.md
@@ -67,7 +67,6 @@ Here is how to use the processor to process text and audio:
>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```
-
### Speech
[`SeamlessM4Tv2Model`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation:
@@ -84,7 +83,7 @@ With basically the same code, I've translated English text and Arabic speech to
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4Tv2Model.generate`].
This time, let's translate to French.
-```python
+```python
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
@@ -96,11 +95,10 @@ This time, let's translate to French.
### Tips
-
#### 1. Use dedicated models
[`SeamlessM4Tv2Model`] is the top-level Transformers model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
-For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code:
+For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code:
```python
>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech
@@ -161,7 +159,6 @@ Here's how the generation process works:
- If speech generation is required, the second seq2seq model generates unit tokens in a non-autoregressive way.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
-
This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication).
## SeamlessM4Tv2Model
@@ -169,19 +166,16 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
[[autodoc]] SeamlessM4Tv2Model
- generate
-
## SeamlessM4Tv2ForTextToSpeech
[[autodoc]] SeamlessM4Tv2ForTextToSpeech
- generate
-
## SeamlessM4Tv2ForSpeechToSpeech
[[autodoc]] SeamlessM4Tv2ForSpeechToSpeech
- generate
-
## SeamlessM4Tv2ForTextToText
[[autodoc]] transformers.SeamlessM4Tv2ForTextToText
diff --git a/docs/source/en/model_doc/segformer.md b/docs/source/en/model_doc/segformer.md
index 756c98d45f0..a6b407e5879 100644
--- a/docs/source/en/model_doc/segformer.md
+++ b/docs/source/en/model_doc/segformer.md
@@ -71,8 +71,6 @@ logits = outputs.logits # shape [batch, num_labels, height, width]
-
-
## Notes
- SegFormer works with **any input size**, padding inputs to be divisible by `config.patch_sizes`.
diff --git a/docs/source/en/model_doc/seggpt.md b/docs/source/en/model_doc/seggpt.md
index 9e8c08cf2d2..a5568d5c80e 100644
--- a/docs/source/en/model_doc/seggpt.md
+++ b/docs/source/en/model_doc/seggpt.md
@@ -74,7 +74,6 @@ mask = image_processor.post_process_semantic_segmentation(outputs, target_sizes,
This model was contributed by [EduardoPacheco](https://huggingface.co/EduardoPacheco).
The original code can be found [here](https://github.com/baaivision/Painter/tree/main).
-
## SegGptConfig
[[autodoc]] SegGptConfig
diff --git a/docs/source/en/model_doc/shieldgemma2.md b/docs/source/en/model_doc/shieldgemma2.md
index 99ffde6288f..871cdd31db7 100644
--- a/docs/source/en/model_doc/shieldgemma2.md
+++ b/docs/source/en/model_doc/shieldgemma2.md
@@ -86,7 +86,6 @@ output = model(**inputs)
print(output.probabilities)
```
-
## ShieldGemma2Processor
[[autodoc]] ShieldGemma2Processor
diff --git a/docs/source/en/model_doc/siglip.md b/docs/source/en/model_doc/siglip.md
index c0eb9a8ac6b..bf9c0a46034 100644
--- a/docs/source/en/model_doc/siglip.md
+++ b/docs/source/en/model_doc/siglip.md
@@ -31,7 +31,6 @@ Unlike CLIP, SigLIP employs a pairwise sigmoid loss on image-text pairs during t
You can find all the original SigLIP checkpoints under the [SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) collection.
-
> [!TIP]
> Click on the SigLIP models in the right sidebar for more examples of how to apply SigLIP to different image and text tasks.
@@ -107,12 +106,14 @@ logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```
+
## Notes
- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size.
- When using the standalone [`SiglipTokenizer`] or [`SiglipProcessor`], make sure to pass `padding="max_length"` because that is how the model was trained.
- To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor.
- Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention.
+
```py
# pip install -U flash-attn --no-build-isolation
@@ -126,7 +127,6 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
)
```
-
## SiglipConfig
[[autodoc]] SiglipConfig
@@ -179,7 +179,6 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
[[autodoc]] SiglipVisionModel
- forward
-
## SiglipForImageClassification
[[autodoc]] SiglipForImageClassification
diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md
index f2684c6defc..6a058f8907a 100644
--- a/docs/source/en/model_doc/siglip2.md
+++ b/docs/source/en/model_doc/siglip2.md
@@ -32,7 +32,6 @@ rendered properly in your Markdown viewer.
- NaFlex supports different resolutions and maintains the native image aspect ratio
- FixRes supports fixed resolutions and is backwards compatible with [SigLIP](./siglip)
-
You can find all the original SigLIP2 checkpoints under the [SigLIP2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) collection.
> [!TIP]
@@ -157,6 +156,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
NaFlex resizes the input image so the height and width are multiples of the patch size after resizing. It keeps the aspect ratio distortion as low as possible and produces a sequence length of at most the desired target sequence length (`max_num_patches`). After resizing, the image is split into a sequence of patches and a mask with padding information is added.
- Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention.
+
```py
# pip install -U flash-attn --no-build-isolation
@@ -169,6 +169,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
device_map=device,
)
```
+
## Siglip2Config
[[autodoc]] Siglip2Config
diff --git a/docs/source/en/model_doc/smollm3.md b/docs/source/en/model_doc/smollm3.md
index da98a15e33b..db2ddd33601 100644
--- a/docs/source/en/model_doc/smollm3.md
+++ b/docs/source/en/model_doc/smollm3.md
@@ -139,7 +139,6 @@ outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
-
## Notes
- Ensure your Transformers library version is up-to-date. SmolLM3 requires Transformers>=4.53.0 for full support.
diff --git a/docs/source/en/model_doc/smolvlm.md b/docs/source/en/model_doc/smolvlm.md
index c9a886ac876..5f74fa60ba0 100644
--- a/docs/source/en/model_doc/smolvlm.md
+++ b/docs/source/en/model_doc/smolvlm.md
@@ -39,6 +39,7 @@ If `do_resize` is set to `True`, the model resizes images so that the longest ed
The default resizing behavior can be customized by passing a dictionary to the `size` parameter. For example, `{"longest_edge": 4 * 512}` is the default, but you can change it to a different value if needed.
Here’s how to control resizing and set a custom size:
+
```python
image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512)
```
@@ -47,8 +48,6 @@ Additionally, the `max_image_size` parameter, which controls the size of each sq
This model was contributed by [orrzohar](https://huggingface.co/orrzohar).
-
-
## Usage example
### Single Media inference
diff --git a/docs/source/en/model_doc/stablelm.md b/docs/source/en/model_doc/stablelm.md
index 29f32a0004e..e47598a8f85 100644
--- a/docs/source/en/model_doc/stablelm.md
+++ b/docs/source/en/model_doc/stablelm.md
@@ -92,7 +92,6 @@ Now, to run the model with Flash Attention 2, refer to the snippet below:
['The weather is always wonderful in Costa Rica, which makes it a prime destination for retirees. That’s where the Pensionado program comes in, offering']
```
-
## StableLmConfig
[[autodoc]] StableLmConfig
diff --git a/docs/source/en/model_doc/starcoder2.md b/docs/source/en/model_doc/starcoder2.md
index 2d27aed399c..b67e5dedd2c 100644
--- a/docs/source/en/model_doc/starcoder2.md
+++ b/docs/source/en/model_doc/starcoder2.md
@@ -34,7 +34,7 @@ The abstract of the paper is the following:
## License
The models are licensed under the [BigCode OpenRAIL-M v1 license agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
-
+
## Usage tips
The StarCoder2 models can be found in the [HuggingFace hub](https://huggingface.co/collections/bigcode/starcoder2-65de6da6e87db3383572be1a). You can find some examples for inference and fine-tuning in StarCoder2's [GitHub repo](https://github.com/bigcode-project/starcoder2).
diff --git a/docs/source/en/model_doc/superglue.md b/docs/source/en/model_doc/superglue.md
index 81bb91861de..d25ca822e4c 100644
--- a/docs/source/en/model_doc/superglue.md
+++ b/docs/source/en/model_doc/superglue.md
@@ -153,4 +153,3 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
[[autodoc]] SuperGlueForKeypointMatching
- forward
-
diff --git a/docs/source/en/model_doc/superpoint.md b/docs/source/en/model_doc/superpoint.md
index b86f7fd4aa7..26ffb2c8b4b 100644
--- a/docs/source/en/model_doc/superpoint.md
+++ b/docs/source/en/model_doc/superpoint.md
@@ -33,8 +33,6 @@ You can find all the original SuperPoint checkpoints under the [Magic Leap Commu
>
> Click on the SuperPoint models in the right sidebar for more examples of how to apply SuperPoint to different computer vision tasks.
-
-
The example below demonstrates how to detect interest points in an image with the [`AutoModel`] class.
@@ -101,6 +99,7 @@ processed_outputs = processor.post_process_keypoint_detection(outputs, [image_si
```
- You can then print the keypoints on the image of your choice to visualize the result:
+
```py
import matplotlib.pyplot as plt
plt.axis("off")
diff --git a/docs/source/en/model_doc/swin.md b/docs/source/en/model_doc/swin.md
index f6a994ef69b..81142f6c411 100644
--- a/docs/source/en/model_doc/swin.md
+++ b/docs/source/en/model_doc/swin.md
@@ -47,6 +47,7 @@ pipeline = pipeline(
)
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```
+
@@ -79,6 +80,7 @@ class_labels = model.config.id2label
predicted_class_label = class_labels[predicted_class_id]
print(f"The predicted class label is: {predicted_class_label}")
```
+
diff --git a/docs/source/en/model_doc/swinv2.md b/docs/source/en/model_doc/swinv2.md
index 507b79fc7cf..0dc008767ac 100644
--- a/docs/source/en/model_doc/swinv2.md
+++ b/docs/source/en/model_doc/swinv2.md
@@ -81,7 +81,7 @@ print(f"The predicted class label is: {predicted_class_label}")
## Notes
-- Swin Transformer V2 can pad the inputs for any input height and width divisible by `32`.
+- Swin Transformer V2 can pad the inputs for any input height and width divisible by `32`.
- Swin Transformer V2 can be used as a [backbone](../backbones), as sketched in the example below. When `output_hidden_states = True`, it outputs both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`.
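+
+A minimal backbone sketch is shown below. The checkpoint name and input size are assumptions for illustration; other Swin Transformer V2 checkpoints from the Hub should work the same way:
+
+```python
+import torch
+from transformers import AutoBackbone
+
+backbone = AutoBackbone.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
+
+pixel_values = torch.randn(1, 3, 256, 256)
+with torch.no_grad():
+    outputs = backbone(pixel_values)
+
+# Feature maps come back in (batch, num_channels, height, width) format.
+print(outputs.feature_maps[-1].shape)
+```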
## Swinv2Config
diff --git a/docs/source/en/model_doc/switch_transformers.md b/docs/source/en/model_doc/switch_transformers.md
index efa6bd499db..5eb27a9e7d8 100644
--- a/docs/source/en/model_doc/switch_transformers.md
+++ b/docs/source/en/model_doc/switch_transformers.md
@@ -27,7 +27,6 @@ rendered properly in your Markdown viewer.
You can find all the original Switch Transformers checkpoints under the [Switch Transformer](https://huggingface.co/collections/google/switch-transformers-release-6548c35c6507968374b56d1f) collection.
-
> [!TIP]
> This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).
>
@@ -99,7 +98,6 @@ outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
-
## SwitchTransformersConfig
[[autodoc]] SwitchTransformersConfig
diff --git a/docs/source/en/model_doc/t5gemma.md b/docs/source/en/model_doc/t5gemma.md
index aa8d3b7880e..00dde7ab93a 100644
--- a/docs/source/en/model_doc/t5gemma.md
+++ b/docs/source/en/model_doc/t5gemma.md
@@ -39,7 +39,6 @@ The example below demonstrates how to chat with the model with [`Pipeline`] or t
-
```python
import torch
from transformers import pipeline
@@ -89,6 +88,7 @@ print(tokenizer.decode(outputs[0]))
```
echo -e "Write me a poem about Machine Learning. Answer:" | transformers run --task text2text-generation --model google/t5gemma-2b-2b-prefixlm --device 0
```
+
diff --git a/docs/source/en/model_doc/t5v1.1.md b/docs/source/en/model_doc/t5v1.1.md
index 4ad072addcc..62787d5f9d6 100644
--- a/docs/source/en/model_doc/t5v1.1.md
+++ b/docs/source/en/model_doc/t5v1.1.md
@@ -68,7 +68,6 @@ Google has released the following variants:
- [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl).
-
Refer to [T5's documentation page](t5) for all API reference, tips, code examples and notebooks.
diff --git a/docs/source/en/model_doc/table-transformer.md b/docs/source/en/model_doc/table-transformer.md
index b35df2aec31..c982d305907 100644
--- a/docs/source/en/model_doc/table-transformer.md
+++ b/docs/source/en/model_doc/table-transformer.md
@@ -43,8 +43,8 @@ alt="drawing" width="600"/>
Table detection and table structure recognition clarified. Taken from the original paper.
-The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in
-documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition)
+The authors released two models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in
+documents and one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition)
(the task of recognizing the individual rows, columns etc. in a table).
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be
diff --git a/docs/source/en/model_doc/tapas.md b/docs/source/en/model_doc/tapas.md
index 4dfac5edce3..c5144121df6 100644
--- a/docs/source/en/model_doc/tapas.md
+++ b/docs/source/en/model_doc/tapas.md
@@ -76,7 +76,6 @@ To summarize:
| Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
| Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |
-
Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below.
```py
@@ -105,7 +104,6 @@ Of course, you don't necessarily have to follow one of these three ways in which
>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
```
-
What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See [here](https://github.com/google-research/tapas/issues/91#issuecomment-735719340) for more info.
For a list of all pre-trained and fine-tuned TAPAS checkpoints available on HuggingFace's hub, see [here](https://huggingface.co/models?search=tapas).
@@ -128,7 +126,6 @@ The tables themselves should be present in a folder, each table being a separate
**STEP 3: Convert your data into tensors using TapasTokenizer**
-
Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use [`TapasTokenizer`] to convert table-question pairs into `input_ids`, `attention_mask`, `token_type_ids` and so on. Again, based on which of the three cases you picked above, [`TapasForQuestionAnswering`] requires different
inputs to be fine-tuned:
@@ -214,13 +211,11 @@ Of course, this only shows how to encode a single training example. It is advise
>>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
```
-
Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. In case your dataset involves conversational questions (such as in SQA), then you should first group together the `queries`, `answer_coordinates` and `answer_text` per table (in the order of their `position`
index) and batch encode each table with its questions. This will make sure that the `prev_labels` token types (see docs of [`TapasTokenizer`]) are set correctly. See [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for more info.
**STEP 4: Train (fine-tune) the model**
-
You can then fine-tune [`TapasForQuestionAnswering`] as follows (shown here for the weak supervision for aggregation case):
```py
@@ -272,10 +267,8 @@ You can then fine-tune [`TapasForQuestionAnswering`] as follows (shown here for
... optimizer.step()
```
-
## Usage: inference
-
Here we explain how you can use [`TapasForQuestionAnswering`] for inference (i.e. making predictions on new data). For inference, only `input_ids`, `attention_mask` and `token_type_ids` (which you can obtain using [`TapasTokenizer`]) have to be provided to the model to obtain the logits. Next, you can use the handy [`~models.tapas.tokenization_tapas.convert_logits_to_predictions`] method to convert these into predicted coordinates and optional aggregation indices.
However, note that inference is **different** depending on whether or not the setup is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that:
@@ -333,7 +326,6 @@ What is the total number of movies?
Predicted answer: SUM > 87, 53, 69
```
-
In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such that the `prev_labels` token types can be overwritten by the predicted `labels` of the previous table-question pair. Again, more info can be found in [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb).
## Resources
diff --git a/docs/source/en/model_doc/textnet.md b/docs/source/en/model_doc/textnet.md
index 9c29a8b16be..c986b17dbff 100644
--- a/docs/source/en/model_doc/textnet.md
+++ b/docs/source/en/model_doc/textnet.md
@@ -34,7 +34,7 @@ This model was contributed by [Raghavan](https://huggingface.co/Raghavan), [jade
## Usage tips
-TextNet is mainly used as a backbone network for the architecture search of text detection. Each stage of the backbone network is comprised of a stride-2 convolution and searchable blocks.
+TextNet is mainly used as a backbone network for architecture search in text detection. Each stage of the backbone network consists of a stride-2 convolution and searchable blocks.
Specifically, we present a layer-level candidate set, defined as {conv3×3, conv1×3, conv3×1, identity}. As the 1×3 and 3×1 convolutions have asymmetric kernels and oriented structure priors, they may help to capture the features of extreme aspect-ratio and rotated text lines.
TextNet is the backbone for Fast, but it can also be used for efficient text/image classification. We add `TextNetForImageClassification` so that people can train an image classifier on top of the pre-trained TextNet weights.
@@ -62,4 +62,3 @@ TextNet is the backbone for Fast, but can also be used as an efficient text/imag
[[autodoc]] TextNetForImageClassification
- forward
-
diff --git a/docs/source/en/model_doc/time_series_transformer.md b/docs/source/en/model_doc/time_series_transformer.md
index c38671f00fb..921b7e01d4b 100644
--- a/docs/source/en/model_doc/time_series_transformer.md
+++ b/docs/source/en/model_doc/time_series_transformer.md
@@ -61,7 +61,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- Check out the Time Series Transformer blog-post in HuggingFace blog: [Probabilistic Time Series Forecasting with 🤗 Transformers](https://huggingface.co/blog/time-series-transformers)
-
## TimeSeriesTransformerConfig
[[autodoc]] TimeSeriesTransformerConfig
diff --git a/docs/source/en/model_doc/timesfm.md b/docs/source/en/model_doc/timesfm.md
index 83dee48e71b..e8938202ee9 100644
--- a/docs/source/en/model_doc/timesfm.md
+++ b/docs/source/en/model_doc/timesfm.md
@@ -25,16 +25,13 @@ rendered properly in your Markdown viewer.
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in [A decoder-only foundation model for time-series forecasting](https://huggingface.co/papers/2310.10688) by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder-only model that takes non-overlapping patches of time-series data as input and predicts output patches in an autoregressive fashion.
-
The abstract from the paper is the following:
*Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.*
-
This model was contributed by [kashif](https://huggingface.co/kashif).
The original code can be found [here](https://github.com/google-research/timesfm).
-
To use the model:
```python
diff --git a/docs/source/en/model_doc/transfo-xl.md b/docs/source/en/model_doc/transfo-xl.md
index 5d9b92f7946..0bd1b0f57e1 100644
--- a/docs/source/en/model_doc/transfo-xl.md
+++ b/docs/source/en/model_doc/transfo-xl.md
@@ -90,7 +90,6 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o
- Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention scores. This allows the model to pay attention to information that was in the previous segment as well as the current one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
- This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would give the same results in the current input and the current hidden state at a given position) and needs to make some adjustments in the way attention scores are computed.
-
TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch; see [issue #36035](https://github.com/pytorch/pytorch/issues/36035).
diff --git a/docs/source/en/model_doc/trocr.md b/docs/source/en/model_doc/trocr.md
index 6346977dafa..da5c71edde3 100644
--- a/docs/source/en/model_doc/trocr.md
+++ b/docs/source/en/model_doc/trocr.md
@@ -14,8 +14,6 @@ rendered properly in your Markdown viewer.
specific language governing permissions and limitations under the License. -->
*This model was released on 2021-09-21 and added to Hugging Face Transformers on 2021-10-13.*
-
-

@@ -32,13 +30,11 @@ You can find all the original TrOCR checkpoints under the [Microsoft](https://hu
alt="drawing" width="600"/>
TrOCR architecture. Taken from the original paper.
-
> [!TIP]
> This model was contributed by [nielsr](https://huggingface.co/nielsr).
>
> Click on the TrOCR models in the right sidebar for more examples of how to apply TrOCR to different image and text tasks.
-
The example below demonstrates how to perform optical character recognition (OCR) with the [`AutoModel`] class.
@@ -113,7 +109,6 @@ print(generated_text)
- A notebook on [inference with TrOCR](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Inference_with_TrOCR_%2B_Gradio_demo.ipynb) and Gradio demo.
- A notebook on [evaluating TrOCR on the IAM test set](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Evaluating_TrOCR_base_handwritten_on_the_IAM_test_set.ipynb).
-
## TrOCRConfig
[[autodoc]] TrOCRConfig
diff --git a/docs/source/en/model_doc/tvp.md b/docs/source/en/model_doc/tvp.md
index 49a538ffa8c..2df4da02555 100644
--- a/docs/source/en/model_doc/tvp.md
+++ b/docs/source/en/model_doc/tvp.md
@@ -47,6 +47,7 @@ The [`TvpProcessor`] wraps [`BertTokenizer`] and [`TvpImageProcessor`] into a si
encode the text and prepare the images respectively.
The following example shows how to run temporal video grounding using [`TvpProcessor`] and [`TvpForVideoGrounding`].
+
```python
import av
import cv2
@@ -165,7 +166,6 @@ Tips:
- Checkpoints for pre-trained [tvp-base](https://huggingface.co/Intel/tvp-base) is released.
- Please refer to [Table 2](https://huggingface.co/papers/2303.04995) for TVP's performance on Temporal Video Grounding task.
-
## TvpConfig
[[autodoc]] TvpConfig
diff --git a/docs/source/en/model_doc/umt5.md b/docs/source/en/model_doc/umt5.md
index 349dcecf03c..784cc9974df 100644
--- a/docs/source/en/model_doc/umt5.md
+++ b/docs/source/en/model_doc/umt5.md
@@ -39,7 +39,7 @@ Google has released the following variants:
This model was contributed by [agemagician](https://huggingface.co/agemagician) and [stefan-it](https://huggingface.co/stefan-it). The original code can be
found [here](https://github.com/google-research/t5x).
-## Usage tips
+## Usage tips
- UMT5 was only pre-trained on [mC4](https://huggingface.co/datasets/mc4) excluding any supervised training.
Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5 model.
@@ -67,7 +67,7 @@ The conversion script is also different because the model was saved in t5x's lat
['nyone who drink a alcohol A A. This I']
```
-
+
Refer to [T5's documentation page](t5) for more tips, code examples and notebooks.
@@ -105,4 +105,3 @@ Refer to [T5's documentation page](t5) for more tips, code examples and notebook
[[autodoc]] UMT5ForQuestionAnswering
- forward
-
diff --git a/docs/source/en/model_doc/univnet.md b/docs/source/en/model_doc/univnet.md
index e20bc5c405e..7a580692833 100644
--- a/docs/source/en/model_doc/univnet.md
+++ b/docs/source/en/model_doc/univnet.md
@@ -69,7 +69,6 @@ write("sample_audio.wav", feature_extractor.sampling_rate, audio)
This model was contributed by [dg845](https://huggingface.co/dg845).
To the best of my knowledge, there is no official code release, but an unofficial implementation can be found at [maum-ai/univnet](https://github.com/maum-ai/univnet) with pretrained checkpoints [here](https://github.com/maum-ai/univnet#pre-trained-model).
-
## UnivNetConfig
[[autodoc]] UnivNetConfig
diff --git a/docs/source/en/model_doc/van.md b/docs/source/en/model_doc/van.md
index 0e07e314bee..0a4ded43021 100644
--- a/docs/source/en/model_doc/van.md
+++ b/docs/source/en/model_doc/van.md
@@ -74,4 +74,3 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] VanForImageClassification
- forward
-
diff --git a/docs/source/en/model_doc/vaultgemma.md b/docs/source/en/model_doc/vaultgemma.md
index 94d28cc8afe..9d39a5eb7ee 100644
--- a/docs/source/en/model_doc/vaultgemma.md
+++ b/docs/source/en/model_doc/vaultgemma.md
@@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
-->
@@ -45,7 +44,6 @@ command line.
-
```python
from transformers import pipeline
diff --git a/docs/source/en/model_doc/video_llava.md b/docs/source/en/model_doc/video_llava.md
index 6b09367f37c..5b792b33733 100644
--- a/docs/source/en/model_doc/video_llava.md
+++ b/docs/source/en/model_doc/video_llava.md
@@ -27,7 +27,6 @@ rendered properly in your Markdown viewer.
Video-LLaVA is an open-source multimodal LLM trained by fine-tuning Llama/Vicuna on multimodal instruction-following data generated by LLaVA-1.5 and VideoChat. It is an auto-regressive language model, based on the transformer architecture. Video-LLaVA unifies visual representations in the language feature space, enabling an LLM to perform visual reasoning on both images and videos simultaneously.
-
The Video-LLaVA model was proposed in [Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://huggingface.co/papers/2311.10122) by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munang Ning, Peng Jin, Li Yuan.
The abstract from the paper is the following:
@@ -55,18 +54,16 @@ for the LLM*
- Note the model has not been explicitly trained to process multiple images/videos in the same prompt. Although this is technically possible, you may experience inaccurate results.
-- Note that the video inputs should have exactly 8 frames at the input, since the models were trained in that setting.
+- Note that the video inputs should have exactly 8 frames, since the models were trained in that setting.
This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/PKU-YuanGroup/Video-LLaVA).
-
> [!NOTE]
> LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you.
Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be a failure when merging the embeddings.
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.
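As an illustration of the note above, here is a minimal, hedged sketch of copying these attributes onto the processor. The checkpoint name is the one used elsewhere on this page, and the value `1` for `num_additional_image_tokens` is an assumption for illustration only:

```python
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

model_id = "LanguageBind/Video-LLaVA-7B-hf"
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)
processor = VideoLlavaProcessor.from_pretrained(model_id)

# Copy the values from the model config onto the processor so it can expand the placeholders itself.
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
processor.num_additional_image_tokens = 1  # assumption: 1 if the vision backbone adds a CLS token, else 0
```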
-
## Usage example
### Single Media Mode
@@ -126,7 +123,7 @@ For multiple turns conversation change the prompt format to:
### Mixed Media Mode
-The model can also generate from an interleaved image-video inputs. However note, that it was not trained in interleaved image-video setting which might affect the performance. Below is an example usage for mixed media input, add the following lines to the above code snippet:
+The model can also generate from interleaved image-video inputs. However, note that it was not trained in an interleaved image-video setting, which might affect the performance. Below is an example usage for mixed media input; add the following lines to the above code snippet:
```python
from PIL import Image
@@ -150,7 +147,7 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
### Quantization using Bitsandbytes for memory efficiency
-The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. his allows for efficient deployment on resource-constrained cases.
+The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. This allows for efficient deployment in resource-constrained cases.
First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library.
@@ -164,7 +161,6 @@ We value your feedback to help identify bugs before the full release! Check out
Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
-
```python
from transformers import VideoLlavaForConditionalGeneration, BitsAndBytesConfig
@@ -178,7 +174,6 @@ quantization_config = BitsAndBytesConfig(
model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf", quantization_config=quantization_config, device_map="auto")
```
-
### Flash-Attention 2 to speed-up generation
Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
@@ -203,7 +198,6 @@ model = VideoLlavaForConditionalGeneration.from_pretrained(
).to(0)
```
-
## VideoLlavaConfig
[[autodoc]] VideoLlavaConfig
@@ -212,7 +206,6 @@ model = VideoLlavaForConditionalGeneration.from_pretrained(
[[autodoc]] VideoLlavaImageProcessor
-
## VideoLlavaVideoProcessor
[[autodoc]] VideoLlavaVideoProcessor
diff --git a/docs/source/en/model_doc/videomae.md b/docs/source/en/model_doc/videomae.md
index e0ebbaa4288..44fc8b8b5be 100644
--- a/docs/source/en/model_doc/videomae.md
+++ b/docs/source/en/model_doc/videomae.md
@@ -42,13 +42,13 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE).
## Using Scaled Dot Product Attention (SDPA)
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
+PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
+encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
+[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
+SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```
diff --git a/docs/source/en/model_doc/vipllava.md b/docs/source/en/model_doc/vipllava.md
index 0d0a209c27a..fc4aec6ae9b 100644
--- a/docs/source/en/model_doc/vipllava.md
+++ b/docs/source/en/model_doc/vipllava.md
@@ -37,7 +37,6 @@ The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)
-
## Usage tips:
- The architecture is similar to the LLaVA architecture, except that the multi-modal projector takes a set of concatenated vision hidden states and has an additional layernorm layer on that module.
@@ -51,7 +50,6 @@ This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)
Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be a failure when merging the embeddings.
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.
-
- For better results, we recommend users to use the processor's `apply_chat_template()` method to format your prompt correctly. For that you need to construct a conversation history, passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities, as follows:
```python
@@ -88,16 +86,17 @@ print(text_prompt)
```
- If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by VipLLaVa checkpoints:
+
```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: \n###Assistant:
```
For multiple turns conversation:
+
```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: \n###Assistant: ###Human: ###Assistant:
```
-
## VipLlavaConfig
[[autodoc]] VipLlavaConfig
diff --git a/docs/source/en/model_doc/visual_bert.md b/docs/source/en/model_doc/visual_bert.md
index 7a7ac24e4db..a9912144c4f 100644
--- a/docs/source/en/model_doc/visual_bert.md
+++ b/docs/source/en/model_doc/visual_bert.md
@@ -27,7 +27,6 @@ rendered properly in your Markdown viewer.
You can find all the original VisualBERT checkpoints under the [UCLA NLP](https://huggingface.co/uclanlp/models?search=visualbert) organization.
-
> [!TIP]
> This model was contributed by [gchhablani](https://huggingface.co/gchhablani).
> Click on the VisualBERT models in the right sidebar for more examples of how to apply VisualBERT to different image and language tasks.
diff --git a/docs/source/en/model_doc/vit_hybrid.md b/docs/source/en/model_doc/vit_hybrid.md
index 86c2c7229f5..15fa6fad474 100644
--- a/docs/source/en/model_doc/vit_hybrid.md
+++ b/docs/source/en/model_doc/vit_hybrid.md
@@ -55,13 +55,13 @@ found [here](https://github.com/google-research/vision_transformer).
## Using Scaled Dot Product Attention (SDPA)
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
+PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
+encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
+[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
+SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```
diff --git a/docs/source/en/model_doc/vit_mae.md b/docs/source/en/model_doc/vit_mae.md
index b8b9867e881..1099019a842 100644
--- a/docs/source/en/model_doc/vit_mae.md
+++ b/docs/source/en/model_doc/vit_mae.md
@@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2021-11-11 and added to Hugging Face Transformers on 2022-01-18.*
-

diff --git a/docs/source/en/model_doc/vit_msn.md b/docs/source/en/model_doc/vit_msn.md
index 5b727f34256..6d10dd59a99 100644
--- a/docs/source/en/model_doc/vit_msn.md
+++ b/docs/source/en/model_doc/vit_msn.md
@@ -40,11 +40,11 @@ while producing representations of a high semantic level that perform competitiv
on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy,
and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark.*
-

+
MSN architecture. Taken from the original paper.
-This model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/facebookresearch/msn).
+This model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/facebookresearch/msn).
## Usage tips
@@ -58,13 +58,13 @@ labels when fine-tuned.
### Using Scaled Dot Product Attention (SDPA)
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
+PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
+encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
+[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
+SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```
diff --git a/docs/source/en/model_doc/vits.md b/docs/source/en/model_doc/vits.md
index 2c1777b77f1..664edcb92ae 100644
--- a/docs/source/en/model_doc/vits.md
+++ b/docs/source/en/model_doc/vits.md
@@ -156,4 +156,3 @@ Audio(waveform, rate=model.config.sampling_rate)
[[autodoc]] VitsModel
- forward
-
diff --git a/docs/source/en/model_doc/vivit.md b/docs/source/en/model_doc/vivit.md
index 041f80f61ae..9ee5a10a19f 100644
--- a/docs/source/en/model_doc/vivit.md
+++ b/docs/source/en/model_doc/vivit.md
@@ -32,13 +32,13 @@ This model was contributed by [jegormeister](https://huggingface.co/jegormeister
### Using Scaled Dot Product Attention (SDPA)
-PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
-encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
-[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
+PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
+encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
+[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.
-SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
+SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.
```
@@ -56,8 +56,6 @@ On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32`
|---------------------:|-------------:|----------:|--------------:|----------------------:|---------------------:|-----------------:|
| 100 | 1 | True | 7.122 | 2575.28 | 5932.54 | 130.364 |
-
-
### Inference
| num_batches | batch_size | is cuda | is half | Speedup (%) | Mem eager (MB) | Mem BT (MB) | Mem saved (%) |
|---------------|--------------|-----------|-----------|---------------|------------------|---------------|-----------------|
@@ -65,7 +63,6 @@ On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32`
| 20 | 2 | True | False | 17.146 | 1234.75 | 447.175 | 176.122 |
| 20 | 4 | True | False | 18.093 | 2275.82 | 709.864 | 220.6 |
| 20 | 8 | True | False | 19.284 | 4358.19 | 1233.24 | 253.393 |
-
## VivitConfig
diff --git a/docs/source/en/model_doc/vjepa2.md b/docs/source/en/model_doc/vjepa2.md
index 93960f05189..049c7ff98f2 100644
--- a/docs/source/en/model_doc/vjepa2.md
+++ b/docs/source/en/model_doc/vjepa2.md
@@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2025-06-11 and added to Hugging Face Transformers on 2025-06-11.*
-

@@ -34,7 +33,6 @@ rendered properly in your Markdown viewer.
You can find all original V-JEPA2 checkpoints under the [V-JEPA 2](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6) collection.
-
This model was contributed by [koustuvs](https://huggingface.co/koustuvs), [yonigozlan](https://huggingface.co/yonigozlan) and [qubvel](https://huggingface.co/qubvel-hf). The original code can be found [here](https://github.com/facebookresearch/vjepa2).
## Usage example
diff --git a/docs/source/en/model_doc/voxtral.md b/docs/source/en/model_doc/voxtral.md
index 71f0661c827..56fc84d30d0 100644
--- a/docs/source/en/model_doc/voxtral.md
+++ b/docs/source/en/model_doc/voxtral.md
@@ -43,6 +43,7 @@ Voxtral builds on Ministral-3B by adding audio processing capabilities:
The model supports audio-text instructions, including multi-turn and multi-audio interactions, all processed in batches.
➡️ audio + text instruction
+
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device
@@ -78,7 +79,8 @@ print(decoded_outputs[0])
print("=" * 80)
```
-➡️ multi-audio + text instruction
+➡️ multi-audio + text instruction
+
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device
@@ -119,6 +121,7 @@ print("=" * 80)
```
➡️ multi-turn:
+
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device
@@ -173,6 +176,7 @@ print("=" * 80)
```
➡️ text only:
+
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device
@@ -208,6 +212,7 @@ print("=" * 80)
```
➡️ audio only:
+
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device
@@ -243,6 +248,7 @@ print("=" * 80)
```
➡️ batched inference!
+
```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device
diff --git a/docs/source/en/model_doc/wav2vec2-bert.md b/docs/source/en/model_doc/wav2vec2-bert.md
index 4edb67498aa..4a2c8de89c3 100644
--- a/docs/source/en/model_doc/wav2vec2-bert.md
+++ b/docs/source/en/model_doc/wav2vec2-bert.md
@@ -54,7 +54,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
- [`Wav2Vec2BertForSequenceClassification`] can be used by adapting this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification).
- See also: [Audio classification task guide](../tasks/audio_classification)
-
## Wav2Vec2BertConfig
[[autodoc]] Wav2Vec2BertConfig
diff --git a/docs/source/en/model_doc/wav2vec2-conformer.md b/docs/source/en/model_doc/wav2vec2-conformer.md
index e2a56b450df..663b6163011 100644
--- a/docs/source/en/model_doc/wav2vec2-conformer.md
+++ b/docs/source/en/model_doc/wav2vec2-conformer.md
@@ -38,7 +38,7 @@ Note: Meta (FAIR) released a new version of [Wav2Vec2-BERT 2.0](https://huggingf
- Wav2Vec2-Conformer follows the same architecture as Wav2Vec2, but replaces the *Attention*-block with a *Conformer*-block
as introduced in [Conformer: Convolution-augmented Transformer for Speech Recognition](https://huggingface.co/papers/2005.08100).
-- For the same number of layers, Wav2Vec2-Conformer requires more parameters than Wav2Vec2, but also yields
+- For the same number of layers, Wav2Vec2-Conformer requires more parameters than Wav2Vec2, but also yields
an improved word error rate.
- Wav2Vec2-Conformer uses the same tokenizer and feature extractor as Wav2Vec2.
- Wav2Vec2-Conformer can use either no relative position embeddings, Transformer-XL-like position embeddings, or
diff --git a/docs/source/en/model_doc/wav2vec2.md b/docs/source/en/model_doc/wav2vec2.md
index 6c4772f90bc..1f5f4a90576 100644
--- a/docs/source/en/model_doc/wav2vec2.md
+++ b/docs/source/en/model_doc/wav2vec2.md
@@ -80,13 +80,10 @@ model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h-lv60-self",
Below is an expected speedup diagram comparing the pure inference time between the native implementation in transformers of the `facebook/wav2vec2-large-960h-lv60-self` model and the flash-attention-2 and sdpa (scaled-dot-product-attention) versions. We show the average speedup obtained on the `librispeech_asr` `clean` validation split:
-
-
-
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Wav2Vec2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
diff --git a/docs/source/en/model_doc/wav2vec2_phoneme.md b/docs/source/en/model_doc/wav2vec2_phoneme.md
index fe989def3bd..c2621f8924c 100644
--- a/docs/source/en/model_doc/wav2vec2_phoneme.md
+++ b/docs/source/en/model_doc/wav2vec2_phoneme.md
@@ -53,7 +53,6 @@ The original code can be found [here](https://github.com/pytorch/fairseq/tree/ma
- By default, the model outputs a sequence of phonemes. In order to transform the phonemes to a sequence of words one
should make use of a dictionary and language model. A minimal sketch of obtaining the raw phoneme output is shown below.
-
Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model; for API reference, check out [`Wav2Vec2`](wav2vec2)'s documentation page.
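A minimal, hedged sketch of obtaining the raw phoneme output mentioned in the tip above. The checkpoint name is an assumption for illustration, and mapping phonemes to words (via a dictionary and language model) is intentionally not shown:

```python
from transformers import pipeline

# Assumed phoneme-level CTC checkpoint; expects 16 kHz mono audio.
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-lv-60-espeak-cv-ft")

result = asr("sample.flac")  # path to a local audio file
print(result["text"])        # a space-separated phoneme sequence, not words
```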
diff --git a/docs/source/en/model_doc/whisper.md b/docs/source/en/model_doc/whisper.md
index 673085ac3e7..5e19e870bdd 100644
--- a/docs/source/en/model_doc/whisper.md
+++ b/docs/source/en/model_doc/whisper.md
@@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2022-12-06 and added to Hugging Face Transformers on 2022-10-05.*
-

diff --git a/docs/source/en/model_doc/xcodec.md b/docs/source/en/model_doc/xcodec.md
index c4a0b92a26f..ca6d6e473fc 100644
--- a/docs/source/en/model_doc/xcodec.md
+++ b/docs/source/en/model_doc/xcodec.md
@@ -33,7 +33,7 @@ The X-Codec model is a neural audio codec that integrates semantic information f
The abstract of the paper states the following:
-*Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.*
+*Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.*
Model cards:
- [xcodec-hubert-librispeech](https://huggingface.co/hf-audio/xcodec-hubert-librispeech) (for speech)
@@ -46,12 +46,11 @@ This model was contributed by [Manal El Aidouni](https://huggingface.co/Manel).
Demos can be found on this [page](https://x-codec-audio.github.io/).
-
-## Usage example
+## Usage example
Here is a quick example of how to encode and decode an audio using this model:
-```python
+```python
from datasets import load_dataset, Audio
from transformers import XcodecModel, AutoFeatureExtractor
dummy_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
@@ -75,6 +74,7 @@ audio_values = decoder_outputs.audio_values
audio_values = model(inputs["input_values"]).audio_values
```
+
To listen to the original and reconstructed audio, run the snippet below and then open the generated `original.wav` and `reconstruction.wav` files in your music player to compare.
```python
@@ -88,12 +88,10 @@ sf.write("original.wav", original, sampling_rate)
sf.write("reconstruction.wav", reconstruction.T, sampling_rate)
```
-
## XcodecConfig
[[autodoc]] XcodecConfig
-
## XcodecModel
[[autodoc]] XcodecModel
diff --git a/docs/source/en/model_doc/xglm.md b/docs/source/en/model_doc/xglm.md
index 9a9170d29b7..370055c90ea 100644
--- a/docs/source/en/model_doc/xglm.md
+++ b/docs/source/en/model_doc/xglm.md
@@ -44,7 +44,6 @@ showing in particular that it enables cross-lingual in-context learning on some
on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models
in social value tasks such as hate speech detection in five languages and find it has limitations similar to comparable sized GPT-3 models.*
-
This model was contributed by [Suraj](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/xglm).
## Resources
@@ -67,7 +66,6 @@ This model was contributed by [Suraj](https://huggingface.co/valhalla). The orig
[[autodoc]] XGLMTokenizerFast
-
## XGLMModel
[[autodoc]] XGLMModel
diff --git a/docs/source/en/model_doc/xlm-prophetnet.md b/docs/source/en/model_doc/xlm-prophetnet.md
index 4dad4c0afa7..fbf47d8c422 100644
--- a/docs/source/en/model_doc/xlm-prophetnet.md
+++ b/docs/source/en/model_doc/xlm-prophetnet.md
@@ -41,7 +41,6 @@ You can do so by running the following command: `pip install -U transformers==4.
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten
-
## Overview
The XLM-ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](https://huggingface.co/papers/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei
diff --git a/docs/source/en/model_doc/xlm-roberta-xl.md b/docs/source/en/model_doc/xlm-roberta-xl.md
index 8ae33e8b286..5e1f0bbda28 100644
--- a/docs/source/en/model_doc/xlm-roberta-xl.md
+++ b/docs/source/en/model_doc/xlm-roberta-xl.md
@@ -77,6 +77,7 @@ predicted_token = tokenizer.decode(predicted_token_id)
print(f"The predicted token is: {predicted_token}")
```
+
@@ -84,6 +85,7 @@ print(f"The predicted token is: {predicted_token}")
```bash
echo -e "Plants create through a process known as photosynthesis." | transformers run --task fill-mask --model facebook/xlm-roberta-xl --device 0
```
+
diff --git a/docs/source/en/model_doc/xlm-roberta.md b/docs/source/en/model_doc/xlm-roberta.md
index 65468a786a0..0e986763689 100644
--- a/docs/source/en/model_doc/xlm-roberta.md
+++ b/docs/source/en/model_doc/xlm-roberta.md
@@ -87,6 +87,7 @@ print(f"The predicted token is: {predicted_token}")
```bash
echo -e "Plants create
through a process known as photosynthesis." | transformers run --task fill-mask --model FacebookAI/xlm-roberta-base --device 0
```
+
diff --git a/docs/source/en/model_doc/xlm.md b/docs/source/en/model_doc/xlm.md
index b4d84c791f5..ff8f8c46024 100644
--- a/docs/source/en/model_doc/xlm.md
+++ b/docs/source/en/model_doc/xlm.md
@@ -79,6 +79,7 @@ print(f"Predicted token: {predicted_token}")
```bash
echo -e "Plants create through a process known as photosynthesis." | transformers run --task fill-mask --model FacebookAI/xlm-mlm-en-2048 --device 0
```
+
diff --git a/docs/source/en/model_doc/xlstm.md b/docs/source/en/model_doc/xlstm.md
index b239d631fbb..e1ba3195ecc 100644
--- a/docs/source/en/model_doc/xlstm.md
+++ b/docs/source/en/model_doc/xlstm.md
@@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2024-05-07 and added to Hugging Face Transformers on 2025-07-25.*
-
# xLSTM
## Overview
@@ -32,7 +31,6 @@ The abstract from the paper is the following:
This model was contributed by [NX-AI](https://huggingface.co/NX-AI).
The original code can be found [here](https://github.com/NX-AI/xlstm).
-
## xLSTMConfig
[[autodoc]] xLSTMConfig
diff --git a/docs/source/en/model_doc/yolos.md b/docs/source/en/model_doc/yolos.md
index 5c31b539e59..666f9674332 100644
--- a/docs/source/en/model_doc/yolos.md
+++ b/docs/source/en/model_doc/yolos.md
@@ -26,14 +26,12 @@ rendered properly in your Markdown viewer.
[YOLOS](https://huggingface.co/papers/2106.00666) uses a [Vision Transformer (ViT)](./vit) for object detection with minimal modifications and region priors. It can achieve performance comparable to specialized object detection models and frameworks with knowledge about 2D spatial structures.
-
You can find all the original YOLOS checkpoints under the [HUST Vision Lab](https://huggingface.co/hustvl/models?search=yolos) organization.
YOLOS architecture. Taken from the original paper.
-
> [!TIP]
> This model was contributed by [nielsr](https://huggingface.co/nielsr).
> Click on the YOLOS models in the right sidebar for more examples of how to apply YOLOS to different object detection tasks.
@@ -98,7 +96,6 @@ for score, label, box in zip(filtered_scores, filtered_labels, pixel_boxes):
-
## Notes
- Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](./detr), YOLOS doesn't require a `pixel_mask`.
diff --git a/docs/source/en/model_doc/yoso.md b/docs/source/en/model_doc/yoso.md
index f07e5aba082..8e121dd88cd 100644
--- a/docs/source/en/model_doc/yoso.md
+++ b/docs/source/en/model_doc/yoso.md
@@ -26,20 +26,20 @@ rendered properly in your Markdown viewer.
The YOSO model was proposed in [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://huggingface.co/papers/2111.09714)
by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. YOSO approximates standard softmax self-attention
via a Bernoulli sampling scheme based on Locality Sensitive Hashing (LSH). In principle, all the Bernoulli random variables can be sampled with
-a single hash.
+a single hash.
The abstract from the paper is the following:
-*Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is
-the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically
-on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling
-attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity of such models to linear.
-We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random
-variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant).
-This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of
-LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence
-length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark,
-for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable
+*Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is
+the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically
+on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling
+attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity of such models to linear.
+We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random
+variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant).
+This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of
+LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence
+length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark,
+for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable
speed-ups and memory savings and often outperforms other efficient self-attention methods. Our code is available at this https URL*
This model was contributed by [novice03](https://huggingface.co/novice03). The original code can be found [here](https://github.com/mlpen/YOSO).
@@ -50,12 +50,12 @@ This model was contributed by [novice03](https://huggingface.co/novice03). The o
in parallel on a GPU.
- The kernels provide a `fast_hash` function, which approximates the random projections of the queries and keys using the Fast Hadamard Transform. Using these
hash codes, the `lsh_cumulation` function approximates self-attention via LSH-based Bernoulli sampling.
-- To use the custom kernels, the user should set `config.use_expectation = False`. To ensure that the kernels are compiled successfully,
-the user must install the correct version of PyTorch and cudatoolkit. By default, `config.use_expectation = True`, which uses YOSO-E and
+- To use the custom kernels, the user should set `config.use_expectation = False`. To ensure that the kernels are compiled successfully,
+the user must install the correct version of PyTorch and cudatoolkit. By default, `config.use_expectation = True`, which uses YOSO-E and
does not require compiling CUDA kernels (a minimal sketch is shown below).
+alt="drawing" width="600"/>
YOSO Attention Algorithm. Taken from the original paper.
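As a companion to the kernel note above, here is a minimal, hedged sketch of switching off the expectation-based path. The checkpoint name is an assumption for illustration, and a matching PyTorch/cudatoolkit install plus a CUDA device are assumed:

```python
from transformers import YosoConfig, YosoModel

ckpt = "uw-madison/yoso-4096"  # assumed checkpoint for illustration
config = YosoConfig.from_pretrained(ckpt)
config.use_expectation = False  # opt into the custom LSH kernels; the default True uses YOSO-E and needs no CUDA compilation

model = YosoModel.from_pretrained(ckpt, config=config).to("cuda")
```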
diff --git a/docs/source/en/model_doc/zamba.md b/docs/source/en/model_doc/zamba.md
index bb974080770..635bc76fb0c 100644
--- a/docs/source/en/model_doc/zamba.md
+++ b/docs/source/en/model_doc/zamba.md
@@ -24,7 +24,6 @@ rendered properly in your Markdown viewer.
This model was contributed by [pglo](https://huggingface.co/pglo).
-
## Model details
Zamba-7B-v1 is a hybrid between state-space models (Specifically [Mamba](https://github.com/state-spaces/mamba)) and transformer, and was trained using next-token prediction. Zamba uses a shared transformer layer after every 6 mamba blocks. It uses the [Mistral v0.1 tokenizer](https://huggingface.co/mistralai/Mistral-7B-v0.1). We came to this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data.
@@ -33,23 +32,24 @@ Zamba-7B-v1 is a hybrid between state-space models (Specifically [Mamba](https:/
## Quick start
-
### Prerequisites
Zamba requires you use `transformers` version 4.46.0 or higher:
+
```bash
pip install transformers>=4.46.0
```
In order to run optimized Mamba implementations, you first need to install `mamba-ssm` and `causal-conv1d`:
+
```bash
pip install mamba-ssm causal-conv1d>=1.2.0
```
+
You also have to have the model on a CUDA device.
You can run the model without the optimized Mamba kernels, but it is **not** recommended as it will result in significantly higher latencies. In order to do that, you'll need to specify `use_mamba_kernels=False` when loading the model.
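A minimal, hedged sketch of the fallback described above; the checkpoint name is taken from the model card list on this page, and noticeably slower generation should be expected without the kernels:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# use_mamba_kernels=False is forwarded to the config and disables the mamba-ssm/causal-conv1d path.
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba-7B-v1")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1", use_mamba_kernels=False)
```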
-
## Inference
```python
@@ -66,39 +66,32 @@ outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
-
## Model card
The model cards can be found at:
* [Zamba-7B](https://huggingface.co/Zyphra/Zamba-7B-v1)
-
## Issues
For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/Zyphra/Zamba-7B-v1/discussions)
-
## License
The model weights are open-sourced via an Apache 2.0 license.
-
## ZambaConfig
[[autodoc]] ZambaConfig
-
## ZambaModel
[[autodoc]] ZambaModel
- forward
-
## ZambaForCausalLM
[[autodoc]] ZambaForCausalLM
- forward
-
## ZambaForSequenceClassification
[[autodoc]] transformers.ZambaForSequenceClassification
diff --git a/docs/source/en/model_doc/zamba2.md b/docs/source/en/model_doc/zamba2.md
index ba4324366a9..7296ef1b250 100644
--- a/docs/source/en/model_doc/zamba2.md
+++ b/docs/source/en/model_doc/zamba2.md
@@ -26,7 +26,6 @@ rendered properly in your Markdown viewer.
This model was contributed by [pglo](https://huggingface.co/pglo).
-
## Model details
[Zamba2-1.2B](https://www.zyphra.com/post/zamba2-mini), [Zamba2-2.7B](https://www.zyphra.com/post/zamba2-small) and [Zamba2-7B](https://www.zyphra.com/post/zamba2-7b) are hybrid models combining state-space models (Specifically [Mamba2](https://github.com/state-spaces/mamba)) and transformer, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 mamba blocks. It uses the [Mistral v0.1 tokenizer](https://huggingface.co/mistralai/Mistral-7B-v0.1). We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
@@ -35,10 +34,10 @@ This model was contributed by [pglo](https://huggingface.co/pglo).
## Quick start
-
### Prerequisites
Zamba2 requires you use `transformers` version 4.48.0 or higher:
+
```bash
pip install transformers>=4.48.0
```
@@ -59,7 +58,6 @@ outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
-
## Model card
The model cards can be found at:
@@ -67,33 +65,27 @@ The model cards can be found at:
* [Zamba2-2.7B](https://huggingface.co/Zyphra/Zamba2-2.7B)
* [Zamba2-7B](https://huggingface.co/Zyphra/Zamba2-7B)
-
## Issues
For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/Zyphra/Zamba2-7B/discussions)
-
## License
The model weights are open-sourced via an Apache 2.0 license.
-
## Zamba2Config
[[autodoc]] Zamba2Config
-
## Zamba2Model
[[autodoc]] Zamba2Model
- forward
-
## Zamba2ForCausalLM
[[autodoc]] Zamba2ForCausalLM
- forward
-
## Zamba2ForSequenceClassification
[[autodoc]] transformers.Zamba2ForSequenceClassification
diff --git a/docs/source/en/model_doc/zoedepth.md b/docs/source/en/model_doc/zoedepth.md
index 367c630a322..5252d2b4d36 100644
--- a/docs/source/en/model_doc/zoedepth.md
+++ b/docs/source/en/model_doc/zoedepth.md
@@ -15,7 +15,6 @@ rendered properly in your Markdown viewer.
-->
*This model was released on 2023-02-23 and added to Hugging Face Transformers on 2024-07-08.*
-

@@ -97,6 +96,7 @@ Image.fromarray(depth.astype("uint8"))
## Notes
- In the [original implementation](https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/models/depth_model.py#L131) ZoeDepth performs inference on both the original and flipped images and averages the results. The `post_process_depth_estimation` function handles this by passing the flipped outputs to the optional `outputs_flipped` argument as shown below.
+
```py
with torch.no_grad():
outputs = model(pixel_values)
@@ -107,7 +107,7 @@ Image.fromarray(depth.astype("uint8"))
outputs_flipped=outputs_flipped,
)
```
-
+
## Resources
- Refer to this [notebook](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ZoeDepth) for an inference example.
diff --git a/docs/source/en/model_memory_anatomy.md b/docs/source/en/model_memory_anatomy.md
index 2c6162ed1ca..9b2e4b4b622 100644
--- a/docs/source/en/model_memory_anatomy.md
+++ b/docs/source/en/model_memory_anatomy.md
@@ -16,24 +16,23 @@ limitations under the License.
# Model training anatomy
-To understand performance optimization techniques that one can apply to improve efficiency of model training
-speed and memory utilization, it's helpful to get familiar with how GPU is utilized during training, and how compute
+To understand the performance optimization techniques you can apply to improve the speed and memory efficiency of model
+training, it's helpful to get familiar with how the GPU is utilized during training, and how compute
intensity varies depending on the operation performed.
-Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration,
-we'll need to install a few libraries:
+Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration,
+we'll need to install a few libraries:
```bash
pip install transformers datasets accelerate nvidia-ml-py
```
-The `nvidia-ml-py` library allows us to monitor the memory usage of the models from within Python. You might be familiar
+The `nvidia-ml-py` library allows us to monitor the memory usage of the models from within Python. You might be familiar
with the `nvidia-smi` command in the terminal - this library allows us to access the same information directly in Python.
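As a minimal sketch of how this looks with `pynvml` (the module installed by `nvidia-ml-py`):

```py
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # first GPU
info = nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory occupied: {info.used // 1024**2} MB.")
```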
-Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier.
+Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier.
In total, we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format.
-
```py
>>> import numpy as np
>>> from datasets import Dataset
@@ -74,9 +73,9 @@ Let's verify that we start with a free GPU memory:
GPU memory occupied: 0 MB.
```
-That looks good: the GPU memory is not occupied as we would expect before we load any models. If that's not the case on
-your machine make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by
-the user. When a model is loaded to the GPU the kernels are also loaded, which can take up 1-2GB of memory. To see how
+That looks good: the GPU memory is not occupied as we would expect before we load any models. If that's not the case on
+your machine, make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by
+the user. When a model is loaded to the GPU, the kernels are also loaded, which can take up 1-2GB of memory. To see how
much it is, we load a tiny tensor into the GPU, which triggers the kernels to be loaded as well.
```py
@@ -92,10 +91,9 @@ We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how muc
## Load Model
-First, we load the `google-bert/bert-large-uncased` model. We load the model weights directly to the GPU so that we can check
+First, we load the `google-bert/bert-large-uncased` model. We load the model weights directly to the GPU so that we can check
how much space just the weights use.
-
```py
>>> from transformers import AutoModelForSequenceClassification
@@ -105,12 +103,11 @@ how much space just the weights use.
GPU memory occupied: 2631 MB.
```
-We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific
-GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an
-optimized fashion that speeds up the usage of the model. Now we can also quickly check if we get the same result
+We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific
+GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an
+optimized fashion that speeds up the usage of the model. Now we can also quickly check if we get the same result
as with `nvidia-smi` CLI:
-
```bash
nvidia-smi
```
@@ -138,8 +135,8 @@ Tue Jan 11 08:58:05 2022
+-----------------------------------------------------------------------------+
```
-We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. So now we can
-start training the model and see how the GPU memory consumption changes. First, we set up a few standard training
+We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. So now we can
+start training the model and see how the GPU memory consumption changes. First, we set up a few standard training
arguments:
```py
@@ -154,7 +151,7 @@ default_args = {
- If you plan to run multiple experiments, in order to properly clear the memory between experiments, restart the Python
+ If you plan to run multiple experiments, restart the Python
kernel between experiments in order to properly clear the memory.
@@ -181,9 +178,9 @@ Samples/second: 8.86
GPU memory occupied: 14949 MB.
```
-We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size
+We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size
can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our
-model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model.
+model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model.
To understand a bit better why this is the case, let's have a look at a model's operations and memory needs.
## Anatomy of Model's Operations
@@ -206,10 +203,9 @@ This knowledge can be helpful to know when analyzing performance bottlenecks.
This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://huggingface.co/papers/2007.00072)
-
## Anatomy of Model's Memory
-We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there
+We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there
are many components during training that use GPU memory. The components on GPU memory are the following:
1. model weights
@@ -219,8 +215,8 @@ are many components during training that use GPU memory. The components on GPU m
5. temporary buffers
6. functionality-specific memory
-A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For
-inference there are no optimizer states and gradients, so we can subtract those. And thus we end up with 6 bytes per
+A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For
+inference there are no optimizer states or gradients, so we can subtract those and end up with 6 bytes per
model parameter for mixed precision inference, plus activation memory.
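As a rough back-of-the-envelope check (the parameter count below is an approximation for the `google-bert/bert-large-uncased` model used earlier):

```py
num_params = 340_000_000  # roughly the size of google-bert/bert-large-uncased

# Mixed-precision training with AdamW: ~18 bytes per parameter (before activations)
print(f"Training: ~{num_params * 18 / 1024**3:.1f} GB + activations")
# Mixed-precision inference: ~6 bytes per parameter (before activations)
print(f"Inference: ~{num_params * 6 / 1024**3:.1f} GB + activations")
```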
Let's look at the details.
@@ -244,29 +240,29 @@ Let's look at the details.
- size depends on many factors, the key ones being sequence length, hidden size and batch size.
-There are the input and output that are being passed and returned by the forward and the backward functions and the
+These are the inputs and outputs being passed to and returned by the forward and backward functions, as well as the
forward activations saved for gradient computation.
**Temporary Memory**
-Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the
-moment these could require additional memory and could push to OOM. Therefore, when coding it's crucial to think
+Additionally, there are all kinds of temporary variables which are released once the calculation is done, but in the
+moment they can require additional memory and push the run out of memory (OOM). Therefore, when coding, it's crucial to think
strategically about such temporary variables and sometimes to explicitly free them as soon as they are no longer needed.
**Functionality-specific memory**
-Then, your software could have special memory needs. For example, when generating text using beam search, the software
+Then, your software could have special memory needs. For example, when generating text using beam search, the software
needs to maintain multiple copies of inputs and outputs.
**`forward` vs `backward` Execution Speed**
-For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates
-into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually
-bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward
-(e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward,
+For convolutions and linear layers there are 2x the flops in the backward pass compared to the forward, which generally translates
+into a ~2x slower backward pass (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually
+bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward
+(e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward,
and writes once, gradInput).
-As you can see, there are potentially a few places where we could save GPU memory or speed up operations.
-Now that you understand what affects GPU utilization and computation speed, refer to
-the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about
-performance optimization techniques.
+As you can see, there are potentially a few places where we could save GPU memory or speed up operations.
+Now that you understand what affects GPU utilization and computation speed, refer to
+the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about
+performance optimization techniques.
diff --git a/docs/source/en/models.md b/docs/source/en/models.md
index fdfcfba6585..ae5572c0c77 100644
--- a/docs/source/en/models.md
+++ b/docs/source/en/models.md
@@ -45,7 +45,6 @@ There are two general types of models you can load:
1. A barebones model, like [`AutoModel`] or [`LlamaModel`], that outputs hidden states.
2. A model with a specific *head* attached, like [`AutoModelForCausalLM`] or [`LlamaForCausalLM`], for performing specific tasks.
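For example, both kinds can be loaded from the same checkpoint (a small checkpoint such as `openai-community/gpt2` is used here purely for illustration):

```py
from transformers import AutoModel, AutoModelForCausalLM

backbone = AutoModel.from_pretrained("openai-community/gpt2")             # outputs hidden states only
lm_model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")  # adds a language modeling head
```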
-
## Model classes
To get a pretrained model, you need to load the weights into the model. This is done by calling [`~PreTrainedModel.from_pretrained`] which accepts weights from the Hugging Face Hub or a local directory.
@@ -111,7 +110,6 @@ You need enough memory to hold two copies of the model weights (random and pretr
Transformers reduces some of these memory-related challenges with fast initialization, sharded checkpoints, Accelerate's [Big Model Inference](https://hf.co/docs/accelerate/usage_guides/big_modeling) feature, and supporting lower bit data types.
-
### Sharded checkpoints
The [`~PreTrainedModel.save_pretrained`] method automatically shards checkpoints larger than 10GB.
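The shard size can also be controlled explicitly; a minimal sketch (the directory name and `max_shard_size` value are arbitrary):

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
# Cap each shard at ~200MB so even this small checkpoint is split into several files
model.save_pretrained("./gpt2-sharded", max_shard_size="200MB")
# Reloading resolves the shards automatically
reloaded = AutoModelForCausalLM.from_pretrained("./gpt2-sharded")
```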
diff --git a/docs/source/en/perf_train_gaudi.md b/docs/source/en/perf_train_gaudi.md
index 2ba792d484a..1ab8957f9d7 100644
--- a/docs/source/en/perf_train_gaudi.md
+++ b/docs/source/en/perf_train_gaudi.md
@@ -20,14 +20,17 @@ The Intel Gaudi AI accelerator family includes [Intel Gaudi 1](https://habana.ai
[`TrainingArguments`], [`Trainer`] and [`Pipeline`] detect and set the backend device to `hpu` if an Intel Gaudi device is available. No additional changes are required to enable training and inference on your device.
Some modeling code in Transformers is not optimized for HPU lazy mode. If you encounter any errors, set the environment variable below to use eager mode:
+
```
PT_HPU_LAZY_MODE=0
```
In some cases, you'll also need to enable int64 support to avoid casting issues with long integers:
+
```
PT_ENABLE_INT64_SUPPORT=1
```
+
Refer to the [Gaudi docs](https://docs.habana.ai/en/latest/index.html) for more details.
> [!TIP]
diff --git a/docs/source/en/pipeline_webserver.md b/docs/source/en/pipeline_webserver.md
index 0112d116c47..37d245483b9 100644
--- a/docs/source/en/pipeline_webserver.md
+++ b/docs/source/en/pipeline_webserver.md
@@ -82,6 +82,7 @@ Query the server with a POST request.
```bash
curl -X POST -d "Paris is the [MASK] of France." http://localhost:8000/
```
+
This should return the output below.
```bash
diff --git a/docs/source/en/pr_checks.md b/docs/source/en/pr_checks.md
index a5634c29ee4..7056adf2149 100644
--- a/docs/source/en/pr_checks.md
+++ b/docs/source/en/pr_checks.md
@@ -52,7 +52,6 @@ or for an editable install:
pip install -e .[quality]
```
-
## Tests
All the jobs that begin with `ci/circleci: run_tests_` run parts of the Transformers testing suite. Each of those jobs focuses on a part of the library in a certain environment: for instance `ci/circleci: run_tests_pipelines` runs the pipeline tests in an environment where all pipeline-related requirements are installed.
diff --git a/docs/source/en/quantization/auto_round.md b/docs/source/en/quantization/auto_round.md
index 15abf9faa84..7526597ee86 100644
--- a/docs/source/en/quantization/auto_round.md
+++ b/docs/source/en/quantization/auto_round.md
@@ -11,18 +11,17 @@ rendered properly in your Markdown viewer.
# AutoRound
-[AutoRound](https://github.com/intel/auto-round) is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision.
-It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps. Designed for broad compatibility, it seamlessly supports a wide range of LLMs and is actively expanding to cover more VLMs as well.
+[AutoRound](https://github.com/intel/auto-round) is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision.
+It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps. Designed for broad compatibility, it seamlessly supports a wide range of LLMs and is actively expanding to cover more VLMs as well.
It also supports quantization and inference across multiple hardware platforms, including CPU, XPU, and CUDA.
-AutoRound also offers a variety of useful features, including mixed-bit tuning and inference, lm-head quantization, support for exporting to formats like GPTQ/AWQ/GGUF, and flexible tuning recipes.
+AutoRound also offers a variety of useful features, including mixed-bit tuning and inference, lm-head quantization, support for exporting to formats like GPTQ/AWQ/GGUF, and flexible tuning recipes.
For a comprehensive overview and the latest updates, check out the AutoRound [README](https://github.com/intel/auto-round).
-AutoRound was originally developed as part of the [Intel Neural Compressor](https://github.com/intel/neural-compressor), serving as a general-purpose model compression library for deep learning.
-It has since evolved into a standalone library focused specifically on low-precision optimization for large language models (LLMs).
+AutoRound was originally developed as part of the [Intel Neural Compressor](https://github.com/intel/neural-compressor), serving as a general-purpose model compression library for deep learning.
+It has since evolved into a standalone library focused specifically on low-precision optimization for large language models (LLMs).
AutoRound remains fully integrated with the Intel Neural Compressor, and you can explore the repository for more details.
-
## Installation
```bash
@@ -51,6 +50,7 @@ Currently, only offline mode is supported to generate quantized models.
### Command Line Usage
+
```bash
auto-round \
--model facebook/opt-125m \
@@ -59,7 +59,7 @@ auto-round \
--output_dir ./tmp_autoround
```
-AutoRound also offer another two recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively.
+AutoRound also offers two additional recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively.
For 2 bits, we recommend using `auto-round-best` or `auto-round`.
@@ -99,6 +99,7 @@ autoround.quantize_and_save(output_dir, format='auto_round')
### AutoRoundBest recipe
This setting provides the best accuracy in most scenarios but is 4–5× slower than the standard AutoRound recipe. It is especially recommended for 2-bit quantization and is a good choice if sufficient resources are available.
+
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
@@ -121,6 +122,7 @@ autoround = AutoRound(
output_dir = "./tmp_autoround"
autoround.quantize_and_save(output_dir, format='auto_round')
```
+
@@ -230,7 +232,7 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=Fal
AutoRound automatically selects the backend for each layer based on compatibility. In general, the priority order is Marlin > ExLLaMAV2 > Triton, but the final choice depends on factors such as group size, bit width, packing format, hardware device, and other implementation details. For more details, please refer to [backends](https://github.com/intel/auto-round?tab=readme-ov-file#specify-backend).
-The backend may not always be the most suitable for certain devices.
+The backend may not always be the most suitable for certain devices.
You can specify your preferred backend, such as "ipex" for CPU, "ipex/triton" for XPU, or "marlin/exllamav2/triton" for CUDA, according to your needs or hardware compatibility. Please note that additional corresponding libraries may be required.
```python
@@ -247,7 +249,6 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=Fal
-
### Convert GPTQ/AWQ to AutoRound
@@ -277,7 +278,6 @@ the [transformers](https://github.com/huggingface/transformers/issues) repositor
If you encounter any issues with auto-round, please open an issue on
the [AutoRound](https://github.com/intel/auto-round/issues) repository.
-
## Acknowledgement
Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
diff --git a/docs/source/en/quantization/awq.md b/docs/source/en/quantization/awq.md
index b6437e2588a..b2cf4b9ecdf 100644
--- a/docs/source/en/quantization/awq.md
+++ b/docs/source/en/quantization/awq.md
@@ -25,6 +25,7 @@ Run the command below to install autoawq
```bash
pip install autoawq
```
+
> [!WARNING]
> AutoAWQ downgrades Transformers to version 4.47.1. If you want to do inference with AutoAWQ, you may need to reinstall your Transformers' version after installing AutoAWQ.
diff --git a/docs/source/en/quantization/bitsandbytes.md b/docs/source/en/quantization/bitsandbytes.md
index 60c3c2dfebf..9cdbbe5af39 100644
--- a/docs/source/en/quantization/bitsandbytes.md
+++ b/docs/source/en/quantization/bitsandbytes.md
@@ -32,12 +32,12 @@ bitsandbytes offers two main quantization features:
> **Note:** For a user-friendly quantization experience, you can use the `bitsandbytes` [community space](https://huggingface.co/spaces/bnb-community/bnb-my-repo).
-
Run the command below to install bitsandbytes.
```bash
pip install --upgrade transformers accelerate bitsandbytes
```
+
To compile from source, follow the instructions in the [bitsandbytes installation guide](https://huggingface.co/docs/bitsandbytes/main/en/installation).
## Hardware Compatibility
@@ -116,6 +116,7 @@ model = AutoModelForCausalLM.from_pretrained(
model.push_to_hub("bloom-560m-8bit")
```
+
@@ -166,6 +167,7 @@ model = AutoModelForCausalLM.from_pretrained(
model.push_to_hub("bloom-560m-4bit")
```
+
diff --git a/docs/source/en/quantization/compressed_tensors.md b/docs/source/en/quantization/compressed_tensors.md
index a3b01a1b448..3c047d0af98 100644
--- a/docs/source/en/quantization/compressed_tensors.md
+++ b/docs/source/en/quantization/compressed_tensors.md
@@ -99,29 +99,29 @@ For a more detailed look at the model weights, use the [safetensors viewer](http
| Tensors | Shape | Precision |
| ------- | ----- | --------- |
-model.layers.0.input_layernorm.weight | [4 096] | BF16
-model.layers.0.mlp.down_proj.input_scale | [1] | BF16
-model.layers.0.mlp.down_proj.weight | [4 096, 14 336] | F8_E4M3
-model.layers.0.mlp.down_proj.weight_scale | [1] | BF16
-model.layers.0.mlp.gate_proj.input_scale | [1] | BF16
-model.layers.0.mlp.gate_proj.weight | [14 336, 4 096] | F8_E4M3
-model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16
-model.layers.0.mlp.up_proj.input_scale| [1] |BF16
-model.layers.0.mlp.up_proj.weight | [14 336, 4 096] | F8_E4M3
-model.layers.0.mlp.up_proj.weight_scale | [1] | BF16
-model.layers.0.post_attention_layernorm.weight | [4 096] |BF16
+model.layers.0.input_layernorm.weight | [4 096] | BF16
+model.layers.0.mlp.down_proj.input_scale | [1] | BF16
+model.layers.0.mlp.down_proj.weight | [4 096, 14 336] | F8_E4M3
+model.layers.0.mlp.down_proj.weight_scale | [1] | BF16
+model.layers.0.mlp.gate_proj.input_scale | [1] | BF16
+model.layers.0.mlp.gate_proj.weight | [14 336, 4 096] | F8_E4M3
+model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16
+model.layers.0.mlp.up_proj.input_scale| [1] |BF16
+model.layers.0.mlp.up_proj.weight | [14 336, 4 096] | F8_E4M3
+model.layers.0.mlp.up_proj.weight_scale | [1] | BF16
+model.layers.0.post_attention_layernorm.weight | [4 096] |BF16
model.layers.0.self_attn.k_proj.input_scale | [1] | BF16
model.layers.0.self_attn.k_proj.weight | [1 024, 4 096]| F8_E4M3
-model.layers.0.self_attn.k_proj.weight_scale |[1] | BF16
+model.layers.0.self_attn.k_proj.weight_scale |[1] | BF16
model.layers.0.self_attn.o_proj.input_scale | [1] | BF16
-model.layers.0.self_attn.o_proj.weight | [4 096, 4 096] | F8_E4M3
-model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16
-model.layers.0.self_attn.q_proj.input_scale | [1] | BF16
-model.layers.0.self_attn.q_proj.weight | [4 096, 4 096] | F8_E4M3
-model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16
-model.layers.0.self_attn.v_proj.input_scale | [1] | BF16
-model.layers.0.self_attn.v_proj.weight | [1 024, 4 096] | F8_E4M3
-model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16
+model.layers.0.self_attn.o_proj.weight | [4 096, 4 096] | F8_E4M3
+model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16
+model.layers.0.self_attn.q_proj.input_scale | [1] | BF16
+model.layers.0.self_attn.q_proj.weight | [4 096, 4 096] | F8_E4M3
+model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16
+model.layers.0.self_attn.v_proj.input_scale | [1] | BF16
+model.layers.0.self_attn.v_proj.weight | [1 024, 4 096] | F8_E4M3
+model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16
When loading a compressed-tensors model with the [`~quantizers.HFQuantizer`] integration, all the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules specified in the quantization config are replaced by [CompressedLinear](https://github.com/neuralmagic/compressed-tensors/blob/975cb223b19fcac2b98a4271d17668462d4d6e1d/src/compressed_tensors/linear/compressed_linear.py#L30) modules that manage the compressed weights and forward pass for inference. The `lm_head` module is still kept as an unquantized nn.Linear module.
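As a rough sketch of what this looks like after loading (the repository id is a placeholder for any compressed-tensors FP8 checkpoint, and the module path assumes a Llama-style layout like the one in the table above):

```python
from transformers import AutoModelForCausalLM

# Placeholder repo id: substitute a real compressed-tensors checkpoint
model = AutoModelForCausalLM.from_pretrained("<org>/<compressed-fp8-model>")

# Quantized projections should show up as CompressedLinear modules...
print(type(model.model.layers[0].mlp.down_proj).__name__)
# ...while the lm_head stays a plain nn.Linear
print(type(model.lm_head).__name__)
```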
diff --git a/docs/source/en/quantization/concept_guide.md b/docs/source/en/quantization/concept_guide.md
index ff300b9d48a..e9d3b451484 100644
--- a/docs/source/en/quantization/concept_guide.md
+++ b/docs/source/en/quantization/concept_guide.md
@@ -18,7 +18,6 @@ rendered properly in your Markdown viewer.
Quantization reduces the memory footprint and computational cost of large machine learning models like those found in the Transformers library. It achieves this by representing the model's weights and/or activations with lower-precision data types (like 8-bit integers or int8) instead of the standard 32-bit floating-point (float32).
-
Reducing a model's precision offers several significant benefits:
- Smaller model size: Lower-precision data types require less storage space. An int8 model, for example, is roughly 4 times smaller than its float32 counterpart.
@@ -46,8 +45,7 @@ The most common method is *affine quantization*. For a given float32 tensor (lik
There are two main ways to perform this mapping, *symmetric* and *asymmetric*. The choice between symmetric and asymmetric quantization determines how the float32 range is mapped to the int8 range.
- Symmetric: This method assumes the original float32 range is symmetric around zero ( \\([ -a, a ]\\) ). This range is mapped symmetrically to the int8 range, for example, \\([-127, 127]\\). A key characteristic is that the float32 value \\(0.0\\) maps directly to the int8 value \\(0\\). This only requires one parameter, the **scale ( \\(S\\) )**, to define the mapping. It can simplify computations, but it might be less accurate if the original data distribution isn't naturally centered around zero.
-- Asymmetric (Affine): This method does not assume the data is centered around zero. It maps the exact range \\([val_{min}, val_{max}]\\) from float32 to the full int8 range, like \\([-128, 127]\\). This requires two parameters, a **scale ( \\(S\\) )** and a **zero-point ( \\(Z\\) )**.
-
+- Asymmetric (Affine): This method does not assume the data is centered around zero. It maps the exact range \\([val_{min}, val_{max}]\\) from float32 to the full int8 range, like \\([-128, 127]\\). This requires two parameters, a **scale ( \\(S\\) )** and a **zero-point ( \\(Z\\) )**.
scale ( \\(S\\) ): A positive float32 number representing the ratio between the float32 and the int8 range.
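As a small numeric sketch of the asymmetric (affine) mapping described above (made-up values, plain NumPy rather than any particular backend):

```python
import numpy as np

x = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)  # made-up float32 values

qmin, qmax = -128, 127                       # int8 range
scale = (x.max() - x.min()) / (qmax - qmin)  # S: ratio of the float32 range to the int8 range
zero_point = round(qmin - x.min() / scale)   # Z: int8 value that represents float32 0.0

x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)  # quantize
x_dq = (x_q.astype(np.float32) - zero_point) * scale                         # dequantize (approximates x)
```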
@@ -134,8 +132,7 @@ There are two main types of quantization techniques.
## Quantization in Transformers
-Transformers integrates several quantization backends such as bitsandbytes, torchao, compressed-tensors, and more (refer to the quantization [overview](./overview) for more backends).
-
+Transformers integrates several quantization backends such as bitsandbytes, torchao, compressed-tensors, and more (refer to the quantization [overview](./overview) for more backends).
All backends are unified under the [`HfQuantizer`] API and associated [`QuantizationConfig`] classes. You can integrate your own custom quantization backends by implementing a custom [`HfQuantizer`] and [`QuantizationConfig`], as shown in the [Contribution](./contribute) guide.
@@ -165,7 +162,6 @@ model = AutoModelForCausalLM.from_pretrained(
)
```
-
## Resources
To explore quantization and related performance optimization concepts more deeply, check out the following resources.
diff --git a/docs/source/en/quantization/mxfp4.md b/docs/source/en/quantization/mxfp4.md
index a2b9f7634c8..dd313c5555e 100644
--- a/docs/source/en/quantization/mxfp4.md
+++ b/docs/source/en/quantization/mxfp4.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# MXFP4
-Note: MXFP4 quantisation currently only works for OpenAI GPT-OSS 120b and 20b.
+Note: MXFP4 quantization currently only works for OpenAI GPT-OSS 120b and 20b.
MXFP4 is a 4-bit floating point format that dramatically reduces the memory requirements of large models. Large models (GPT-OSS-120B) can fit on a single 80GB GPU and smaller models (GPT-OSS-20B) only require 16GB of memory. It uses blockwise scaling to preserve its range and accuracy, which typically degrade at lower precisions.
@@ -25,7 +25,6 @@ To use MXPF4, make sure your hardware meets the following requirements.
- Install Accelerate, kernels, and Triton ≥ 3.4. Only manually install Triton ≥ 3.4 if you're using PyTorch 2.7 because it is already supported in PyTorch 2.8.
- NVIDIA GPU Compute Capability ≥ 7.5, which includes Turing-generation GPUs (such as the Tesla T4) and newer. Use [get_device_capability](https://docs.pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html) to check Compute Capability.
-
```python
from torch import cuda
cuda.get_device_capability()
@@ -54,7 +53,6 @@ print(cfg.quantization_config)
# }
```
-
## MXFP4 kernels
Transformers automatically pulls the MXFP4-aware Triton kernels from the community repository when you load a model that needs them. The kernels are stored in your local cache and used during the forward pass.
@@ -67,7 +65,6 @@ You can use [hf cache scan](https://huggingface.co/docs/huggingface_hub/en/guide
hf cache scan
```
-
```shell
REPO ID REPO TYPE SIZE ON DISK
-------------------------------- --------- ------------
diff --git a/docs/source/en/quantization/overview.md b/docs/source/en/quantization/overview.md
index ceab195b2b5..d607ae44660 100644
--- a/docs/source/en/quantization/overview.md
+++ b/docs/source/en/quantization/overview.md
@@ -34,7 +34,7 @@ Use the Space below to help you pick a quantization method depending on your har
| [GGUF / GGML (llama.cpp)](../gguf) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🔴 | 1/8 | 🔴 | [See Notes](../gguf) | [See Notes](../gguf) | https://github.com/ggerganov/llama.cpp |
| [GPTQModel](./gptq) | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/ModelCloud/GPTQModel |
| [AutoGPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
-| [HIGGS](./higgs) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute |
+| [HIGGS](./higgs) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute |
| [HQQ](./hqq) | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 1/8 | 🟢 | 🔴 | 🟢 | https://github.com/mobiusml/hqq/ |
| [optimum-quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 2/4/8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/optimum-quanto |
| [FBGEMM_FP8](./fbgemm_fp8) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
@@ -53,7 +53,7 @@ If you are new to quantization, we recommend checking out these beginner-friendl
## User-Friendly Quantization Tools
-If you are looking for a user-friendly quantization experience, you can use the following community spaces and notebooks:
+If you are looking for a user-friendly quantization experience, you can use the following community spaces and notebooks:
* [Bitsandbytes Space](https://huggingface.co/spaces/bnb-community/bnb-my-repo)
* [GGUF Space](https://huggingface.co/spaces/ggml-org/gguf-my-repo)
diff --git a/docs/source/en/quantization/selecting.md b/docs/source/en/quantization/selecting.md
index 7653e946dd8..69b989bca88 100644
--- a/docs/source/en/quantization/selecting.md
+++ b/docs/source/en/quantization/selecting.md
@@ -118,7 +118,7 @@ Consider the quantization method below during fine-tuning to save memory.
Other methods offer PEFT compatibility, though bitsandbytes is the most established and straightforward path for QLoRA.
-See the [bitsandbytes documentation](./bitsandbytes#qlora) and [PEFT Docs](https://huggingface.co/docs/peft/developer_guides/quantization#aqlm-quantization) for more details.
+See the [bitsandbytes documentation](./bitsandbytes#qlora) and [PEFT Docs](https://huggingface.co/docs/peft/developer_guides/quantization#aqlm-quantization) for more details.
## Research
diff --git a/docs/source/en/quantization/torchao.md b/docs/source/en/quantization/torchao.md
index 6427866d022..8778f9f3e5e 100644
--- a/docs/source/en/quantization/torchao.md
+++ b/docs/source/en/quantization/torchao.md
@@ -30,7 +30,6 @@ See the table below for additional torchao features.
> [!TIP]
> Refer to the torchao [README.md](https://github.com/pytorch/ao#torchao-pytorch-architecture-optimization) for more details about the library.
-
torchao supports the [quantization techniques](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md) below.
- A16W8 Float8 Dynamic Quantization
@@ -43,7 +42,6 @@ torchao supports the [quantization techniques](https://github.com/pytorch/ao/blo
torchao also supports module-level configuration by specifying a dictionary mapping a module's fully qualified name to its corresponding quantization config. This allows you to skip quantizing certain layers and use different quantization configs for different modules.
-
Check the table below to see if your hardware is compatible.
| Component | Compatibility |
@@ -52,8 +50,6 @@ Check the table below to see if your hardware is compatible.
| XPU Versions | ✅ pytorch2.8 |
| CPU | ✅ change `device_map="cpu"` (see examples below) |
-
-
Install torchao from PyPI or the PyTorch index with the following commands.
@@ -64,13 +60,15 @@ Install torchao from PyPi or the PyTorch index with the following commands.
# Stable release from Pypi which will default to CUDA 12.6
pip install --upgrade torchao transformers
```
+
Stable Release from the PyTorch index
-
+
```bash
pip install torchao --index-url https://download.pytorch.org/whl/cu126 # options are cpu/cu118/cu126/cu128
```
+
@@ -118,6 +116,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
@@ -146,6 +145,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
@@ -177,13 +177,14 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
### A100 GPU
-
+
```py
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
@@ -210,6 +211,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
@@ -245,6 +247,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
@@ -276,13 +279,14 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
### Intel XPU
-
+
```py
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
@@ -309,6 +313,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
@@ -340,14 +345,14 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
-
### CPU
-
+
```py
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
@@ -373,6 +378,7 @@ input_ids = tokenizer(input_text, return_tensors="pt")
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
@@ -404,12 +410,14 @@ input_ids = tokenizer(input_text, return_tensors="pt")
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
### Per Module Quantization
#### 1. Skip quantization for certain layers
With `ModuleFqnToConfig` we can specify a default configuration for all layers while skipping quantization for certain layers.
+
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
@@ -438,6 +446,7 @@ print(output_text)
```
#### 2. Quantizing different layers with different quantization configs
+
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
@@ -485,7 +494,6 @@ Note: autoquant is for GPU only right now.
Create a [`TorchAoConfig`] and set it to `"autoquant"`. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method. Finally, call `finalize_autoquant` on the quantized model to finalize the quantization and log the input shapes.
-
```py
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
@@ -509,7 +517,6 @@ quantized_model.finalize_autoquant()
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
-
## Serialization
torchao implements [torch.Tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) for maximum flexibility in supporting new quantized torch.Tensor formats. [Safetensors](https://huggingface.co/docs/safetensors/en/index) serialization and deserialization does not work with torchao.
@@ -518,15 +525,16 @@ To avoid arbitrary user code execution, torchao sets `weights_only=True` in [tor
-
+
```py
# don't serialize model with Safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained("llama3-8b-int4wo-128", safe_serialization=False)
```
+
-
+
```py
# don't serialize model with Safetensors
USER_ID = "your_huggingface_user_id"
@@ -534,13 +542,14 @@ REPO_ID = "llama3-8b-int4wo-128"
quantized_model.push_to_hub(f"{USER_ID}/llama3-8b-int4wo-128", safe_serialization=False)
tokenizer.push_to_hub(f"{USER_ID}/llama3-8b-int4wo-128")
```
+
-
## Loading quantized models
Loading a quantized model depends on the quantization scheme. For quantization schemes, like int8 and float8, you can quantize the model on any device and also load it on any device. The example below demonstrates quantizing a model on the CPU and then loading it on CUDA or XPU.
+
```py
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
@@ -574,6 +583,7 @@ output = reloaded_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
+
For int4, the model can only be loaded on the same device it was quantized on because the layout is specific to the device. The example below demonstrates quantizing and loading a model on the CPU.
```py
@@ -641,8 +651,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
>
> All configuration objects accept parameters for customization (e.g., `group_size`, `scheme`, `layout`).
-
-
## Resources
For a better sense of expected performance, view the [benchmarks](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks) for various models with CUDA and XPU backends. You can also run the code below to benchmark a model yourself.
diff --git a/docs/source/en/run_scripts.md b/docs/source/en/run_scripts.md
index ef32bf26ee0..594eb84b02a 100644
--- a/docs/source/en/run_scripts.md
+++ b/docs/source/en/run_scripts.md
@@ -52,6 +52,7 @@ Start with a smaller dataset by including the `max_train_samples`, `max_eval_sam
> [!WARNING]
> Not all example scripts support the `max_predict_samples` parameter. Run the command below to check whether a script supports it or not.
+>
> ```bash
> examples/pytorch/summarization/run_summarization.py -h
> ```
diff --git a/docs/source/en/serialization.md b/docs/source/en/serialization.md
index 831f163bed1..cf9160f5b33 100644
--- a/docs/source/en/serialization.md
+++ b/docs/source/en/serialization.md
@@ -38,6 +38,7 @@ pip install optimum[exporters]
> [!TIP]
> Refer to the [Export a model to ONNX with optimum.exporters.onnx](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) guide for all available arguments or with the command below.
+>
> ```bash
> optimum-cli export onnx --help
> ```
diff --git a/docs/source/en/serving.md b/docs/source/en/serving.md
index f421a284950..6237b09bb49 100644
--- a/docs/source/en/serving.md
+++ b/docs/source/en/serving.md
@@ -356,7 +356,6 @@ ResponseCompletedEvent(response=Response(id='resp_req_0', created_at=1754060400.
-
## MCP integration
The `transformers serve` server is also an MCP client, so it can interact with MCP tools in agentic use cases. This, of course, requires the use of an LLM that is designed to use tools.
@@ -382,7 +381,6 @@ transformers serve \
--attn_implementation sdpa_paged
```
-
### Performance tips
- Use an efficient attention backend when available:
@@ -401,5 +399,3 @@ transformers serve \
- `--load_in_4bit`/`--load_in_8bit` can reduce memory footprint for LoRA setups
- `--force-model` avoids per-request model hints and helps produce stable, repeatable runs
-
-
diff --git a/docs/source/en/tasks/audio_classification.md b/docs/source/en/tasks/audio_classification.md
index 52e2f965ee2..250b980be19 100644
--- a/docs/source/en/tasks/audio_classification.md
+++ b/docs/source/en/tasks/audio_classification.md
@@ -210,7 +210,6 @@ At this point, only three steps remain:
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [`~Trainer.train`] to fine-tune your model.
-
```py
>>> training_args = TrainingArguments(
... output_dir="my_awesome_mind_model",
diff --git a/docs/source/en/tasks/document_question_answering.md b/docs/source/en/tasks/document_question_answering.md
index d83e025c409..902a948307f 100644
--- a/docs/source/en/tasks/document_question_answering.md
+++ b/docs/source/en/tasks/document_question_answering.md
@@ -439,6 +439,7 @@ Now that you have finetuned a LayoutLMv2 model, and uploaded it to the 🤗 Hub,
way to try out your finetuned model for inference is to use it in a [`Pipeline`].
Let's take an example:
+
```py
>>> example = dataset["test"][2]
>>> question = example["query"]["en"]
diff --git a/docs/source/en/tasks/idefics.md b/docs/source/en/tasks/idefics.md
index 3f8915f3cc9..5fef5953d5b 100644
--- a/docs/source/en/tasks/idefics.md
+++ b/docs/source/en/tasks/idefics.md
@@ -18,26 +18,26 @@ rendered properly in your Markdown viewer.
[[open-in-colab]]
-While individual tasks can be tackled by fine-tuning specialized models, an alternative approach
-that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning.
-For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more.
-This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can
-solve image-text tasks with a large multimodal model called IDEFICS.
+While individual tasks can be tackled by fine-tuning specialized models, an alternative approach
+that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning.
+For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more.
+This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can
+solve image-text tasks with a large multimodal model called IDEFICS.
-[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198),
-a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image
-and text inputs and generates coherent text as output. It can answer questions about images, describe visual content,
-create stories grounded in multiple images, and so on. IDEFICS comes in two variants - [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b)
-and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed
+[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198),
+a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image
+and text inputs and generates coherent text as output. It can answer questions about images, describe visual content,
+create stories grounded in multiple images, and so on. IDEFICS comes in two variants - [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b)
+and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed
versions of the model adapted for conversational use cases.
-This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However,
-being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether
-this approach suits your use case better than fine-tuning specialized models for each individual task.
+This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However,
+being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether
+this approach suits your use case better than fine-tuning specialized models for each individual task.
-In this guide, you'll learn how to:
+In this guide, you'll learn how to:
- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model)
-- Use IDEFICS for:
+- Use IDEFICS for:
- [Image captioning](#image-captioning)
- [Prompted image captioning](#prompted-image-captioning)
- [Few-shot prompting](#few-shot-prompting)
@@ -47,7 +47,7 @@ In this guide, you'll learn how to:
- [Run inference in batch mode](#running-inference-in-batch-mode)
- [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use)
-Before you begin, make sure you have all the necessary libraries installed.
+Before you begin, make sure you have all the necessary libraries installed.
```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
@@ -59,14 +59,14 @@ To run the following examples with a non-quantized version of the model checkpoi
## Loading the model
-Let's start by loading the model's 9 billion parameters checkpoint:
+Let's start by loading the model's 9 billion parameters checkpoint:
```py
>>> checkpoint = "HuggingFaceM4/idefics-9b"
```
-Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint.
-The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of
+Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint.
+The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of
preparing text and image inputs for the model.
```py
@@ -79,13 +79,13 @@ preparing text and image inputs for the model.
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto")
```
-Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized
+Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized
manner given existing devices.
### Quantized model
-If high-memory device availability is an issue, you can load the quantized version of the model. To load the model and the
-processor in 4bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed
+If high-memory device availability is an issue, you can load the quantized version of the model. To load the model and the
+processor in 4bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed
on the fly while loading.
```py
@@ -109,8 +109,8 @@ on the fly while loading.
Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for.
## Image captioning
-Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired
-people navigate through different situations, for instance, explore image content online.
+Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired
+people navigate through different situations, for instance, by exploring image content online.
To illustrate the task, get an image to be captioned, e.g.:
@@ -118,10 +118,10 @@ To illustrate the task, get an image to be captioned, e.g.:
-Photo by [Hendo Wang](https://unsplash.com/@hendoo).
+Photo by [Hendo Wang](https://unsplash.com/@hendoo).
-IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the
-model, only the preprocessed input image. Without a text prompt, the model will start generating text from the
+IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the
+model, only the preprocessed input image. Without a text prompt, the model will start generating text from the
BOS (beginning-of-sequence) token thus creating a caption.
As image input to the model, you can use either an image object (`PIL.Image`) or a url from which the image can be retrieved.
@@ -142,15 +142,15 @@ A puppy in a flower bed
-It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing
-the `max_new_tokens`: the model will want to generate a new `` or `` token when there
+It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing
+the `max_new_tokens`: the model will want to generate a new `<image>` or `<fake_token_around_image>` token when there
is no image being generated by the model.
You can set it on-the-fly as in this guide, or store in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide.
## Prompted image captioning
-You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take
+You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take
another image to illustrate:
@@ -158,7 +158,7 @@ another image to illustrate:
Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai).
-
+
Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs.
```py
@@ -178,12 +178,12 @@ This is an image of the Eiffel Tower in Paris, France.
## Few-shot prompting
-While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with
+While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with
other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning.
-By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples.
+By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples.
-Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model
-that in addition to learning what the object in an image is, we would also like to get some interesting information about it.
+Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model
+that in addition to learning what the object in an image is, we would also like to get some interesting information about it.
Then, let's see if we can get the same response format for an image of the Statue of Liberty:
@@ -213,24 +213,24 @@ User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
```
-Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks,
+Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks,
feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.).
## Visual question answering
-Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image
-captioning it can be used in accessibility applications, but also in education (reasoning about visual materials), customer
+Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image
+captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer
service (questions about products based on images), and image retrieval.
-Let's get a new image for this task:
+Let's get a new image for this task:
-Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos).
+Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos).
-You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:
+You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:
```py
>>> prompt = [
@@ -251,11 +251,11 @@ Instruction: Provide an answer to the question. Use the image to answer.
## Image classification
-IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing
-labeled examples from those specific categories. Given a list of categories and using its image and text understanding
-capabilities, the model can infer which category the image likely belongs to.
+IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing
+labeled examples from those specific categories. Given a list of categories and using its image and text understanding
+capabilities, the model can infer which category the image likely belongs to.
-Say, we have this image of a vegetable stand:
+Say, we have this image of a vegetable stand:

@@ -286,10 +286,10 @@ In the example above we instruct the model to classify the image into a single c
## Image-guided text generation
-For more creative applications, you can use image-guided text generation to generate text based on an image. This can be
-useful to create descriptions of products, ads, descriptions of a scene, etc.
+For more creative applications, you can use image-guided text generation to generate text based on an image. This can be
+useful to create descriptions of products, ads, descriptions of a scene, etc.
-Let's prompt IDEFICS to write a story based on a simple image of a red door:
+Let's prompt IDEFICS to write a story based on a simple image of a red door:

@@ -333,14 +333,14 @@ Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Ha
-For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help
-you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies)
-to learn more.
+For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help
+you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies)
+to learn more.
## Running inference in batch mode
-All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference
+All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference
for a batch of examples by passing a list of prompts:
```py
@@ -375,13 +375,13 @@ This is an image of a vegetable stand.
## IDEFICS instruct for conversational use
-For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub:
+For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub:
`HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`.
-These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction
+These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction
fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.
-The use and prompting for the conversational use is very similar to using the base models:
+Usage and prompting for conversational use are very similar to using the base models:
```py
>>> import torch
diff --git a/docs/source/en/tasks/image_captioning.md b/docs/source/en/tasks/image_captioning.md
index f9716f29a20..89c35a50b55 100644
--- a/docs/source/en/tasks/image_captioning.md
+++ b/docs/source/en/tasks/image_captioning.md
@@ -14,7 +14,6 @@ rendered properly in your Markdown viewer.
-->
-
# Image captioning
[[open-in-colab]]
@@ -26,7 +25,7 @@ helps to improve content accessibility for people by describing images to them.
This guide will show you how to:
* Fine-tune an image captioning model.
-* Use the fine-tuned model for inference.
+* Use the fine-tuned model for inference.
Before you begin, make sure you have all the necessary libraries installed:
@@ -37,7 +36,6 @@ pip install jiwer -q
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
-
```python
from huggingface_hub import notebook_login
@@ -47,8 +45,7 @@ notebook_login()
## Load the Pokémon BLIP captions dataset
Use the 🤗 Datasets library to load a dataset that consists of {image-caption} pairs. To create your own image captioning dataset
-in PyTorch, you can follow [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb).
-
+in PyTorch, you can follow [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb).
```python
from datasets import load_dataset
@@ -56,6 +53,7 @@ from datasets import load_dataset
ds = load_dataset("lambdalabs/pokemon-blip-captions")
ds
```
+
```bash
DatasetDict({
train: Dataset({
@@ -69,21 +67,19 @@ The dataset has two features, `image` and `text`.
-Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample a caption amongst the available ones during training.
+Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample a caption amongst the available ones during training.
Split the dataset’s train split into a train and test set with the [`~datasets.Dataset.train_test_split`] method:
-
```python
ds = ds["train"].train_test_split(test_size=0.1)
train_ds = ds["train"]
test_ds = ds["test"]
```
-Let's visualize a couple of samples from the training set.
-
+Let's visualize a couple of samples from the training set.
```python
from textwrap import wrap
@@ -106,7 +102,7 @@ sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)]
sample_captions = [train_ds[i]["text"] for i in range(5)]
plot_images(sample_images_to_visualize, sample_captions)
```
-
+
@@ -115,7 +111,7 @@ plot_images(sample_images_to_visualize, sample_captions)
Since the dataset has two modalities (image and text), the pre-processing pipeline will preprocess images and the captions.
-To do so, load the processor class associated with the model you are about to fine-tune.
+To do so, load the processor class associated with the model you are about to fine-tune.
```python
from transformers import AutoProcessor
@@ -124,7 +120,7 @@ checkpoint = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(checkpoint)
```
-The processor will internally pre-process the image (which includes resizing, and pixel scaling) and tokenize the caption.
+The processor will internally pre-process the image (which includes resizing and pixel scaling) and tokenize the caption.
```python
def transforms(example_batch):
@@ -139,13 +135,12 @@ train_ds.set_transform(transforms)
test_ds.set_transform(transforms)
```
-With the dataset ready, you can now set up the model for fine-tuning.
+With the dataset ready, you can now set up the model for fine-tuning.
## Load a base model
Load the ["microsoft/git-base"](https://huggingface.co/microsoft/git-base) into a [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) object.
-
```python
from transformers import AutoModelForCausalLM
@@ -154,10 +149,9 @@ model = AutoModelForCausalLM.from_pretrained(checkpoint)
## Evaluate
-Image captioning models are typically evaluated with the [Rouge Score](https://huggingface.co/spaces/evaluate-metric/rouge) or [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer). For this guide, you will use the Word Error Rate (WER).
-
-We use the 🤗 Evaluate library to do so. For potential limitations and other gotchas of the WER, refer to [this guide](https://huggingface.co/spaces/evaluate-metric/wer).
+Image captioning models are typically evaluated with the [Rouge Score](https://huggingface.co/spaces/evaluate-metric/rouge) or [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer). For this guide, you will use the Word Error Rate (WER).
+We use the 🤗 Evaluate library to do so. For potential limitations and other gotchas of the WER, refer to [this guide](https://huggingface.co/spaces/evaluate-metric/wer).
```python
from evaluate import load
@@ -177,11 +171,10 @@ def compute_metrics(eval_pred):
## Train!
-Now, you are ready to start fine-tuning the model. You will use the 🤗 [`Trainer`] for this.
+Now, you are ready to start fine-tuning the model. You will use the 🤗 [`Trainer`] for this.
First, define the training arguments using [`TrainingArguments`].
-
```python
from transformers import TrainingArguments, Trainer
@@ -208,7 +201,7 @@ training_args = TrainingArguments(
)
```
-Then pass them along with the datasets and the model to 🤗 Trainer.
+Then pass them along with the datasets and the model to 🤗 Trainer.
```python
trainer = Trainer(
@@ -222,7 +215,7 @@ trainer = Trainer(
To start training, simply call [`~Trainer.train`] on the [`Trainer`] object.
-```python
+```python
trainer.train()
```
@@ -230,7 +223,6 @@ You should see the training loss drop smoothly as training progresses.
Once training is completed, share your model to the Hub with the [`~Trainer.push_to_hub`] method so everyone can use your model:
-
```python
trainer.push_to_hub()
```
@@ -239,7 +231,6 @@ trainer.push_to_hub()
Take a sample image from `test_ds` to test the model.
-
```python
from PIL import Image
import requests
@@ -252,7 +243,7 @@ image
-
+
Prepare the image for the model.
```python
@@ -263,13 +254,14 @@ inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values
```
-Call [`generate`] and decode the predictions.
+Call [`generate`] and decode the predictions.
```python
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_caption)
```
+
```bash
a drawing of a pink and blue pokemon
```
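If you pushed the model to the Hub, you can also try it out through the `image-to-text` pipeline. This is a sketch, and the repository id below is a placeholder for the checkpoint that `push_to_hub` created:

```python
from transformers import pipeline

# placeholder repo id -- replace with the checkpoint you pushed to the Hub
captioner = pipeline("image-to-text", model="your-username/git-base-pokemon")

# any local path, URL, or PIL image works here
outputs = captioner("path/to/your_test_image.png")
print(outputs[0]["generated_text"])
```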
diff --git a/docs/source/en/tasks/image_classification.md b/docs/source/en/tasks/image_classification.md
index 39b013f129c..4754a91bd48 100644
--- a/docs/source/en/tasks/image_classification.md
+++ b/docs/source/en/tasks/image_classification.md
@@ -175,7 +175,6 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when
## Train
-
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
@@ -238,7 +237,6 @@ Once training is completed, share your model to the Hub with the [`~transformers
>>> trainer.push_to_hub()
```
-
For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
diff --git a/docs/source/en/tasks/image_feature_extraction.md b/docs/source/en/tasks/image_feature_extraction.md
index 455a2b425d4..e08ba89e4dd 100644
--- a/docs/source/en/tasks/image_feature_extraction.md
+++ b/docs/source/en/tasks/image_feature_extraction.md
@@ -27,7 +27,7 @@ In this guide, you will:
## Image Similarity using `image-feature-extraction` Pipeline
-We have two images of cats sitting on top of fish nets, one of them is generated.
+We have two images of cats sitting on top of fish nets; one of them is generated.
```python
from PIL import Image
@@ -66,7 +66,7 @@ print(outputs)
# [[[-0.03909236937761307, 0.43381670117378235, -0.06913255900144577,
```
-To get the similarity score, we need to pass them to a similarity function.
+To get the similarity score, we need to pass them to a similarity function.
```python
from torch.nn.functional import cosine_similarity
@@ -131,4 +131,3 @@ print(similarity_score)
# tensor([0.6061], device='cuda:0', grad_fn=)
```
-
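As a quick end-to-end check, the same comparison can also be sketched with the pipeline's pooled outputs; the checkpoint below is illustrative, and the second URL simply reuses the first image so the score should be close to 1:

```python
import torch
from transformers import pipeline

# checkpoint is illustrative -- any vision backbone with a pooled output works
pipe = pipeline("image-feature-extraction", model="google/vit-base-patch16-224", pool=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
# compare an image with itself as a sanity check
embeddings = [torch.tensor(output) for output in pipe([url, url])]

score = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=1)
print(score)  # expected to be close to 1.0
```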
diff --git a/docs/source/en/tasks/image_text_to_text.md b/docs/source/en/tasks/image_text_to_text.md
index b34f4edf90f..5412882b59f 100644
--- a/docs/source/en/tasks/image_text_to_text.md
+++ b/docs/source/en/tasks/image_text_to_text.md
@@ -63,7 +63,6 @@ The image inputs look like the following.
-
```python
from PIL import Image
import requests
@@ -76,7 +75,6 @@ images = [Image.open(requests.get(img_urls[0], stream=True).raw),
Below is an example of the chat template. We can feed conversation turns and the last message as an input by appending it at the end of the template.
-
```python
messages = [
{
@@ -207,7 +205,6 @@ We can use [text streaming](./generation_strategies#streaming) for a better gene
Assume we have an application that keeps chat history and takes in the new user input. We will preprocess the inputs as usual and initialize [`TextIteratorStreamer`] to handle the generation in a separate thread. This allows you to stream the generated text tokens in real-time. Any generation arguments can be passed to [`TextIteratorStreamer`].
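Before the full application below, here is a compact, self-contained sketch of the streaming pattern; the checkpoint is illustrative, and any image-text-to-text model with a chat template should behave similarly:

```python
from threading import Thread

import requests
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor, TextIteratorStreamer

# the checkpoint is an assumption -- swap in the model you are actually using
checkpoint = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(checkpoint)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# generation runs in a background thread while the main thread consumes tokens as they arrive
streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=50))
thread.start()
for new_text in streamer:
    print(new_text, end="")
thread.join()
```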
-
```python
import time
from transformers import TextIteratorStreamer
diff --git a/docs/source/en/tasks/image_to_image.md b/docs/source/en/tasks/image_to_image.md
index da6a57ac9aa..6c4cdf585f0 100644
--- a/docs/source/en/tasks/image_to_image.md
+++ b/docs/source/en/tasks/image_to_image.md
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
[[open-in-colab]]
-Image-to-Image task is the task where an application receives an image and outputs another image. This has various subtasks, including image enhancement (super resolution, low light enhancement, deraining and so on), image inpainting, and more.
+Image-to-image is the task where an application receives an image and outputs another image. It has various subtasks, including image enhancement (super resolution, low light enhancement, deraining and so on), image inpainting, and more.
This guide will show you how to:
- Use an image-to-image pipeline for super resolution task,
@@ -32,7 +32,7 @@ Let's begin by installing the necessary libraries.
pip install transformers
```
-We can now initialize the pipeline with a [Swin2SR model](https://huggingface.co/caidas/swin2SR-lightweight-x2-64). We can then infer with the pipeline by calling it with an image. As of now, only [Swin2SR models](https://huggingface.co/models?sort=trending&search=swin2sr) are supported in this pipeline.
+We can now initialize the pipeline with a [Swin2SR model](https://huggingface.co/caidas/swin2SR-lightweight-x2-64). We can then infer with the pipeline by calling it with an image. As of now, only [Swin2SR models](https://huggingface.co/models?sort=trending&search=swin2sr) are supported in this pipeline.
```python
from transformers import pipeline, infer_device
@@ -53,19 +53,22 @@ image = Image.open(requests.get(url, stream=True).raw)
print(image.size)
```
+
```bash
# (532, 432)
```
+
-We can now do inference with the pipeline. We will get an upscaled version of the cat image.
+We can now do inference with the pipeline. We will get an upscaled version of the cat image.
```python
upscaled = pipe(image)
print(upscaled.size)
```
+
```bash
# (1072, 880)
```
@@ -79,7 +82,7 @@ model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-lightweig
processor = Swin2SRImageProcessor("caidas/swin2SR-lightweight-x2-64")
```
-`pipeline` abstracts away the preprocessing and postprocessing steps that we have to do ourselves, so let's preprocess the image. We will pass the image to the processor and then move the pixel values to GPU.
+`pipeline` abstracts away the preprocessing and postprocessing steps, so here we have to do them ourselves. Let's preprocess the image: pass it to the processor and then move the pixel values to the GPU.
```python
pixel_values = processor(image, return_tensors="pt").pixel_values
@@ -96,7 +99,8 @@ import torch
with torch.no_grad():
outputs = model(pixel_values)
```
-Output is an object of type `ImageSuperResolutionOutput` that looks like below 👇
+
+The output is an object of type `ImageSuperResolutionOutput` that looks like below 👇
```
(loss=None, reconstruction=tensor([[[[0.8270, 0.8269, 0.8275, ..., 0.7463, 0.7446, 0.7453],
@@ -108,6 +112,7 @@ Output is an object of type `ImageSuperResolutionOutput` that looks like below
[0.5927, 0.5914, 0.5922, ..., 0.0664, 0.0694, 0.0718]]]],
device='cuda:0'), hidden_states=None, attentions=None)
```
+
We need to get the `reconstruction` and post-process it for visualization. Let's see how it looks.
```python
@@ -128,6 +133,7 @@ output = np.moveaxis(output, source=0, destination=-1)
output = (output * 255.0).round().astype(np.uint8)
Image.fromarray(output)
```
+
diff --git a/docs/source/en/tasks/keypoint_detection.md b/docs/source/en/tasks/keypoint_detection.md
index 3a5871d01a2..c850c67ae15 100644
--- a/docs/source/en/tasks/keypoint_detection.md
+++ b/docs/source/en/tasks/keypoint_detection.md
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
[[open-in-colab]]
-Keypoint detection identifies and locates specific points of interest within an image. These keypoints, also known as landmarks, represent meaningful features of objects, such as facial features or object parts. These models take an image input and return the following outputs:
+Keypoint detection identifies and locates specific points of interest within an image. These keypoints, also known as landmarks, represent meaningful features of objects, such as facial features or object parts. These models take an image input and return the following outputs:
- **Keypoints and Scores**: Points of interest and their confidence scores.
- **Descriptors**: A representation of the image region surrounding each keypoint, capturing its texture, gradient, orientation and other properties.
@@ -36,15 +36,14 @@ model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/sup
Let's test the model on the images below.
-

-
-
```python
import torch
from PIL import Image
@@ -93,7 +92,7 @@ image_sizes = [(image.size[1], image.size[0]) for image in images]
outputs = processor.post_process_keypoint_detection(outputs, image_sizes)
```
-The outputs are now a list of dictionaries where each dictionary is a processed output of keypoints, scores and descriptors.
+The outputs are now a list of dictionaries where each dictionary is a processed output of keypoints, scores and descriptors.
```python
[{'keypoints': tensor([[ 226, 57],
@@ -144,11 +143,10 @@ for i in range(len(images)):
Below you can see the outputs.
-

-
-
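To make use of the confidence scores mentioned above, here is a minimal sketch that keeps only keypoints above an arbitrary threshold; the image URL and the threshold value are placeholders:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, SuperPointForKeypointDetection

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

# any image works here; this one is from the Hugging Face documentation assets
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

processed = processor.post_process_keypoint_detection(outputs, [(image.size[1], image.size[0])])[0]
# keep only keypoints above an arbitrary confidence threshold
keep = processed["scores"] > 0.1
print(processed["keypoints"][keep].shape)
```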
diff --git a/docs/source/en/tasks/keypoint_matching.md b/docs/source/en/tasks/keypoint_matching.md
index f7065f31521..aff16a937d7 100644
--- a/docs/source/en/tasks/keypoint_matching.md
+++ b/docs/source/en/tasks/keypoint_matching.md
@@ -34,15 +34,15 @@ model = AutoModelForKeypointMatching.from_pretrained("zju-community/matchanythin
Load two images that have the same object of interest. The second photo was taken a second later; its colors are edited, and it is further cropped and rotated.
-

-
-```python
+```python
from transformers.image_utils import load_image
image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg")
image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee_edited.jpg")
@@ -82,16 +82,16 @@ Here's the outputs.
[1521, 2560]], dtype=torch.int32),
'matching_scores': tensor([0.2189, 0.2073, 0.2414, ...
])}]
-```
+```
We have trimmed the output, but there are 401 matches!
```python
len(outputs[0]["keypoints0"])
# 401
-```
+```
-We can visualize them using the processor's [`~EfficientLoFTRImageProcessor.visualize_keypoint_matching`] method.
+We can visualize them using the processor's [`~EfficientLoFTRImageProcessor.visualize_keypoint_matching`] method.
```python
plot_images = processor.visualize_keypoint_matching(images, outputs)
@@ -100,7 +100,7 @@ plot_images

-Optionally, you can use the [`Pipeline`] API and set the task to `keypoint-matching`.
+Optionally, you can use the [`Pipeline`] API and set the task to `keypoint-matching`.
```python
from transformers import pipeline
diff --git a/docs/source/en/tasks/knowledge_distillation_for_image_classification.md b/docs/source/en/tasks/knowledge_distillation_for_image_classification.md
index 7c4a684d3c0..d4b3dd8511d 100644
--- a/docs/source/en/tasks/knowledge_distillation_for_image_classification.md
+++ b/docs/source/en/tasks/knowledge_distillation_for_image_classification.md
@@ -52,7 +52,6 @@ processed_datasets = dataset.map(process, batched=True)
Essentially, we want the student model (a randomly initialized MobileNet) to mimic the teacher model (a fine-tuned vision transformer). To achieve this, we first get the logits from the teacher and the student. Then, we divide each of them by the parameter `temperature`, which controls the importance of each soft target. A parameter called `lambda` weighs the importance of the distillation loss. In this example, we will use `temperature=5` and `lambda=0.5`. We will use the Kullback-Leibler Divergence loss to compute the divergence between the student and the teacher. Given two distributions P and Q, KL divergence explains how much extra information we need to represent P using Q. If the two are identical, their KL divergence is zero, as no extra information is needed to explain P from Q. Thus, in the context of knowledge distillation, KL divergence is useful.
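Before wiring this into a trainer, here is a minimal standalone sketch of how such a combined loss can be computed; the random tensors stand in for real teacher and student logits:

```python
import torch
import torch.nn.functional as F

# random tensors stand in for real teacher and student logits over 10 classes
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

temperature, lambda_param = 5.0, 0.5

# soften both distributions with the temperature
soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
soft_student = F.log_softmax(student_logits / temperature, dim=-1)

# KL divergence between the softened student and teacher, scaled by temperature**2
distillation_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

# ordinary cross-entropy of the student against the ground-truth labels
student_loss = F.cross_entropy(student_logits, labels)

# lambda weighs the distillation term against the supervised term
loss = (1.0 - lambda_param) * student_loss + lambda_param * distillation_loss
print(loss)
```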
-
```python
from transformers import TrainingArguments, Trainer, infer_device
import torch
diff --git a/docs/source/en/tasks/mask_generation.md b/docs/source/en/tasks/mask_generation.md
index 5f66e68c245..06ba26ea123 100644
--- a/docs/source/en/tasks/mask_generation.md
+++ b/docs/source/en/tasks/mask_generation.md
@@ -16,22 +16,22 @@ rendered properly in your Markdown viewer.
# Mask Generation
-Mask generation is the task of generating semantically meaningful masks for an image.
-This task is very similar to [image segmentation](semantic_segmentation), but many differences exist. Image segmentation models are trained on labeled datasets and are limited to the classes they have seen during training; they return a set of masks and corresponding classes, given an image.
+Mask generation is the task of generating semantically meaningful masks for an image.
+This task is very similar to [image segmentation](semantic_segmentation), but many differences exist. Image segmentation models are trained on labeled datasets and are limited to the classes they have seen during training; they return a set of masks and corresponding classes, given an image.
-Mask generation models are trained on large amounts of data and operate in two modes.
-- Prompting mode: In this mode, the model takes in an image and a prompt, where a prompt can be a 2D point location (XY coordinates) in the image within an object or a bounding box surrounding an object. In prompting mode, the model only returns the mask over the object
-that the prompt is pointing out.
-- Segment Everything mode: In segment everything, given an image, the model generates every mask in the image. To do so, a grid of points is generated and overlaid on the image for inference.
+Mask generation models are trained on large amounts of data and operate in two modes.
+- Prompting mode: In this mode, the model takes in an image and a prompt, where a prompt can be a 2D point location (XY coordinates) in the image within an object or a bounding box surrounding an object. In prompting mode, the model only returns the mask over the object
+that the prompt is pointing out.
+- Segment Everything mode: In segment everything, given an image, the model generates every mask in the image. To do so, a grid of points is generated and overlaid on the image for inference.
-Mask generation task is supported by [Segment Anything Model (SAM)](model_doc/sam). It's a powerful model that consists of a Vision Transformer-based image encoder, a prompt encoder, and a two-way transformer mask decoder. Images and prompts are encoded, and the decoder takes these embeddings and generates valid masks.
+The mask generation task is supported by the [Segment Anything Model (SAM)](model_doc/sam). It's a powerful model that consists of a Vision Transformer-based image encoder, a prompt encoder, and a two-way transformer mask decoder. Images and prompts are encoded, and the decoder takes these embeddings and generates valid masks.
-SAM serves as a powerful foundation model for segmentation as it has large data coverage. It is trained on
-[SA-1B](https://ai.meta.com/datasets/segment-anything/), a dataset with 1 million images and 1.1 billion masks.
+SAM serves as a powerful foundation model for segmentation as it has large data coverage. It is trained on
+[SA-1B](https://ai.meta.com/datasets/segment-anything/), a dataset with 11 million images and 1.1 billion masks.
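As a quick preview, segment-everything mode is exposed through the `mask-generation` pipeline; a minimal sketch, using an example image from the documentation assets:

```python
from transformers import pipeline

# segment-everything mode: a grid of points is laid over the image and masks are predicted
# for batches of those points; points_per_batch trades memory for speed
generator = pipeline("mask-generation", model="facebook/sam-vit-base")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
outputs = generator(url, points_per_batch=64)

print(len(outputs["masks"]), outputs["scores"][:3])
```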
In this guide, you will learn how to:
- Infer in segment everything mode with batching,
@@ -114,7 +114,6 @@ Below is the original image in grayscale with colorful maps overlaid. Very impre
-
## Model Inference
### Point Prompting
@@ -132,7 +131,7 @@ processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
To do point prompting, pass the input point to the processor, then take the processor output
and pass it to the model for inference. To post-process the model output, pass the outputs and
-`original_sizes` and `reshaped_input_sizes` we take from the processor's initial output. We need to pass these
+`original_sizes` and `reshaped_input_sizes` we take from the processor's initial output. We need to pass these
since the processor resizes the image, and the output needs to be extrapolated.
```python
@@ -143,6 +142,7 @@ with torch.no_grad():
outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
```
+
We can visualize the three masks in the `masks` output.
```python
@@ -177,10 +177,9 @@ plt.show()
### Box Prompting
You can also do box prompting in a similar fashion to point prompting. You can simply pass the input box in the format of a list
-`[x_min, y_min, x_max, y_max]` format along with the image to the `processor`. Take the processor output and directly pass it
+`[x_min, y_min, x_max, y_max]` along with the image to the `processor`. Take the processor output and directly pass it
to the model, then post-process the output again.
-
```python
# bounding box around the bee
box = [2350, 1600, 2850, 2100]
@@ -219,7 +218,7 @@ plt.show()
-You can see the inference output below.
+You can see the inference output below.
```python
fig, ax = plt.subplots()
@@ -233,4 +232,3 @@ plt.show()
-
diff --git a/docs/source/en/tasks/monocular_depth_estimation.md b/docs/source/en/tasks/monocular_depth_estimation.md
index c90abce1cd5..aef9bd22c4d 100644
--- a/docs/source/en/tasks/monocular_depth_estimation.md
+++ b/docs/source/en/tasks/monocular_depth_estimation.md
@@ -23,7 +23,7 @@ a single camera viewpoint.
Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving,
and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects
in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions,
-occlusion, and texture.
+occlusion, and texture.
There are two main depth estimation categories:
@@ -143,7 +143,7 @@ Let's post-process the results to remove any padding and resize the depth map to
In the original implementation, the ZoeDepth model performs inference on both the original and flipped images and averages out the results. The `post_process_depth_estimation` function can handle this for us if we pass the flipped outputs to the optional `outputs_flipped` argument:
->>> with torch.no_grad():
+>>> with torch.no_grad():
... outputs = model(pixel_values)
... outputs_flipped = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))
>>> post_processed_output = image_processor.post_process_depth_estimation(
diff --git a/docs/source/en/tasks/multiple_choice.md b/docs/source/en/tasks/multiple_choice.md
index 3f4c9d4637f..d35f108ecce 100644
--- a/docs/source/en/tasks/multiple_choice.md
+++ b/docs/source/en/tasks/multiple_choice.md
@@ -113,6 +113,7 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [
```
To create a batch of examples, it's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. [`DataCollatorForMultipleChoice`] flattens all the model inputs, applies padding, and then unflattens the results.
+
```py
>>> from transformers import DataCollatorForMultipleChoice
>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
@@ -197,7 +198,6 @@ Once training is completed, share your model to the Hub with the [`~transformers
>>> trainer.push_to_hub()
```
-
For a more in-depth example of how to finetune a model for multiple choice, take a look at the corresponding
diff --git a/docs/source/en/tasks/object_detection.md b/docs/source/en/tasks/object_detection.md
index 394e77104b7..093644b662f 100644
--- a/docs/source/en/tasks/object_detection.md
+++ b/docs/source/en/tasks/object_detection.md
@@ -171,11 +171,11 @@ To get an even better understanding of the data, visualize an example in the dat
>>> image
```
+
-
To visualize the bounding boxes with associated labels, you can get the labels from the dataset's metadata, specifically
the `category` field.
You'll also want to create dictionaries that map a label id to a label class (`id2label`) and the other way around (`label2id`).
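A minimal sketch of building these mappings, assuming the CPPE-5 dataset used in this guide:

```python
from datasets import load_dataset

# the CPPE-5 dataset used in this guide; "objects" holds the annotations and "category" is a ClassLabel
cppe5 = load_dataset("cppe-5")
categories = cppe5["train"].features["objects"].feature["category"].names

id2label = {index: name for index, name in enumerate(categories)}
label2id = {name: index for index, name in id2label.items()}
print(id2label)
```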
@@ -576,6 +576,7 @@ Finally, bring everything together, and call [`~transformers.Trainer.train`]:
>>> trainer.train()
```
+
@@ -1487,6 +1488,7 @@ Now that you have finetuned a model, evaluated it, and uploaded it to the Huggin
```
Load the model and image processor from the Hugging Face Hub (skip this step if you want to use the model already trained in this session):
+
```py
>>> from transformers import infer_device
diff --git a/docs/source/en/tasks/prompting.md b/docs/source/en/tasks/prompting.md
index eb8e61d67aa..2d115d4e544 100644
--- a/docs/source/en/tasks/prompting.md
+++ b/docs/source/en/tasks/prompting.md
@@ -127,7 +127,6 @@ for output in outputs:
print(f"Result: {output['generated_text']}")
```
-
While the basic few-shot prompting approach embedded examples within a single text string, the chat template format offers the following benefits.
- The model may understand the task better because it can more clearly recognize the pattern and the expected roles of user input and assistant output.
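Concretely, a few-shot prompt in chat format could look like the following sketch; the checkpoint and the sentiment task are illustrative:

```python
from transformers import pipeline

# checkpoint is illustrative -- any chat model with a chat template works
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

# the few-shot example is a complete user/assistant turn, followed by the real query
messages = [
    {"role": "user", "content": "Text: The movie was a waste of time.\nSentiment:"},
    {"role": "assistant", "content": "Negative"},
    {"role": "user", "content": "Text: I could not stop smiling the whole way through.\nSentiment:"},
]

outputs = pipe(messages, max_new_tokens=5)
print(outputs[0]["generated_text"][-1]["content"])  # expected to follow the pattern, e.g. "Positive"
```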
diff --git a/docs/source/en/tasks/semantic_segmentation.md b/docs/source/en/tasks/semantic_segmentation.md
index 5d3c8e70aa1..08d68047dc6 100644
--- a/docs/source/en/tasks/semantic_segmentation.md
+++ b/docs/source/en/tasks/semantic_segmentation.md
@@ -69,6 +69,7 @@ results
```
The segmentation pipeline output includes a mask for every predicted class.
+
```bash
[{'score': None,
'label': 'road',
@@ -107,6 +108,7 @@ Taking a look at the mask for the car class, we can see every car is classified
```python
results[-1]["mask"]
```
+
@@ -135,11 +137,13 @@ As you can see below, there are multiple cars classified, and there's no classif
'label': 'person',
'mask':
}]
```
+
Checking out one of the car masks below.
```python
results[2]["mask"]
```
+
@@ -151,6 +155,7 @@ panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swi
results = panoptic_segmentation(image)
results
```
+
As you can see below, we have more classes. We will later illustrate that every pixel is classified into one of the classes.
```bash
@@ -206,7 +211,6 @@ To see all architectures and checkpoints compatible with this task, we recommend
-
### Load SceneParse150 dataset
Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
@@ -473,7 +477,6 @@ Reload the dataset and load an image for inference.
-
We will now see how to infer without a pipeline. Process the image with an image processor and place the `pixel_values` on a GPU:
```py
@@ -503,7 +506,6 @@ Next, rescale the logits to the original image size:
>>> pred_seg = upsampled_logits.argmax(dim=1)[0]
```
-
To visualize the results, load the [dataset color palette](https://github.com/tensorflow/models/blob/3f1ca33afe3c1631b733ea7e40c294273b9e406d/research/deeplab/utils/get_dataset_colormap.py#L51) as `ade_palette()` that maps each class to its RGB values.
```py
diff --git a/docs/source/en/tasks/summarization.md b/docs/source/en/tasks/summarization.md
index c57097421fb..b2f2beebc80 100644
--- a/docs/source/en/tasks/summarization.md
+++ b/docs/source/en/tasks/summarization.md
@@ -213,7 +213,6 @@ Once training is completed, share your model to the Hub with the [`~transformers
>>> trainer.push_to_hub()
```
-
For a more in-depth example of how to finetune a model for summarization, take a look at the corresponding
diff --git a/docs/source/en/tasks/token_classification.md b/docs/source/en/tasks/token_classification.md
index 49b0fcf216b..5096298affd 100644
--- a/docs/source/en/tasks/token_classification.md
+++ b/docs/source/en/tasks/token_classification.md
@@ -242,7 +242,6 @@ Before you start training your model, create a map of the expected ids to their
... }
```
-
If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)!
@@ -298,7 +297,6 @@ Once training is completed, share your model to the Hub with the [`~transformers
>>> trainer.push_to_hub()
```
-
For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding
diff --git a/docs/source/en/tasks/video_classification.md b/docs/source/en/tasks/video_classification.md
index b387a8320df..bae638bd84e 100644
--- a/docs/source/en/tasks/video_classification.md
+++ b/docs/source/en/tasks/video_classification.md
@@ -363,7 +363,6 @@ Leverage [`Trainer`](https://huggingface.co/docs/transformers/main_classes/train
Most of the training arguments are self-explanatory, but one that is quite important here is `remove_unused_columns=False`. When enabled, this option drops any features not used by the model's call function. By default it's `True` because usually it's ideal to drop unused feature columns, making it easier to unpack inputs into the model's call function. But, in this case, you need the unused features ('video' in particular) in order to create `pixel_values` (which is a mandatory key our model expects in its inputs).
-
```py
>>> from transformers import TrainingArguments, Trainer
@@ -477,7 +476,6 @@ The simplest way to try out your fine-tuned model for inference is to use it in
You can also manually replicate the results of the `pipeline` if you'd like.
-
```py
>>> def run_inference(model, video):
... # (num_frames, num_channels, height, width)
diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md
index 0e0191af588..b0f698f039e 100644
--- a/docs/source/en/tasks/video_text_to_text.md
+++ b/docs/source/en/tasks/video_text_to_text.md
@@ -18,9 +18,9 @@ rendered properly in your Markdown viewer.
[[open-in-colab]]
-Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning.
+Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning.
-These models have nearly the same architecture as [image-text-to-text](../image_text_to_text) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `
Pass the image and the candidate object labels to look for to the pipeline.
-Here we pass the image directly; other suitable options include a local path to an image or an image url. We also pass text descriptions for all items we want to query the image for.
+Here we pass the image directly; other suitable options include a local path to an image or an image URL. We also pass text descriptions for all the items we want to query the image for.
```py
>>> predictions = detector(
diff --git a/docs/source/en/testing.md b/docs/source/en/testing.md
index 497c6b01931..78c32a58097 100644
--- a/docs/source/en/testing.md
+++ b/docs/source/en/testing.md
@@ -16,7 +16,6 @@ rendered properly in your Markdown viewer.
# Testing
-
Let's take a look at how 🤗 Transformers models are tested and how you can write new tests and improve the existing ones.
There are 2 test suites in the repository:
@@ -51,12 +50,8 @@ RUN_SLOW=1 pytest examples/
The results can be observed [here](https://github.com/huggingface/transformers/actions).
-
-
## Running tests
-
-
### Choosing which tests to run
This document goes into many details of how tests can be run. If after reading everything, you need even more details
@@ -89,8 +84,6 @@ which tells pytest to:
- do not capture output
- run in verbose mode
-
-
### Getting the list of all tests
All tests of the test suite:
@@ -187,7 +180,6 @@ Sometimes you need to run `accelerate` tests on your models. For that you can ju
RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py
```
-
### Run documentation tests
In order to test whether the documentation examples are correct, you should check that the `doctests` are passing.
@@ -217,9 +209,11 @@ Example:
```
Just run the following line to automatically test every docstring example in the desired file:
+
```bash
pytest --doctest-modules
```
+
If the file has a markdown extension, you should add the `--doctest-glob="*.md"` argument.
### Run only modified tests
@@ -271,7 +265,6 @@ directory.
[pytest-watch](https://github.com/joeyespo/pytest-watch) is an alternative implementation of this functionality.
-
### Skip a test module
If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For
@@ -307,7 +300,6 @@ It's good to repeat the tests several times, in sequence, randomly, or in sets,
inter-dependency and state-related bugs (tear down). Straightforward repetition is also useful for detecting
problems that get uncovered by the randomness of DL.
-
#### Repeat tests
- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder):
@@ -403,8 +395,6 @@ pytest -p no:sugar
or uninstall it.
-
-
#### Report each sub-test name and its progress
For a single or a group of tests via `pytest` (after `pip install pytest-pspec`):
@@ -457,7 +447,6 @@ decorators are used to set the requirements of tests CPU/GPU/XPU/TPU-wise:
Let's depict the GPU requirements in the following table:
-
| n gpus | decorator |
|--------|--------------------------------|
| `>= 0` | `@require_torch` |
@@ -466,7 +455,6 @@ Let's depict the GPU requirements in the following table:
| `< 2` | `@require_torch_non_multi_gpu` |
| `< 3` | `@require_torch_up_to_2_gpus` |
-
For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
```python no-style
@@ -520,6 +508,7 @@ Certain devices will require an additional import after importing `torch` for th
```bash
TRANSFORMERS_TEST_BACKEND="torch_npu" pytest tests/utils/test_logging.py
```
+
Alternative backends may also require the replacement of device-specific functions. For example `torch.cuda.manual_seed` may need to be replaced with a device-specific seed setter like `torch.npu.manual_seed` or `torch.xpu.manual_seed` to correctly set a random seed on the device. To specify a new backend with backend-specific device functions when running the test suite, create a Python device specification file `spec.py` in the format:
```python
@@ -536,6 +525,7 @@ MANUAL_SEED_FN = torch.npu.manual_seed
EMPTY_CACHE_FN = torch.npu.empty_cache
DEVICE_COUNT_FN = torch.npu.device_count
```
+
This format also allows for specification of any additional imports required. To use this file to replace equivalent methods in the test suite, set the environment variable `TRANSFORMERS_TEST_DEVICE_SPEC` to the path of the spec file, e.g. `TRANSFORMERS_TEST_DEVICE_SPEC=spec.py`.
Currently, only `MANUAL_SEED_FN`, `EMPTY_CACHE_FN` and `DEVICE_COUNT_FN` are supported for device-specific dispatch.
@@ -610,7 +600,6 @@ You can read [here](https://docs.pytest.org/en/stable/unittest.html) which featu
thing to remember is that most `pytest` fixtures don't work. Neither does `pytest` parametrization, but we use the module
`parameterized`, which works in a similar way.
-
### Parametrization
Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
@@ -719,8 +708,6 @@ pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[i
as in the previous example.
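For reference, a minimal `parameterized` sketch in the spirit of the `test_floor` tests referenced above:

```python
import math
import unittest

from parameterized import parameterized


class TestMathUnitTest(unittest.TestCase):
    @parameterized.expand(
        [
            ("negative", -1.5, -2.0),
            ("integer", 1, 1.0),
        ]
    )
    def test_floor(self, name, value, expected):
        # each tuple becomes its own sub-test, named after the first element
        self.assertEqual(math.floor(value), expected)
```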
-
-
### Files and directories
In tests, we often need to know where things are relative to the current test file, and it's not trivial since the test
@@ -843,7 +830,6 @@ otherwise.
If you need to temporarily override `sys.path` to import from another test, for example, you can use the
`ExtendSysPath` context manager. Example:
-
```python
import os
from transformers.testing_utils import ExtendSysPath
@@ -893,7 +879,6 @@ or the `xfail` way:
def test_feature_x():
```
-
Here's how to skip a test based on internal checks within the test:
```python
@@ -1018,7 +1003,6 @@ That report is also useful to find slow outliers that aren't marked as such, or
If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest
tests.
-
### Testing the stdout/stderr output
In order to test functions that write to `stdout` and/or `stderr`, the test can access those streams using the
@@ -1141,7 +1125,6 @@ print(cs.err, cs.out)
Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit
from the context.
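A minimal sketch of that pattern (the helper function is made up for illustration):

```python
import sys

from transformers.testing_utils import CaptureStd


def noisy_function():
    print("hello from stdout")
    print("hello from stderr", file=sys.stderr)


with CaptureStd() as cs:
    noisy_function()

# both streams are available after the context exits (and are replayed by default)
assert "hello from stdout" in cs.out
assert "hello from stderr" in cs.err
```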
-
### Capturing logger stream
If you need to validate the output of a logger, you can use `CaptureLogger`:
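For example, a minimal sketch could look like this; the logger name is illustrative, and the verbosity is raised so that `info` messages are actually emitted:

```python
from transformers.testing_utils import CaptureLogger
from transformers.utils import logging

logging.set_verbosity_info()  # make sure `info` messages are emitted
logger = logging.get_logger("transformers.models.bart.tokenization_bart")  # any transformers logger works

msg = "Testing 1, 2, 3"
with CaptureLogger(logger) as cl:
    logger.info(msg)

assert cl.out == msg + "\n"
```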
@@ -1193,7 +1176,6 @@ called if anything.
This helper method creates a copy of the `os.environ` object, so the original remains intact.
-
### Getting reproducible results
In some situations you may want to remove randomness for your tests. To get identical reproducible results set, you
@@ -1241,9 +1223,6 @@ To trigger a self-push workflow CI job, you must:
4. Then you can see the job appear [here](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). It may not run right away if there
is a backlog.
-
-
-
## Testing Experimental CI Features
Testing CI features can be problematic as it can interfere with the normal CI functioning. Therefore, if a
diff --git a/docs/source/en/tiny_agents.md b/docs/source/en/tiny_agents.md
index dc53d05a4bf..7266f0236a6 100644
--- a/docs/source/en/tiny_agents.md
+++ b/docs/source/en/tiny_agents.md
@@ -42,4 +42,3 @@ Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/
I have generated an image of a cat on the moon using the Flux 1 Schnell Image Generator. The image is 1024x1024 pixels and was created with 4 inference steps. Let me know if you would like to make any changes or need further assistance!
```
-
diff --git a/docs/source/en/trainer.md b/docs/source/en/trainer.md
index 48325da6893..32f14bc41da 100644
--- a/docs/source/en/trainer.md
+++ b/docs/source/en/trainer.md
@@ -346,7 +346,6 @@ use_cpu: false
-
Run [accelerate_launch](https://hf.co/docs/accelerate/package_reference/cli#accelerate-launch) to start training with the configurations set in `config_file.yaml`. This file is saved to the Accelerate cache folder and automatically loaded when you run `accelerate_launch`.
The example below launches the [run_glue.py](../../../examples/pytorch/text-classification/run_glue) script with the FSDP configuration shown earlier. Parameters from the `config_file.yaml` file can also be directly set in the command line.
diff --git a/docs/source/en/training.md b/docs/source/en/training.md
index ed992e8152d..ccee25704fa 100644
--- a/docs/source/en/training.md
+++ b/docs/source/en/training.md
@@ -52,6 +52,7 @@ dataset = dataset.map(tokenize, batched=True)
> [!TIP]
> Fine-tune on a smaller subset of the full dataset to reduce the time it takes. The results won't be as good compared to fine-tuning on the full dataset, but it is useful to make sure everything works as expected first before committing to training on the full dataset.
+>
> ```py
> small_train = dataset["train"].shuffle(seed=42).select(range(1000))
> small_eval = dataset["test"].shuffle(seed=42).select(range(1000))
diff --git a/docs/source/en/transformers_as_backend.md b/docs/source/en/transformers_as_backend.md
index 422cc4a121e..d1070acea6f 100644
--- a/docs/source/en/transformers_as_backend.md
+++ b/docs/source/en/transformers_as_backend.md
@@ -32,6 +32,7 @@ vLLM automatically selects the best backend, and if a model isn’t natively sup
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
```
+
Add `--model-impl transformers` to `vllm serve` to launch a server with a Transformers model.
```bash
@@ -42,7 +43,6 @@ vllm serve meta-llama/Llama-3.2-1B \
Refer to the [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) for more usage examples and tips on using Transformers as the backend.
-
## SGLang
[SGLang](https://github.com/InternLM/sglang) is a high-performance, OpenAI-compatible server and runtime designed for chat-based LLMs. It offers fast inference, role-based conversation handling, and support for custom pipelines, making it great for building real-world LLM apps.
@@ -57,12 +57,6 @@ print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])
```
Add `impl transformers` to `sglang.launch_server` to launch a server with a Transformers model.
-
-
-
-
-
-
```bash
python3 -m sglang.launch_server \
@@ -133,7 +127,7 @@ class MyModel(PreTrainedModel):
3. This step is optional, but if you want to support tensor parallel and/or pipeline parallel features, add the following keys to the config.
* `base_model_tp_plan` enables [tensor parallelism](./perf_infer_gpu_multi) by mapping fully qualified layer name patterns to tensor parallel styles. Only the `"colwise"` and `"rowwise"` partitioning strategies are currently supported.
* `base_model_pp_plan` enables pipeline parallelism by mapping direct child layer names to tuples of lists of strings. The list in the first element of the tuple contains the names of the input arguments. The list in the last element of the tuple contains the names of the variables the layer outputs to in the modeling code.
-
+
Expand the code below for an example.
@@ -158,6 +152,7 @@ class MyConfig(PretrainedConfig):
"norm": (["hidden_states"], ["hidden_states"]),
}
```
+
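Putting the two plans together, a compact sketch of such a config could look like this; the layer names are illustrative and must match your model's module paths:

```python
from transformers import PretrainedConfig


class MyConfig(PretrainedConfig):
    # tensor parallel: map layer-name patterns to partitioning styles ("colwise" or "rowwise")
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    # pipeline parallel: map direct child modules to (input argument names, output variable names)
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }
```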
### Multimodal models
@@ -200,8 +195,8 @@ class MyMultimodalModelForConditionalGeneration(MyMultimodalPreTrainedModel, Gen
self.model = MyMultimodalModel(config)
self.lm_head = nn.Linear(hidden_dim, vocab_size)
```
-
+
2. A multimodal model config must be nested with the following fields.
* text_config: decoder language model config
@@ -246,6 +241,7 @@ class MyMultimodalProcessor(ProcessorMixin):
vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches})
return MultiModalData(**vision_data)
```
+
## Resources
diff --git a/docs/source/en/troubleshooting.md b/docs/source/en/troubleshooting.md
index 7998881d364..cfc51966893 100644
--- a/docs/source/en/troubleshooting.md
+++ b/docs/source/en/troubleshooting.md
@@ -34,7 +34,6 @@ Sometimes errors occur, but we are here to help! This guide covers some of the m
For more details about troubleshooting and getting help, take a look at [Chapter 8](https://huggingface.co/course/chapter8/1?fw=pt) of the Hugging Face course.
-
## Firewalled environments
Some GPU instances on cloud and intranet setups are firewalled to external connections, resulting in a connection error. When your script attempts to download model weights or datasets, the download will hang and then timeout with the following message:
diff --git a/notebooks/README.md b/notebooks/README.md
index 4d31797104f..aed43587880 100644
--- a/notebooks/README.md
+++ b/notebooks/README.md
@@ -22,7 +22,6 @@ Also, we would like to list here interesting content created by the community.
If you wrote some notebook(s) leveraging 🤗 Transformers and would like to be listed here, please open a
Pull Request so it can be included under the Community notebooks.
-
## Hugging Face's notebooks 🤗
### Documentation notebooks
@@ -38,7 +37,6 @@ You can open any page of the documentation as a notebook in Colab (there is a bu
| [Summary of the tokenizers](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb) | The differences between the tokenizers algorithm |[](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)| [](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)|
| [Multilingual models](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb) | How to use the multilingual models of the library |[](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)| [](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)|
-
### PyTorch Examples
#### Natural Language Processing[[pytorch-nlp]]
@@ -88,7 +86,6 @@ You can open any page of the documentation as a notebook in Colab (there is a bu
| [How to fine-tune a Nucleotide Transformer model](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | See how to tokenize DNA and fine-tune a large pre-trained DNA "language" model | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | [](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) |
| [Fine-tune a Nucleotide Transformer model with LoRA](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | Train even larger DNA models in a memory-efficient way | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | [](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) |
-
#### Other modalities[[pytorch-other]]
| Notebook | Description | | |
@@ -101,7 +98,6 @@ You can open any page of the documentation as a notebook in Colab (there is a bu
|:----------|:-------------|:-------------|------:|
| [How to export model to ONNX](https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| Highlight how to export and run inference workloads through ONNX | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| [](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)|
-
### Optimum notebooks
🤗 [Optimum](https://github.com/huggingface/optimum) is an extension of 🤗 Transformers, providing a set of performance optimization tools to train and run models with maximum efficiency on targeted hardware.