Fix white space in documentation (#41157)

* Fix white space

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Revert changes

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

* Fix autodoc

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

---------

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
Yuanyuan Chen
2025-10-01 00:41:03 +08:00
committed by GitHub
parent 16a141765c
commit 374ded5ea4
148 changed files with 338 additions and 246 deletions

View File

@ -278,13 +278,14 @@ are working on it).<br>
useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.<br>
☐ Make sure existing tests pass.<br>
☐ If adding a new feature, also add tests for it.<br>
- If you are adding a new model, make sure you use
`ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` to trigger the common tests.
- If you are adding new `@slow` tests, make sure they pass using
`RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`.
- If you are adding a new tokenizer, write tests and make sure
`RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` passes.
- CircleCI does not run the slow tests, but GitHub Actions does every night!<br>
☐ All public methods must have informative docstrings (see
[`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py)
@ -340,6 +341,7 @@ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/t
```
Like the slow tests, there are other environment variables available which are not enabled by default during testing:
- `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.
More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py).

View File

@ -193,4 +193,4 @@ def custom_attention_mask(
It mostly works thanks to the `mask_function`, which is a `Callable` in the form of [torch's mask_mod functions](https://pytorch.org/blog/flexattention/), taking 4 indices as input and returning a boolean to indicate if this position should take part in the attention computation.
If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py).
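For illustration, a minimal sketch of such a callable might look like the following, where the 128-token window is an arbitrary example value:

```python
def sliding_window_causal(batch_idx, head_idx, q_idx, kv_idx):
    # Allow a query to attend only to past positions within a 128-token window.
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= 128)
```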

View File

@ -210,9 +210,9 @@ There are some rules for documenting different types of arguments and they're li
This can span multiple lines.
```
* Include `type` in backticks.
* Add *optional* if the argument is not required or has a default value.
* Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.
These arguments can also be passed to `@auto_docstring` as a `custom_args` argument. It is used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.
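Putting these rules together, an argument block for a hypothetical `temperature` argument would look roughly like this:

```
temperature (`float`, *optional*, defaults to 1.0):
    Value used to modulate the next token probabilities.
    This can span multiple lines.
```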

View File

@ -162,6 +162,7 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].
The legacy format is essentially the same data structure but organized differently.
- It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer.
- The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
- The format is less flexible and doesn't support features like quantization or offloading.
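As a quick sketch, assuming the `from_legacy_cache`/`to_legacy_cache` helpers on [`DynamicCache`] available in recent releases, you can convert between the two formats like this:

```python
import torch
from transformers import DynamicCache

# Dummy legacy cache: one (key, value) tuple per layer, each tensor shaped
# [batch_size, num_heads, seq_len, head_dim].
legacy_cache = tuple((torch.zeros(1, 4, 8, 16), torch.zeros(1, 4, 8, 16)) for _ in range(2))

cache = DynamicCache.from_legacy_cache(legacy_cache)  # legacy tuples -> Cache API
legacy_again = cache.to_legacy_cache()                # Cache API -> legacy tuples
```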

View File

@ -221,4 +221,4 @@ model_input = tokenizer.apply_chat_template(
messages,
tools = [current_time, multiply]
)
```

View File

@ -77,9 +77,9 @@ Mistral-7B-Instruct uses `[INST]` and `[/INST]` tokens to indicate the start and
The input to `apply_chat_template` should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker, and the `content` key contains the message. The common roles are:
- `user` for messages from the user
- `assistant` for messages from the model
- `system` for directives on how the model should act (usually placed at the beginning of the chat)
[`apply_chat_template`] takes this list and returns a formatted sequence. Set `tokenize=True` if you want to tokenize the sequence.
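For example, assuming a checkpoint that uses this format, a short sketch looks like:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is gravity?"},
    {"role": "assistant", "content": "Gravity is the force that attracts objects toward each other."},
    {"role": "user", "content": "And on the Moon?"},
]

# Returns the formatted string with [INST]/[/INST] markers; set tokenize=True for token ids.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```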

View File

@ -21,6 +21,7 @@ where `port` is the port used by `transformers serve` (`8000` by default). On th
</h3>
You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order:
1. Unselect ALL models in the list above (e.g. `gpt4`, ...);
2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`);
3. Add some random text to the OpenAI API Key field. This field won't be used, but it can't be empty.

View File

@ -229,6 +229,7 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
## Custom generation methods
Custom generation methods enable specialized behavior such as:
- having the model continue thinking if it is uncertain;
- rolling back generation if the model gets stuck;
- handling special tokens with custom logic;
@ -301,6 +302,7 @@ Updating your Python requirements accordingly will remove this error message.
### Creating a custom generation method
To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
1. The model you've designed your generation method with.
2. `custom_generate/generate.py`, which contains all the logic for your custom generation method (a minimal sketch follows this list).
3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
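As a rough sketch (a real method should reuse the validation and input preparation from [`~GenerationMixin.generate`], as recommended below), `custom_generate/generate.py` only needs to expose a `generate` function whose first argument is the model, for example:

```python
import torch

def generate(model, input_ids, generation_config=None, max_new_tokens=20, **kwargs):
    # Minimal greedy decoding loop, shown only to illustrate the expected entry point.
    for _ in range(max_new_tokens):
        next_token_logits = model(input_ids).logits[:, -1, :]
        next_token = next_token_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```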
@ -377,6 +379,7 @@ def generate(model, input_ids, generation_config=None, left_padding=None, **kwar
```
Follow the recommended practices below to ensure your custom generation method works as expected.
- Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
- Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
- Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
@ -410,6 +413,7 @@ tags:
```
Recommended practices:
- Document input and output differences in [`~GenerationMixin.generate`].
- Add self-contained examples to enable quick experimentation.
- Describe soft-requirements such as if the method only works well with a certain family of models.
@ -442,6 +446,7 @@ output = model.generate(
### Finding custom generation methods
You can find all custom generation methods by [searching for their custom tag](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods:
- [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community;
- [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`.

View File

@ -185,9 +185,9 @@ See the [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/
The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example:
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
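As a small sketch of the head/base split (attribute names as in the current PyTorch implementation, shown here for illustration):

```python
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("openai-community/gpt2", num_labels=2)
print(model.transformer.__class__.__name__)  # GPT2Model, the base model
print(model.score)                           # the head: Linear(in_features=768, out_features=2, bias=False)
```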
## I

View File

@ -149,4 +149,4 @@ Call [print_trainable_parameters](https://huggingface.co/docs/peft/package_refer
```py
model.print_trainable_parameters()
"trainable params: 589,824 || all params: 94,274,096 || trainable%: 0.6256"
```

View File

@ -218,9 +218,9 @@ path reference to the associated `.safetensors` file. Each tensor is written to
the state dictionary. File names are constructed using the `module_path` as a prefix with a few possible postfixes that
are built recursively.
* Module inputs are denoted with the `_inputs` and outputs by `_outputs`.
* `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
* `dict` instances will be postfixed with `_{key}`.
### Comparing between implementations
@ -255,6 +255,7 @@ how many tests are being skipped and for which models.
When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip?
This utility:
- scans all `test_modeling_common` methods
- looks for cases where a test is skipped
- returns a summary JSON you can load as a DataFrame or inspect directly

View File

@ -94,6 +94,7 @@ model.generate(**inputs, num_beams=4, do_sample=True)
```
[`~GenerationMixin.generate`] can also be extended with external libraries or custom code:
1. the `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution;
2. the `stopping_criteria` parameter supports custom [`StoppingCriteria`] to stop text generation;
3. other custom generation methods can be loaded through the `custom_generate` flag ([docs](generation_strategies.md/#custom-decoding-methods)).
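For illustration, a minimal sketch combining both extension points (the processor and criterion below are hypothetical examples, not built-in classes):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
    StoppingCriteria,
    StoppingCriteriaList,
)

class TemperatureScaler(LogitsProcessor):
    """Hypothetical processor that rescales the next-token logits."""
    def __init__(self, factor: float):
        self.factor = factor

    def __call__(self, input_ids, scores):
        return scores / self.factor

class MaxTotalLength(StoppingCriteria):
    """Hypothetical criterion that stops once the sequence reaches `max_len` tokens."""
    def __init__(self, max_len: int):
        self.max_len = max_len

    def __call__(self, input_ids, scores, **kwargs):
        return input_ids.shape[-1] >= self.max_len

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
inputs = tokenizer("Plants create energy through", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    logits_processor=LogitsProcessorList([TemperatureScaler(1.3)]),
    stopping_criteria=StoppingCriteriaList([MaxTotalLength(32)]),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```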

View File

@ -80,6 +80,7 @@ We use both in the `transformers` library. We leverage and adapt `logging`'s `ca
management of these warning messages by the verbosity setters above.
What does that mean for developers of the library? We should respect the following heuristics:
- `warnings` should be favored for developers of the library and libraries dependent on `transformers`
- `logging` should be used for end-users of the library using it in everyday projects
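A small sketch of the two mechanisms side by side (the messages are illustrative only):

```python
import warnings
from transformers.utils import logging

# End users tune the library's log output with the verbosity setters.
logging.set_verbosity_info()
logger = logging.get_logger("transformers")
logger.info("An informational message aimed at end users.")

# Warnings target developers building on top of the library, e.g. deprecations.
warnings.warn("This argument is deprecated (illustrative example only).", FutureWarning)
```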

View File

@ -17,6 +17,7 @@ rendered properly in your Markdown viewer.
# Processors
Processors can mean two different things in the Transformers library:
- the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text)
or [CLIP](../model_doc/clip) (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD.

View File

@ -30,15 +30,15 @@ like token streaming.
## GenerationConfig
[[autodoc]] generation.GenerationConfig
- from_pretrained
- from_model_config
- save_pretrained
- update
- validate
- get_generation_mode
## GenerationMixin
[[autodoc]] GenerationMixin
- generate
- compute_transition_scores

View File

@ -148,6 +148,7 @@ for label, score in zip(candidate_labels, probs):
```
## Resources
- Refer to the [Kakao Brains Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.
## AlignConfig

View File

@ -102,4 +102,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
## ArceeForTokenClassification
[[autodoc]] ArceeForTokenClassification
- forward

View File

@ -123,6 +123,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- See also: [Image classification task guide](../tasks/image_classification)
**Semantic segmentation**
- [Semantic segmentation task guide](../tasks/semantic_segmentation)
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

View File

@ -156,4 +156,4 @@ print(tokenizer.decode(outputs[0]))
## BertGenerationDecoder
[[autodoc]] BertGenerationDecoder
- forward

View File

@ -88,6 +88,7 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran
</hfoptions>
## Notes
- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.

View File

@ -87,6 +87,7 @@ print(f"The predicted token is: {predicted_token}")
</hfoptions>
## Notes
- Inputs should be padded on the right because BigBird uses absolute position embeddings.
- BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
- The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.

View File

@ -36,6 +36,7 @@ The original code can be found [here](https://github.com/google-research/big_tra
## Usage tips
- BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494),
2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant
impact on transfer learning.
@ -72,4 +73,4 @@ If you're interested in submitting a resource to be included here, please feel f
## BitForImageClassification
[[autodoc]] BitForImageClassification
- forward

View File

@ -38,22 +38,22 @@ Several versions of the model weights are available on Hugging Face:
### Model Details
* **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
* Uses Rotary Position Embeddings (RoPE).
* Uses squared ReLU (ReLU²) activation in FFN layers.
* Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
* No bias terms in linear or normalization layers.
* **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8). A rough sketch of both quantizers follows this list.
* Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
* Activations are quantized to 8-bit integers using absmax quantization (per-token).
* **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
* **Parameters:** ~2 Billion
* **Training Tokens:** 4 Trillion
* **Context Length:** Maximum sequence length of **4096 tokens**.
* *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
* **Training Stages:**
1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
* **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).
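The quantizers referenced in the list above can be sketched roughly as follows; this is a simplified illustration of absmean weight and per-token absmax activation quantization, not the exact kernels used by the model:

```python
import torch

def absmean_weight_quant(w: torch.Tensor):
    # Scale by the mean absolute value, then round to the ternary set {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1), scale

def absmax_activation_quant(x: torch.Tensor, bits: int = 8):
    # Per-token absmax scaling into the signed 8-bit integer range.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / qmax
    return (x / scale).round().clamp(-qmax, qmax), scale
```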
## Usage tips

View File

@ -128,7 +128,7 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam
## BlipTextLMHeadModel
[[autodoc]] BlipTextLMHeadModel
- forward
## BlipVisionModel

View File

@ -43,16 +43,19 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
See also:
- [Causal language modeling task guide](../tasks/language_modeling)
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
⚡️ Inference
- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).
⚙️ Training
- A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).
## BloomConfig

View File

@ -16,10 +16,10 @@ rendered properly in your Markdown viewer.
*This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# CamemBERT

View File

@ -119,4 +119,4 @@ Currently, following scales of pretrained Chinese-CLIP models are available on
## ChineseCLIPVisionModel
[[autodoc]] ChineseCLIPVisionModel
- forward

View File

@ -106,4 +106,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
## CLIPSegForImageSegmentation
[[autodoc]] CLIPSegForImageSegmentation
- forward

View File

@ -122,6 +122,7 @@ visualizer("Plants create energy through a process known as")
</div>
## Notes
- Don't use the dtype parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).
## CohereConfig

View File

@ -49,4 +49,4 @@ This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The ori
## CpmAntForCausalLM
[[autodoc]] CpmAntForCausalLM
- all

View File

@ -103,6 +103,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`Data2VecVisionForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
**Data2VecText documentation resources**
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
@ -111,10 +112,12 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [Multiple choice task guide](../tasks/multiple_choice)
**Data2VecAudio documentation resources**
- [Audio classification task guide](../tasks/audio_classification)
- [Automatic speech recognition task guide](../tasks/asr)
**Data2VecVision documentation resources**
- [Image classification](../tasks/image_classification)
- [Semantic segmentation](../tasks/semantic_segmentation)

View File

@ -92,6 +92,7 @@ echo -e '{"text": "A soccer game with multiple people playing.", "text_pair": "S
</hfoptions>
## Notes
- DeBERTa uses **relative position embeddings**, so it does not require **right-padding** like BERT.
- For best results, use DeBERTa on sentence-level or sentence-pair classification tasks like MNLI, RTE, or SST-2.
- If you're using DeBERTa for token-level tasks like masked language modeling, make sure to load a checkpoint specifically pretrained or fine-tuned for token-level tasks.

View File

@ -47,4 +47,4 @@ The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures f
## DeepseekV2ForSequenceClassification
[[autodoc]] DeepseekV2ForSequenceClassification
- forward

View File

@ -16,9 +16,9 @@ rendered properly in your Markdown viewer.
*This model was released on 2020-10-08 and added to Hugging Face Transformers on 2022-09-14.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# Deformable DETR

View File

@ -68,4 +68,4 @@ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, nu
DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct).
</Tip>

View File

@ -86,4 +86,4 @@ Image.fromarray(depth.astype("uint8"))
## DepthAnythingForDepthEstimation
[[autodoc]] DepthAnythingForDepthEstimation
- forward

View File

@ -110,4 +110,4 @@ If you're interested in submitting a resource to be included here, please feel f
## DepthAnythingForDepthEstimation
[[autodoc]] DepthAnythingForDepthEstimation
- forward

View File

@ -84,12 +84,13 @@ alt="drawing" width="600"/>
The `DepthProForDepthEstimation` model uses a `DepthProEncoder`, for encoding the input image and a `FeatureFusionStage` for fusing the output features from encoder.
The `DepthProEncoder` further uses two encoders:
- `patch_encoder`
- Input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
- Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
- These patches are processed by the **`patch_encoder`**
- `image_encoder`
- Input image is also rescaled to `patch_size` and processed by the **`image_encoder`**
Both these encoders can be configured via `patch_model_config` and `image_model_config` respectively, both of which are separate `Dinov2Model` by default.
@ -159,8 +160,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- Official Implementation: [apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)
- DepthPro Inference Notebook: [DepthPro Inference](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DepthPro_inference.ipynb)
- DepthPro for Super Resolution and Image Segmentation
- Read blog on Medium: [Depth Pro: Beyond Depth](https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba)
- Code on Github: [geetu040/depthpro-beyond-depth](https://github.com/geetu040/depthpro-beyond-depth)
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

View File

@ -16,9 +16,9 @@ rendered properly in your Markdown viewer.
*This model was released on 2020-05-26 and added to Hugging Face Transformers on 2021-06-09.*
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# DETR

View File

@ -65,6 +65,7 @@ DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
Notes:
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.

View File

@ -25,6 +25,7 @@ The [Vision Transformer](vit) (ViT) is a transformer encoder model (BERT-like) o
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include [DINOv2](dinov2) and [MAE](vit_mae).
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It's due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called "register" tokens), which you only use during pre-training (and throw away afterwards). This results in:
- no artifacts
- interpretable attention maps
- and improved performances.
@ -57,4 +58,4 @@ The original code can be found [here](https://github.com/facebookresearch/dinov2
## Dinov2WithRegistersForImageClassification
[[autodoc]] Dinov2WithRegistersForImageClassification
- forward

View File

@ -101,4 +101,4 @@ outputs = model.generate(
## DogeForSequenceClassification
[[autodoc]] DogeForSequenceClassification
- forward

View File

@ -44,9 +44,9 @@ This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The ori
- DPR consists of three models:
* Question encoder: encode questions as vectors
* Context encoder: encode contexts as vectors
* Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
## DPRConfig

View File

@ -144,27 +144,23 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## EfficientLoFTRImageProcessor
[[autodoc]] EfficientLoFTRImageProcessor
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
## EfficientLoFTRImageProcessorFast
[[autodoc]] EfficientLoFTRImageProcessorFast
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
## EfficientLoFTRModel
[[autodoc]] EfficientLoFTRModel
- forward
## EfficientLoFTRForKeypointMatching
[[autodoc]] EfficientLoFTRForKeypointMatching
- forward

View File

@ -207,4 +207,4 @@ plt.show()
## EomtForUniversalSegmentation
[[autodoc]] EomtForUniversalSegmentation
- forward

View File

@ -204,4 +204,4 @@ print(tokenizer.decode(output[0]))
## Exaone4ForQuestionAnswering
[[autodoc]] Exaone4ForQuestionAnswering
- forward

View File

@ -30,5 +30,6 @@ Depth up-scaling for improved reasoning: Building on recent studies on the effec
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, we developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
## Resources
- [Blog post](https://huggingface.co/blog/falcon3)
- [Models on Huggingface](https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026)

View File

@ -60,4 +60,4 @@ print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
[[autodoc]] FalconH1ForCausalLM
- forward
This HF implementation is contributed by [younesbelkada](https://github.com/younesbelkada) and [DhiaEddineRhaiem](https://github.com/dhiaEddineRhaiem).

View File

@ -44,6 +44,7 @@ community for further reproducible experiments in French NLP.*
This model was contributed by [formiel](https://huggingface.co/formiel). The original code can be found [here](https://github.com/getalp/Flaubert).
Tips:
- Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
## Resources

View File

@ -138,21 +138,21 @@ print(parsed_answer)
## Notes
- Florence-2 is a prompt-based model. You need to provide a task prompt to tell the model what to do. Supported tasks are:
- `<OCR>`
- `<OCR_WITH_REGION>`
- `<CAPTION>`
- `<DETAILED_CAPTION>`
- `<MORE_DETAILED_CAPTION>`
- `<OD>`
- `<DENSE_REGION_CAPTION>`
- `<CAPTION_TO_PHRASE_GROUNDING>`
- `<REFERRING_EXPRESSION_SEGMENTATION>`
- `<REGION_TO_SEGMENTATION>`
- `<OPEN_VOCABULARY_DETECTION>`
- `<REGION_TO_CATEGORY>`
- `<REGION_TO_DESCRIPTION>`
- `<REGION_TO_OCR>`
- `<REGION_PROPOSAL>`
- The raw output of the model is a string that needs to be parsed. The [`Florence2Processor`] has a [`~Florence2Processor.post_process_generation`] method that can parse the string into a more usable format, like bounding boxes and labels for object detection.
## Resources

View File

@ -121,9 +121,9 @@ echo -e "Plants create energy through a process known as" | transformers run --t
## Notes
- Use [`Gemma3nForConditionalGeneration`] for image-audio-and-text, image-and-text, image-and-audio, audio-and-text,
image-only and audio-only inputs.
- Gemma 3n supports multiple images per input, but make sure the images are correctly batched before passing them to
the processor. Each batch should be a list of one or more images.
```py
@ -148,11 +148,11 @@ echo -e "Plants create energy through a process known as" | transformers run --t
]
```
- Text passed to the processor should have a `<image_soft_token>` token wherever an image should be inserted.
- Gemma 3n accepts at most one target audio clip per input, though multiple audio clips can be provided in few-shot
prompts, for example.
- Text passed to the processor should have a `<audio_soft_token>` token wherever an audio clip should be inserted.
- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.
## Gemma3nAudioFeatureExtractor

View File

@ -81,4 +81,4 @@ The resource should ideally demonstrate something new instead of duplicating an
## GitForCausalLM
[[autodoc]] GitForCausalLM
- forward

View File

@ -35,6 +35,7 @@ Through our open-source work, we aim to explore the technological frontier toget
![bench_45](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg)
Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including:
- **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition)
- **Video understanding** (long video segmentation and event recognition)
- **GUI tasks** (screen reading, icon recognition, desktop operation assistance)

View File

@ -36,6 +36,7 @@ The model is an optimized [GPT2 model](https://huggingface.co/docs/transformers/
## Implementation details
The main differences compared to GPT2.
- Added support for Multi-Query Attention.
- Use `gelu_pytorch_tanh` instead of classic `gelu`.
- Avoid unnecessary synchronizations (this has since been added to GPT2 in #20061, but wasn't in the reference codebase).

View File

@ -133,6 +133,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`GPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling), [text generation example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation), and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
**Documentation resources**
- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
- [Causal language modeling task guide](../tasks/language_modeling)

View File

@ -37,6 +37,7 @@ Note that most of the aforementioned components are implemented generically to e
This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9944), [Avihu Dekel](https://huggingface.co/Avihu), and [George Saon](https://huggingface.co/gsaon).
## Usage tips
- This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied!
<!-- TODO (@alex-jw-brooks) Add an example here once the model compatible with the transformers implementation is released -->

View File

@ -62,4 +62,4 @@ This HF implementation is contributed by [Mayank Mishra](https://huggingface.co/
## GraniteMoeSharedForCausalLM
[[autodoc]] GraniteMoeSharedForCausalLM
- forward

View File

@ -22,6 +22,7 @@ rendered properly in your Markdown viewer.
The [Granite Vision](https://www.ibm.com/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more) model is a variant of [LLaVA-NeXT](llava_next), leveraging a [Granite](granite) language model alongside a [SigLIP](SigLIP) visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to [VipLlava](vipllava). It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios.
Tips:
- This model is loaded into Transformers as an instance of LlaVA-Next. The usage and tips from [LLaVA-NeXT](llava_next) apply to this model as well.
- You can apply the chat template on the tokenizer / processor in the same way as well. Example chat format:

View File

@ -89,4 +89,4 @@ print(f"The predicted class label is: {predicted_class_label}")
## HGNetV2ForImageClassification
[[autodoc]] HGNetV2ForImageClassification
- forward

View File

@ -52,4 +52,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
## InformerForPrediction
[[autodoc]] InformerForPrediction
- forward

View File

@ -77,4 +77,4 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipForConditionalGeneration
- forward
- generate

View File

@ -19,6 +19,7 @@ rendered properly in your Markdown viewer.
## Overview
[Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai's lab has released two model checkpoints:
- [kyutai/stt-1b-en_fr](https://huggingface.co/kyutai/stt-1b-en_fr): a 1B-parameter model capable of transcribing both English and French
- [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en): a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

View File

@ -37,8 +37,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
## Usage tips
- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
- images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
- text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
@ -73,6 +73,7 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
- [Question answering task guide](../tasks/question_answering)
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)
## LayoutLMv3Config

View File

@ -82,4 +82,4 @@ print(tokenizer.decode(output[0], skip_special_tokens=False))
## Lfm2ForCausalLM
[[autodoc]] Lfm2ForCausalLM
- forward

View File

@ -28,6 +28,7 @@ rendered properly in your Markdown viewer.
## Architecture
LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:
* Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B
* Base (86M) for fast image processing for LFM2-VL-450M

View File

@ -143,13 +143,11 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## LightGlueImageProcessor
[[autodoc]] LightGlueImageProcessor
- preprocess
- post_process_keypoint_matching
- visualize_keypoint_matching
## LightGlueForKeypointMatching
[[autodoc]] LightGlueForKeypointMatching
- forward

View File

@ -62,6 +62,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- Demo notebooks for LiLT can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT).
**Documentation resources**
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)

View File

@ -27,9 +27,11 @@ rendered properly in your Markdown viewer.
[Llama 4](https://ai.meta.com/blog/llama-4-multimodal-intelligence/), developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture.
This generation includes two models:
- The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
- The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs.
Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages
(with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).
@ -421,24 +423,24 @@ model = Llama4ForConditionalGeneration.from_pretrained(
## Llama4ForConditionalGeneration
[[autodoc]] Llama4ForConditionalGeneration
- forward
## Llama4ForCausalLM
[[autodoc]] Llama4ForCausalLM
- forward
## Llama4TextModel
[[autodoc]] Llama4TextModel
- forward
## Llama4VisionModel
[[autodoc]] Llama4VisionModel
- forward

View File

@ -57,6 +57,7 @@ The attributes can be obtained from model config, as `model.config.vision_config
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

View File

@ -64,6 +64,7 @@ The attributes can be obtained from model config, as `model.config.vision_config
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

View File

@ -59,6 +59,7 @@ Tips:
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

View File

@ -30,6 +30,7 @@ performance, similar to [LayoutLM](layoutlm).
The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains
state-of-the-art results on 2 important benchmarks:
- [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)
- [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset
for information extraction from web pages (basically named-entity recognition on web pages)

View File

@ -86,4 +86,4 @@ The resource should ideally demonstrate something new instead of duplicating an
- preprocess
- post_process_semantic_segmentation
- post_process_instance_segmentation
- post_process_panoptic_segmentation

View File

@ -44,7 +44,7 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The
## Usage tips
- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
- If you want to train the model in a distributed environment across multiple nodes, you should update the
`get_num_masks` function inside the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, it should be
set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169).
@ -102,4 +102,4 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The
## MaskFormerForInstanceSegmentation
[[autodoc]] MaskFormerForInstanceSegmentation
- forward

View File

@ -79,4 +79,4 @@ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, nu
MatCha is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct).
</Tip>

View File

@ -35,12 +35,12 @@ The architecture of MiniMax is briefly described as follows:
- Activated Parameters per Token: 45.9B
- Number Layers: 80
- Hybrid Attention: a softmax attention is positioned after every 7 lightning attention.
- Number of attention heads: 64
- Attention head dimension: 128
- Mixture of Experts:
- Number of experts: 32
- Expert hidden dimension: 9216
- Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064

View File

@ -83,4 +83,4 @@ The example below demonstrates how to use Ministral for text generation:
## MinistralForQuestionAnswering
[[autodoc]] MinistralForQuestionAnswering
- forward

View File

@ -163,4 +163,4 @@ Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/bl
## MistralForQuestionAnswering
[[autodoc]] MistralForQuestionAnswering
- forward

View File

@ -42,6 +42,7 @@ Mixtral-8x7B is a decoder-only Transformer with the following architectural choi
- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts has to be loaded in RAM (a 70B-like RAM requirement), each token from the hidden states is dispatched twice (top-2 routing), and thus the compute (the operation required at each forward computation) is just 2 X sequence_length.
The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral):
- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
- GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens.
@ -55,6 +56,7 @@ For more details refer to the [release blog post](https://mistral.ai/news/mixtra
## Usage tips
The Mistral team has released 2 checkpoints:
- a base model, [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), which has been pre-trained to predict the next token on internet-scale data.
- an instruction tuned model, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO).

View File

@ -85,10 +85,10 @@ print(f"The predicted class label is: {predicted_class_label}")
## Notes
- Checkpoint names follow the pattern `mobilenet_v1_{depth_multiplier}_{resolution}`, like `mobilenet_v1_1.0_224`. `1.0` is the depth multiplier and `224` is the image resolution.
- While trained on images of a specific size, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV1ImageProcessor`] handles the necessary preprocessing.
- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0).
- The original TensorFlow checkpoints determine the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV1Config`].
```python
from transformers import MobileNetV1Config
@ -96,11 +96,11 @@ print(f"The predicted class label is: {predicted_class_label}")
config = MobileNetV1Config.from_pretrained("google/mobilenet_v1_1.0_224", tf_padding=True)
```
- The Transformers implementation does not support the following features.
- Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel.
- Does not support other `output_stride` values (fixed at 32). For smaller `output_stride` values, the original implementation uses dilated convolutions to prevent the spatial resolution from being reduced further.
- `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes.
- Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights.
## MobileNetV1Config

View File

@ -82,11 +82,11 @@ print(f"The predicted class label is: {predicted_class_label}")
## Notes
- Classification checkpoint names follow the pattern `mobilenet_v2_{depth_multiplier}_{resolution}`, like `mobilenet_v2_1.4_224`. `1.4` is the depth multiplier and `224` is the image resolution. Segmentation checkpoint names follow the pattern `deeplabv3_mobilenet_v2_{depth_multiplier}_{resolution}`.
- While trained on images of a specific size, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV2ImageProcessor`] handles the necessary preprocessing.
- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0).
- The segmentation models use a [DeepLabV3+](https://huggingface.co/papers/1802.02611) head which is often pretrained on datasets like [PASCAL VOC](https://huggingface.co/datasets/merve/pascal-voc).
- The original TensorFlow checkpoints determine the padding amount at inference time because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV2Config`].
```python
from transformers import MobileNetV2Config
@ -94,11 +94,11 @@ print(f"The predicted class label is: {predicted_class_label}")
config = MobileNetV2Config.from_pretrained("google/mobilenet_v2_1.4_224", tf_padding=True)
```
- The Transformers implementation does not support the following features.
- Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel.
- `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes.
- Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights.
- For segmentation models, the final convolution layer of the backbone is computed even though the DeepLabV3+ head doesn't use it.
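To try the segmentation variant mentioned in the notes, the minimal sketch below runs the DeepLabV3+ head and upsamples the logits back to the input resolution. The checkpoint name and test image are assumptions you can replace.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, MobileNetV2ForSemanticSegmentation

processor = AutoImageProcessor.from_pretrained("google/deeplabv3_mobilenet_v2_1.0_513")
model = MobileNetV2ForSemanticSegmentation.from_pretrained("google/deeplabv3_mobilenet_v2_1.0_513")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, num_labels, height, width) at a reduced resolution

# upsample to the input size and take the per-pixel argmax to get a class map
segmentation = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)
print(segmentation.shape)
```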
## MobileNetV2Config

View File

@ -28,6 +28,7 @@ Unless required by applicable law or agreed to in writing, software distributed
You can find all the original MobileViT checkpoints under the [Apple](https://huggingface.co/apple/models?search=mobilevit) organization.
> [!TIP]
>
> - This model was contributed by [matthijs](https://huggingface.co/Matthijs).
>
> Click on the MobileViT models in the right sidebar for more examples of how to apply MobileViT to different vision tasks.

View File

@ -38,6 +38,7 @@ The abstract from the paper is the following:
*We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.*
Moshi deals with 3 streams of information:
1. The user's audio
2. Moshi's audio
3. Moshi's textual output
@ -70,6 +71,7 @@ The original checkpoints can be converted using the conversion script `src/trans
### How to use the model:
This implementation has two main aims:
1. quickly test model generation by simplifying the original API
2. simplify training. A training guide will come soon, but user contributions are welcomed!
@ -84,6 +86,7 @@ It is designed for intermediate use. We strongly recommend using the original [i
Moshi is a streaming auto-regressive model with two streams of audio. To put it differently, one audio stream corresponds to what the model said/will say and the other audio stream corresponds to what the user said/will say.
[`MoshiForConditionalGeneration.generate`] thus needs 3 inputs:
1. `input_ids` - corresponding to the text token history
2. `moshi_input_values` or `moshi_audio_codes` - corresponding to the model audio history
3. `user_input_values` or `user_audio_codes` - corresponding to the user audio history
@ -91,6 +94,7 @@ Moshi is a streaming auto-regressive model with two streams of audio. To put it
These three inputs must be synchronized, meaning that their lengths must correspond to the same number of tokens.
You can dynamically use the 3 inputs depending on what you want to test:
1. Simply check the model response to a user prompt - in that case, `input_ids` can be filled with pad tokens and `user_input_values` can be a zero tensor of the same shape as the user prompt.
2. Test more complex behaviour - in that case, you must be careful about how the input tokens are synchronized with the audio streams.

View File

@ -64,4 +64,4 @@ The original code can be found [here](https://github.com/mlpen/mra-attention).
## MraForQuestionAnswering
[[autodoc]] MraForQuestionAnswering
- forward

View File

@ -230,6 +230,7 @@ generation config.
## Model Structure
The MusicGen model can be decomposed into three distinct stages:
1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5
2. MusicGen decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations
3. Audio encoder/decoder: used to encode an audio prompt to use as prompt tokens, and recover the audio waveform from the audio tokens predicted by the decoder
@ -256,6 +257,7 @@ be combined with the frozen text encoder and audio encoder/decoders to recover t
model.
Tips:
* MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model.
* Sampling mode tends to deliver better results than greedy decoding - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenForConditionalGeneration.generate`]
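For example, the snippet below generates audio from a text prompt with sampling enabled, which is the setting recommended above. It assumes the `facebook/musicgen-small` checkpoint; larger checkpoints work the same way.

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)

# do_sample=True toggles sampling, which usually sounds better than greedy decoding
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

# the audio decoder is the 32kHz Encodec checkpoint, so save the waveform at that rate
sampling_rate = model.config.audio_encoder.sampling_rate
print(audio_values.shape, sampling_rate)
```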

View File

@ -40,6 +40,7 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o
## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)
There are two key differences with MusicGen:
1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen.
@ -224,6 +225,7 @@ Note that any arguments passed to the generate method will **supersede** those i
## Model Structure
The MusicGen model can be decomposed into three distinct stages:
1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5.
2. MusicGen Melody decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations
3. Audio decoder: used to recover the audio waveform from the audio tokens predicted by the decoder.
@ -253,6 +255,7 @@ python src/transformers/models/musicgen_melody/convert_musicgen_melody_transform
```
Tips:
* MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model.
* Sampling mode tends to deliver better results than greedy decoding - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenMelodyForConditionalGeneration.generate`]
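The sketch below shows the audio-conditioned path described above: the processor turns the prompt audio into the conditioning signal that is used alongside the text. The checkpoint name and the demo dataset are assumptions you can swap for your own audio.

```python
from datasets import load_dataset
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

# any mono waveform works as the audio prompt; here we stream one sample from a public dataset
dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]

inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
print(audio_values.shape)
```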

View File

@ -68,6 +68,7 @@ The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, widt
`(batch_size, height, width, num_channels)`.
Notes:
- NAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention.
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten),
or build it on your system by running `pip install natten`.

View File

@ -128,9 +128,9 @@ visualizer("UN Chief says there is no military solution in Syria")
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
```
- For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword as shown below.
- See example below for a translation from Romanian to German.
```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

View File

@ -39,7 +39,7 @@ This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3
## Usage tips
- OneFormer requires two inputs during inference: *image* and *task token*.
- During training, OneFormer only uses panoptic annotations.
- If you want to train the model in a distributed environment across multiple nodes, you should update the
`get_num_masks` function inside the `OneFormerLoss` class of `modeling_oneformer.py`. When training on multiple nodes, this should be
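To make the two required inputs concrete, the sketch below runs semantic inference: the processor builds the *task token* from the `task_inputs` string while also preparing the *image*. The checkpoint and image URL are placeholders you can replace.

```python
import requests
import torch
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# the processor prepares both required inputs: the pixel values and the task token
inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# resize and merge the class/mask predictions into a per-pixel semantic map
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]
print(semantic_map.shape)
```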

View File

@ -84,22 +84,22 @@ echo -e "The future of AI is" | transformers run --task text-generation --model
## OpenAIGPTModel
[[autodoc]] OpenAIGPTModel
- forward
## OpenAIGPTLMHeadModel
[[autodoc]] OpenAIGPTLMHeadModel
- forward
## OpenAIGPTDoubleHeadsModel
[[autodoc]] OpenAIGPTDoubleHeadsModel
- forward
## OpenAIGPTForSequenceClassification
[[autodoc]] OpenAIGPTForSequenceClassification
- forward
## OpenAIGPTTokenizer

View File

@ -27,12 +27,13 @@ rendered properly in your Markdown viewer.
Parakeet models, [introduced by NVIDIA NeMo](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/), combine a [Fast Conformer](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) encoder with a connectionist temporal classification (CTC), recurrent neural network transducer (RNNT), or token and duration transducer (TDT) decoder for automatic speech recognition.
**Model Architecture**
- **Fast Conformer Encoder**: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. This is a more efficient version of the Conformer Encoder found in [FastSpeech2Conformer](./fastspeech2_conformer.md) (see [`ParakeetEncoder`] for the encoder implementation and details).
- [**ParakeetForCTC**](#parakeetforctc): a Fast Conformer Encoder + a CTC decoder
- **CTC Decoder**: Simple but effective decoder consisting of:
- 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility).
- CTC loss computation for training.
- Greedy CTC decoding for inference.
The original implementation can be found in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
Model checkpoints can be found under [the NVIDIA organization](https://huggingface.co/nvidia/models?search=parakeet).
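For a quick transcription test, the `automatic-speech-recognition` pipeline wraps the encoder, CTC head and greedy decoding described above. The checkpoint name below is an assumption (pick a Parakeet CTC checkpoint from the NVIDIA organization that ships Transformers weights), and the audio path is a placeholder.

```python
from transformers import pipeline

# hypothetical checkpoint and audio file -- substitute your own
asr = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-1.1b")
result = asr("speech_sample.wav")
print(result["text"])
```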
@ -189,7 +190,7 @@ outputs.loss.backward()
## ParakeetTokenizerFast
[[autodoc]] ParakeetTokenizerFast
## ParakeetFeatureExtractor
@ -205,11 +206,11 @@ outputs.loss.backward()
## ParakeetEncoderConfig
[[autodoc]] ParakeetEncoderConfig
## ParakeetCTCConfig
[[autodoc]] ParakeetCTCConfig
## ParakeetEncoder
@ -218,4 +219,3 @@ outputs.loss.backward()
## ParakeetForCTC
[[autodoc]] ParakeetForCTC

View File

@ -89,4 +89,4 @@ The model can also be used for time series classification and time series regres
## PatchTSMixerForRegression
[[autodoc]] PatchTSMixerForRegression
- forward

View File

@ -45,6 +45,7 @@ The original code for PhiMoE can be found [here](https://huggingface.co/microsof
<Tip warning={true}>
Phi-3.5-MoE-instruct has been integrated into the development version (4.44.2.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing the following:
* When loading the model, ensure that `trust_remote_code=True` is passed as an argument to the `from_pretrained()` function.
The current `transformers` version can be verified with: `pip list | grep transformers`.
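A minimal loading sketch following the tip above (the repository name is taken from the model card; adjust the dtype and device placement to your hardware):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# trust_remote_code=True is required until the architecture ships in an official release
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```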

View File

@ -79,4 +79,4 @@ The original code can be found [here](https://github.com/google-research/pix2str
## Pix2StructForConditionalGeneration
[[autodoc]] Pix2StructForConditionalGeneration
- forward

View File

@ -120,4 +120,4 @@ it's passed with the `text_target` keyword argument.
## PLBartForCausalLM
[[autodoc]] PLBartForCausalLM
- forward

View File

@ -59,6 +59,7 @@ pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
```
Please note that you may need to restart your runtime after installation.
* Pop2Piano is an Encoder-Decoder based model like T5.
* Pop2Piano can be used to generate midi-audio files for a given audio sequence.
* Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to a variety of different results.
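The hedged sketch below shows where the `composer` argument goes; the audio path is a placeholder and the composer names come from the checkpoint's generation config.

```python
import librosa
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

# placeholder path -- any pop recording works
audio, sr = librosa.load("song.wav", sr=44100)
inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")

# try different composers (e.g. "composer1", "composer2", ...) for different arrangements
model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
midi = processor.batch_decode(token_ids=model_output, feature_extractor_output=inputs)["pretty_midi_objects"][0]
midi.write("arrangement.mid")
```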

View File

@ -99,4 +99,4 @@ If you are interested in submitting a resource to be included here, please feel
[[autodoc]] PromptDepthAnythingImageProcessorFast
- preprocess
- post_process_depth_estimation

View File

@ -19,6 +19,7 @@ rendered properly in your Markdown viewer.
The Qwen3-Next series represents our next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency.
The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:
- **Hybrid Attention**: Replaces standard attention with the combination of **Gated DeltaNet** and **Gated Attention**, enabling efficient context modeling.
- **High-Sparsity MoE**: Achieves an extremely low activation ratio of 1:50 in MoE layers, drastically reducing FLOPs per token while preserving model capacity.
- **Multi-Token Prediction (MTP)**: Boosts pretraining model performance and accelerates inference.
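Despite the architectural changes, the models load through the usual auto classes. A hedged sketch (the checkpoint name is an assumption; pick the Qwen3-Next variant you actually have access to, and note that the 80B models need several GPUs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize what hybrid attention means in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```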

View File

@ -152,4 +152,4 @@ $$D_{i} = e^{u + K_{i} - q} + e^{M_{i}} \tilde{D}_{i} \hbox{ where } q = \max(
which finally gives us
$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$
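For a single channel, the numerically stable recurrence above can be sketched in plain NumPy. This is only an illustration of the max-shift trick (the running maximum plays the role of M_i, `u` is the bonus for the current token, and `w` is the negative per-channel time decay); it is not the fused kernel the library actually uses.

```python
import numpy as np

def wkv_single_channel(k, v, r, w, u):
    """Numerically stable WKV recurrence for one channel (illustrative sketch).

    k, v, r: arrays of per-step keys, values and receptances
    w: negative per-channel time decay, u: bonus added to the current token's key
    """
    num, den, m = 0.0, 0.0, -np.inf   # running N, D and the max of the exponents
    out = np.empty(len(k), dtype=float)
    for i in range(len(k)):
        # output at step i: past state plus the current token weighted by exp(u + k_i)
        q = max(m, u + k[i])
        n_i = np.exp(m - q) * num + np.exp(u + k[i] - q) * v[i]
        d_i = np.exp(m - q) * den + np.exp(u + k[i] - q)
        out[i] = (1.0 / (1.0 + np.exp(-r[i]))) * n_i / d_i   # sigma(R_i) * N_i / D_i
        # fold the current token into the state with the decay w (no bonus here)
        q = max(m + w, k[i])
        num = np.exp(m + w - q) * num + np.exp(k[i] - q) * v[i]
        den = np.exp(m + w - q) * den + np.exp(k[i] - q)
        m = q
    return out
```

Because every exponential is evaluated on a non-positive argument, neither the numerator nor the denominator can overflow.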

View File

@ -139,6 +139,7 @@ The architecture of this new version differs from the first in a few aspects:
#### Improvements on the second-pass model
The second seq2seq model, named the text-to-unit model, is now non-autoregressive, meaning that it computes units in a **single forward pass**. This achievement is made possible by:
- the use of **character-level embeddings**, meaning that each character of the predicted translated text has its own embeddings, which are then used to predict the unit tokens.
- the use of an intermediate duration predictor that predicts speech duration at the **character level** on the predicted translated text.
- the use of a new text-to-unit decoder mixing convolutions and self-attention to handle longer context.
@ -146,6 +147,7 @@ The second seq2seq model, named text-to-unit model, is now non-auto regressive,
#### Difference in the speech encoder
The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms:
- the use of a chunked attention mask to prevent attention across chunks, ensuring that each position attends only to positions within its own chunk and a fixed number of previous chunks.
- the use of relative position embeddings which only consider the distance between sequence elements rather than absolute positions. Please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://huggingface.co/papers/1803.02155) for more details.
- the use of a causal depth-wise convolution instead of a non-causal one.
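To make the chunked attention mask from the first bullet concrete, here is a small sketch that builds such a boolean mask for a given chunk size and number of visible previous chunks. It mirrors the description above and is not the model's internal helper.

```python
import torch

def chunked_attention_mask(seq_len: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """True where position i may attend to position j: j must be in i's chunk
    or in one of the `left_chunks` chunks immediately before it."""
    chunk_ids = torch.arange(seq_len) // chunk_size
    # how many chunks earlier j is relative to i (negative means j is in a later chunk)
    chunk_gap = chunk_ids[:, None] - chunk_ids[None, :]
    return (chunk_gap >= 0) & (chunk_gap <= left_chunks)

print(chunked_attention_mask(seq_len=8, chunk_size=2, left_chunks=1).int())
```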
