Mirror of https://github.com/huggingface/transformers.git (synced 2025-10-20 09:03:53 +08:00)
Fix white space in documentation (#41157)
* Fix white space
* Revert changes
* Fix autodoc

Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>

@@ -278,13 +278,14 @@ are working on it).<br>

useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.<br>
☐ Make sure existing tests pass.<br>
☐ If adding a new feature, also add tests for it.<br>
  - If you are adding a new model, make sure you use
    `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` to trigger the common tests.
  - If you are adding new `@slow` tests, make sure they pass using
    `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`.
  - If you are adding a new tokenizer, write tests and make sure
    `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` passes.
  - CircleCI does not run the slow tests, but GitHub Actions does every night!<br>
☐ All public methods must have informative docstrings (see
  [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py)

@@ -340,6 +341,7 @@ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/t
```

Like the slow tests, there are other environment variables available which are not enabled by default during testing:

- `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.

More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py).

@@ -193,4 +193,4 @@ def custom_attention_mask(

It mostly works thanks to the `mask_function`, which is a `Callable` in the form of [torch's mask_mod functions](https://pytorch.org/blog/flexattention/), taking 4 indices as input and returning a boolean to indicate if this position should take part in the attention computation.

If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py).
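
To make the expected shape of that callable concrete, here is a minimal sketch of a mask function in the mask_mod style (the function name and the window size are illustrative, not part of the API):

```python
# Takes 4 indices (batch, head, query position, key/value position) and returns
# a boolean saying whether this (query, key) pair participates in attention.
def causal_sliding_window_mask(batch_idx, head_idx, q_idx, kv_idx):
    window = 128                           # illustrative sliding-window size
    is_causal = q_idx >= kv_idx            # only look at current and past positions
    in_window = (q_idx - kv_idx) < window  # and stay within the window
    return is_causal & in_window
```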

@@ -210,9 +210,9 @@ There are some rules for documenting different types of arguments and they're li
This can span multiple lines.
```

* Include `type` in backticks.
* Add *optional* if the argument is not required or has a default value.
* Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.

These arguments can also be passed to `@auto_docstring` as a `custom_args` argument. It is used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.
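
For illustration only, an argument block that follows these rules could look like the sketch below (the function and argument descriptions are invented for the example):

```python
def forward(input_ids, attention_mask=None, num_beams=1):
    r"""
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
            This can span multiple lines.
        attention_mask (`torch.Tensor`, *optional*):
            Mask to avoid performing attention on padding token indices.
        num_beams (`int`, *optional*, defaults to 1):
            Number of beams to use for beam search.
    """
```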

@@ -162,6 +162,7 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10)
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].

The legacy format is essentially the same data structure but organized differently.

- It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer.
- The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
- The format is less flexible and doesn't support features like quantization or offloading.
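
As a small sketch of how the two formats relate (the checkpoint is just a convenient small model, and `to_legacy_cache`/`from_legacy_cache` are the conversion helpers exposed by [`DynamicCache`] in recent versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

inputs = tokenizer("The legacy cache format", return_tensors="pt")
out = model(**inputs, use_cache=True)           # out.past_key_values is a Cache object

legacy = out.past_key_values.to_legacy_cache()  # tuple of (key, value) tuples, one per layer
print(len(legacy), legacy[0][0].shape)          # num_layers, [batch_size, num_heads, seq_len, head_dim]

cache = DynamicCache.from_legacy_cache(legacy)  # wrap the legacy tuples back into a Cache
```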

@@ -221,4 +221,4 @@ model_input = tokenizer.apply_chat_template(
    messages,
    tools = [current_time, multiply]
)
```
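
For context, `current_time` and `multiply` in the snippet above are ordinary Python functions exposed as tools; a sketch of what such definitions could look like is shown below (the bodies are made up, and it is the type hints and docstrings that the chat-template machinery reads):

```python
import datetime


def multiply(a: float, b: float) -> float:
    """
    Multiply two numbers together.

    Args:
        a: The first number.
        b: The second number.
    """
    return a * b


def current_time() -> str:
    """Get the current local time as an ISO-formatted string."""
    return datetime.datetime.now().isoformat()
```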

@@ -77,9 +77,9 @@ Mistral-7B-Instruct uses `[INST]` and `[/INST]` tokens to indicate the start and

The input to `apply_chat_template` should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker, and the `content` key contains the message. The common roles are:

- `user` for messages from the user
- `assistant` for messages from the model
- `system` for directives on how the model should act (usually placed at the beginning of the chat)

[`apply_chat_template`] takes this list and returns a formatted sequence. Set `tokenize=True` if you want to tokenize the sequence.
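
Putting the pieces together, a minimal sketch (the checkpoint name is illustrative and may require access approval on the Hub):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "And of Italy?"},
]

# Returns the formatted string (wrapped in [INST]...[/INST] for this checkpoint);
# pass tokenize=True to get token ids instead.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```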

@@ -21,6 +21,7 @@ where `port` is the port used by `transformers serve` (`8000` by default). On th
</h3>

You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order:

1. Unselect ALL models in the list above (e.g. `gpt4`, ...);
2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`)
3. Add some random text to OpenAI API Key. This field won't be used, but it can't be empty;

@@ -229,6 +229,7 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True)
## Custom generation methods

Custom generation methods enable specialized behavior such as:

- have the model continue thinking if it is uncertain;
- roll back generation if the model gets stuck;
- handle special tokens with custom logic;

@@ -301,6 +302,7 @@ Updating your Python requirements accordingly will remove this error message.
### Creating a custom generation method

To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.

1. The model you've designed your generation method with.
2. `custom_generate/generate.py`, which contains all the logic for your custom generation method.
3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.

@@ -377,6 +379,7 @@ def generate(model, input_ids, generation_config=None, left_padding=None, **kwar
```

Follow the recommended practices below to ensure your custom generation method works as expected.

- Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
- Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
- Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
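
To make the layout concrete, a deliberately simple sketch of what a `custom_generate/generate.py` entry point could contain is shown below (a plain greedy loop, not a reference implementation; the extra keyword arguments are illustrative):

```python
import torch


def generate(model, input_ids, generation_config=None, max_new_tokens=20, **kwargs):
    """Toy greedy decoding loop, used only to illustrate the expected entry point."""
    generated = input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids=generated).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated
```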

@@ -410,6 +413,7 @@ tags:
```

Recommended practices:

- Document input and output differences in [`~GenerationMixin.generate`].
- Add self-contained examples to enable quick experimentation.
- Describe soft-requirements such as if the method only works well with a certain family of models.

@@ -442,6 +446,7 @@ output = model.generate(
### Finding custom generation methods

You can find all custom generation methods by [searching for their custom tag](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods:

- [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community;
- [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`.

@@ -185,9 +185,9 @@ See the [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/

The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example:

* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`].
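
As an illustration (the checkpoint name is just an example), the same backbone can be loaded with or without a task head:

```python
from transformers import AutoModel, AutoModelForSequenceClassification

# Base model: returns raw hidden states only.
base_model = AutoModel.from_pretrained("openai-community/gpt2")

# Same backbone with a sequence classification head (a linear layer) on top.
classifier = AutoModelForSequenceClassification.from_pretrained("openai-community/gpt2", num_labels=2)
```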

## I

@@ -149,4 +149,4 @@ Call [print_trainable_parameters](https://huggingface.co/docs/peft/package_refer
```py
model.print_trainable_parameters()
"trainable params: 589,824 || all params: 94,274,096 || trainable%: 0.6256"
```

@@ -218,9 +218,9 @@ path reference to the associated `.safetensors` file. Each tensor is written to
the state dictionary. File names are constructed using the `module_path` as a prefix with a few possible postfixes that
are built recursively.

* Module inputs are denoted with `_inputs` and outputs with `_outputs`.
* `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
* `dict` instances will be postfixed with `_{key}`.

### Comparing between implementations

@@ -255,6 +255,7 @@ how many tests are being skipped and for which models.
When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip?

This utility:

- scans all test_modeling_common methods
- looks for times where a method is skipped
- returns a summary json you can load as a DataFrame/inspect

@@ -94,6 +94,7 @@ model.generate(**inputs, num_beams=4, do_sample=True)
```

[`~GenerationMixin.generate`] can also be extended with external libraries or custom code:

1. the `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution;
2. the `stopping_criteria` parameter supports custom [`StoppingCriteria`] to stop text generation;
3. other custom generation methods can be loaded through the `custom_generate` flag ([docs](generation_strategies.md/#custom-decoding-methods)).
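
For example, a toy custom [`LogitsProcessor`] could look like the sketch below (the processor simply forbids one token id at each step; the checkpoint name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor, LogitsProcessorList


class BlockTokenProcessor(LogitsProcessor):
    """Toy processor that masks out a single token id at every generation step."""

    def __init__(self, blocked_token_id: int):
        self.blocked_token_id = blocked_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.blocked_token_id] = -float("inf")
        return scores


tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
inputs = tokenizer("Plants create energy through", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    logits_processor=LogitsProcessorList([BlockTokenProcessor(tokenizer.eos_token_id)]),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```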

@@ -80,6 +80,7 @@ We use both in the `transformers` library. We leverage and adapt `logging`'s `ca
management of these warning messages by the verbosity setters above.

What does that mean for developers of the library? We should respect the following heuristics:

- `warnings` should be favored for developers of the library and libraries dependent on `transformers`
- `logging` should be used for end-users of the library using it in every-day projects
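
In code, the two paths look roughly like this sketch, following the heuristics above (the messages are invented):

```python
import warnings

from transformers.utils import logging

logger = logging.get_logger(__name__)

# Aimed at developers of the library and libraries built on top of it (e.g. deprecations):
warnings.warn("`old_argument` is deprecated and will be removed in a future version.", FutureWarning)

# Aimed at end-users running the library in everyday projects, controlled by the verbosity setters:
logger.warning("The tokenizer has no pad token; falling back to the EOS token.")
```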

@@ -17,6 +17,7 @@ rendered properly in your Markdown viewer.
# Processors

Processors can mean two different things in the Transformers library:

- the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text)
  or [CLIP](../model_doc/clip) (text and vision)
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD.
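
In the first sense, a processor just bundles the modality-specific preprocessing steps behind one object; a quick sketch with CLIP (the checkpoint and image URL are the usual documentation examples):

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
print(inputs.keys())  # input_ids, attention_mask, pixel_values
```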

@@ -30,15 +30,15 @@ like token streaming.
## GenerationConfig

[[autodoc]] generation.GenerationConfig
    - from_pretrained
    - from_model_config
    - save_pretrained
    - update
    - validate
    - get_generation_mode

## GenerationMixin

[[autodoc]] GenerationMixin
    - generate
    - compute_transition_scores

@@ -148,6 +148,7 @@ for label, score in zip(candidate_labels, probs):
```

## Resources

- Refer to the [Kakao Brain’s Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details.

## AlignConfig

@@ -102,4 +102,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
## ArceeForTokenClassification

[[autodoc]] ArceeForTokenClassification
    - forward

@@ -123,6 +123,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- See also: [Image classification task guide](../tasks/image_classification)

**Semantic segmentation**

- [Semantic segmentation task guide](../tasks/semantic_segmentation)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

@@ -156,4 +156,4 @@ print(tokenizer.decode(outputs[0]))
## BertGenerationDecoder

[[autodoc]] BertGenerationDecoder
    - forward

@@ -88,6 +88,7 @@ echo -e "Plants create <mask> through a process known as photosynthesis." | tran
</hfoptions>

## Notes

- Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library.
- Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings.

@@ -87,6 +87,7 @@ print(f"The predicted token is: {predicted_token}")
</hfoptions>

## Notes

- Inputs should be padded on the right because BigBird uses absolute position embeddings.
- BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs.
- The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`.

@@ -36,6 +36,7 @@ The original code can be found [here](https://github.com/google-research/big_tra
## Usage tips

- BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494),
  2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant
  impact on transfer learning.

@@ -72,4 +73,4 @@ If you're interested in submitting a resource to be included here, please feel f
## BitForImageClassification

[[autodoc]] BitForImageClassification
    - forward

@@ -38,22 +38,22 @@ Several versions of the model weights are available on Hugging Face:
### Model Details

* **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
    * Uses Rotary Position Embeddings (RoPE).
    * Uses squared ReLU (ReLU²) activation in FFN layers.
    * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization.
    * No bias terms in linear or normalization layers.
* **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8).
    * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
    * Activations are quantized to 8-bit integers using absmax quantization (per-token).
    * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
* **Parameters:** ~2 Billion
* **Training Tokens:** 4 Trillion
* **Context Length:** Maximum sequence length of **4096 tokens**.
    * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
* **Training Stages:**
    1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
    2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
    3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
* **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256).

## Usage tips

@@ -128,7 +128,7 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam
## BlipTextLMHeadModel

[[autodoc]] BlipTextLMHeadModel
    - forward

## BlipVisionModel

@@ -43,16 +43,19 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).

See also:

- [Causal language modeling task guide](../tasks/language_modeling)
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)

⚡️ Inference

- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).

⚙️ Training

- A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).

## BloomConfig

@@ -16,10 +16,10 @@ rendered properly in your Markdown viewer.
*This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# CamemBERT

@@ -119,4 +119,4 @@ Currently, following scales of pretrained Chinese-CLIP models are available on 
## ChineseCLIPVisionModel

[[autodoc]] ChineseCLIPVisionModel
    - forward

@@ -106,4 +106,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
## CLIPSegForImageSegmentation

[[autodoc]] CLIPSegForImageSegmentation
    - forward

@@ -122,6 +122,7 @@ visualizer("Plants create energy through a process known as")
</div>

## Notes

- Don't use the dtype parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast).

## CohereConfig

@@ -49,4 +49,4 @@ This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The ori
## CpmAntForCausalLM

[[autodoc]] CpmAntForCausalLM
    - all

@@ -103,6 +103,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`Data2VecVisionForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).

**Data2VecText documentation resources**

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)

@@ -111,10 +112,12 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [Multiple choice task guide](../tasks/multiple_choice)

**Data2VecAudio documentation resources**

- [Audio classification task guide](../tasks/audio_classification)
- [Automatic speech recognition task guide](../tasks/asr)

**Data2VecVision documentation resources**

- [Image classification](../tasks/image_classification)
- [Semantic segmentation](../tasks/semantic_segmentation)

@@ -92,6 +92,7 @@ echo -e '{"text": "A soccer game with multiple people playing.", "text_pair": "S
</hfoptions>

## Notes

- DeBERTa uses **relative position embeddings**, so it does not require **right-padding** like BERT.
- For best results, use DeBERTa on sentence-level or sentence-pair classification tasks like MNLI, RTE, or SST-2.
- If you're using DeBERTa for token-level tasks like masked language modeling, make sure to load a checkpoint specifically pretrained or fine-tuned for token-level tasks.

@@ -47,4 +47,4 @@ The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures f
## DeepseekV2ForSequenceClassification

[[autodoc]] DeepseekV2ForSequenceClassification
    - forward

@@ -16,9 +16,9 @@ rendered properly in your Markdown viewer.
*This model was released on 2020-10-08 and added to Hugging Face Transformers on 2022-09-14.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# Deformable DETR

@@ -68,4 +68,4 @@ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, nu

DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct).

</Tip>

@@ -86,4 +86,4 @@ Image.fromarray(depth.astype("uint8"))
## DepthAnythingForDepthEstimation

[[autodoc]] DepthAnythingForDepthEstimation
    - forward

@@ -110,4 +110,4 @@ If you're interested in submitting a resource to be included here, please feel f
## DepthAnythingForDepthEstimation

[[autodoc]] DepthAnythingForDepthEstimation
    - forward

@@ -84,12 +84,13 @@ alt="drawing" width="600"/>
The `DepthProForDepthEstimation` model uses a `DepthProEncoder` for encoding the input image and a `FeatureFusionStage` for fusing the output features from the encoder.

The `DepthProEncoder` further uses two encoders:

- `patch_encoder`
    - Input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
    - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
    - These patches are processed by the **`patch_encoder`**
- `image_encoder`
    - Input image is also rescaled to `patch_size` and processed by the **`image_encoder`**

Both these encoders can be configured via `patch_model_config` and `image_model_config` respectively, both of which are separate `Dinov2Model` by default.

@@ -159,8 +160,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- Official Implementation: [apple/ml-depth-pro](https://github.com/apple/ml-depth-pro)
- DepthPro Inference Notebook: [DepthPro Inference](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DepthPro_inference.ipynb)
- DepthPro for Super Resolution and Image Segmentation
    - Read blog on Medium: [Depth Pro: Beyond Depth](https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba)
    - Code on Github: [geetu040/depthpro-beyond-depth](https://github.com/geetu040/depthpro-beyond-depth)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

@@ -16,9 +16,9 @@ rendered properly in your Markdown viewer.
*This model was released on 2020-05-26 and added to Hugging Face Transformers on 2021-06-09.*

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# DETR

@@ -65,6 +65,7 @@ DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.

Notes:

- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
  You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
  Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.

@@ -25,6 +25,7 @@ The [Vision Transformer](vit) (ViT) is a transformer encoder model (BERT-like) o
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include [DINOv2](dinov2) and [MAE](vit_mae).

The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It's due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called "register" tokens), which you only use during pre-training (and throw away afterwards). This results in:

- no artifacts
- interpretable attention maps
- and improved performances.

@@ -57,4 +58,4 @@ The original code can be found [here](https://github.com/facebookresearch/dinov2
## Dinov2WithRegistersForImageClassification

[[autodoc]] Dinov2WithRegistersForImageClassification
    - forward

@@ -101,4 +101,4 @@ outputs = model.generate(
## DogeForSequenceClassification

[[autodoc]] DogeForSequenceClassification
    - forward

@@ -44,9 +44,9 @@ This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The ori

- DPR consists of three models (see the loading sketch after this list):

    * Question encoder: encode questions as vectors
    * Context encoder: encode contexts as vectors
    * Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
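
A minimal sketch of loading two of these components with the standard NQ checkpoints (the reader is omitted for brevity, and the relevance score here is simply the dot product of the two vectors):

```python
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_vector = q_encoder(**q_tokenizer("What is the capital of France?", return_tensors="pt")).pooler_output

ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
context_vector = ctx_encoder(**ctx_tokenizer("Paris is the capital of France.", return_tensors="pt")).pooler_output

relevance = (question_vector @ context_vector.T).item()  # higher means more relevant
```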

## DPRConfig

@@ -144,27 +144,23 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## EfficientLoFTRImageProcessor

[[autodoc]] EfficientLoFTRImageProcessor
    - preprocess
    - post_process_keypoint_matching
    - visualize_keypoint_matching

## EfficientLoFTRImageProcessorFast

[[autodoc]] EfficientLoFTRImageProcessorFast
    - preprocess
    - post_process_keypoint_matching
    - visualize_keypoint_matching

## EfficientLoFTRModel

[[autodoc]] EfficientLoFTRModel
    - forward

## EfficientLoFTRForKeypointMatching

[[autodoc]] EfficientLoFTRForKeypointMatching
    - forward

@@ -207,4 +207,4 @@ plt.show()
## EomtForUniversalSegmentation

[[autodoc]] EomtForUniversalSegmentation
    - forward

@@ -204,4 +204,4 @@ print(tokenizer.decode(output[0]))
## Exaone4ForQuestionAnswering

[[autodoc]] Exaone4ForQuestionAnswering
    - forward

@@ -30,5 +30,6 @@ Depth up-scaling for improved reasoning: Building on recent studies on the effec
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, we developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.

## Resources

- [Blog post](https://huggingface.co/blog/falcon3)
- [Models on Huggingface](https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026)

@@ -60,4 +60,4 @@ print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
[[autodoc]] FalconH1ForCausalLM
    - forward

This HF implementation is contributed by [younesbelkada](https://github.com/younesbelkada) and [DhiaEddineRhaiem](https://github.com/dhiaEddineRhaiem).

@@ -44,6 +44,7 @@ community for further reproducible experiments in French NLP.*
This model was contributed by [formiel](https://huggingface.co/formiel). The original code can be found [here](https://github.com/getalp/Flaubert).

Tips:

- Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).

## Resources

@@ -138,21 +138,21 @@ print(parsed_answer)
## Notes

- Florence-2 is a prompt-based model. You need to provide a task prompt to tell the model what to do. Supported tasks are:
    - `<OCR>`
    - `<OCR_WITH_REGION>`
    - `<CAPTION>`
    - `<DETAILED_CAPTION>`
    - `<MORE_DETAILED_CAPTION>`
    - `<OD>`
    - `<DENSE_REGION_CAPTION>`
    - `<CAPTION_TO_PHRASE_GROUNDING>`
    - `<REFERRING_EXPRESSION_SEGMENTATION>`
    - `<REGION_TO_SEGMENTATION>`
    - `<OPEN_VOCABULARY_DETECTION>`
    - `<REGION_TO_CATEGORY>`
    - `<REGION_TO_DESCRIPTION>`
    - `<REGION_TO_OCR>`
    - `<REGION_PROPOSAL>`
- The raw output of the model is a string that needs to be parsed. The [`Florence2Processor`] has a [`~Florence2Processor.post_process_generation`] method that can parse the string into a more usable format, like bounding boxes and labels for object detection (see the sketch after this list).
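
A rough sketch of that parsing step, assuming `processor`, `image`, and `generated_ids` from the example earlier in the page; the exact argument names follow the released Florence-2 processor and may differ slightly across versions:

```python
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Parse the raw string into structured output for the task prompt that was used.
parsed_answer = processor.post_process_generation(
    generated_text,
    task="<OD>",                             # same task prompt that was fed to the model
    image_size=(image.width, image.height),  # used to rescale the predicted boxes
)
print(parsed_answer)  # e.g. {"<OD>": {"bboxes": [...], "labels": [...]}}
```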

## Resources

@@ -121,9 +121,9 @@ echo -e "Plants create energy through a process known as" | transformers run --t

## Notes

- Use [`Gemma3nForConditionalGeneration`] for image-audio-and-text, image-and-text, image-and-audio, audio-and-text,
  image-only and audio-only inputs.
- Gemma 3n supports multiple images per input, but make sure the images are correctly batched before passing them to
  the processor. Each batch should be a list of one or more images.

```py

@@ -148,11 +148,11 @@ echo -e "Plants create energy through a process known as" | transformers run --t
]
```

- Text passed to the processor should have a `<image_soft_token>` token wherever an image should be inserted.
- Gemma 3n accepts at most one target audio clip per input, though multiple audio clips can be provided in few-shot
  prompts, for example.
- Text passed to the processor should have a `<audio_soft_token>` token wherever an audio clip should be inserted.
- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs.

## Gemma3nAudioFeatureExtractor

@@ -81,4 +81,4 @@ The resource should ideally demonstrate something new instead of duplicating an
## GitForCausalLM

[[autodoc]] GitForCausalLM
    - forward

@@ -35,6 +35,7 @@ Through our open-source work, we aim to explore the technological frontier toget


Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including:

- **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition)
- **Video understanding** (long video segmentation and event recognition)
- **GUI tasks** (screen reading, icon recognition, desktop operation assistance)

@@ -36,6 +36,7 @@ The model is an optimized [GPT2 model](https://huggingface.co/docs/transformers/
## Implementation details

The main differences compared to GPT2:

- Added support for Multi-Query Attention.
- Use `gelu_pytorch_tanh` instead of classic `gelu`.
- Avoid unnecessary synchronizations (this has since been added to GPT2 in #20061, but wasn't in the reference codebase).

@@ -133,6 +133,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- [`GPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling), [text generation example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation), and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).

**Documentation resources**

- [Text classification task guide](../tasks/sequence_classification)
- [Question answering task guide](../tasks/question_answering)
- [Causal language modeling task guide](../tasks/language_modeling)

@@ -37,6 +37,7 @@ Note that most of the aforementioned components are implemented generically to e
This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9944), [Avihu Dekel](https://huggingface.co/Avihu), and [George Saon](https://huggingface.co/gsaon).

## Usage tips

- This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied!

<!-- TODO (@alex-jw-brooks) Add an example here once the model compatible with the transformers implementation is released -->

@@ -62,4 +62,4 @@ This HF implementation is contributed by [Mayank Mishra](https://huggingface.co/
## GraniteMoeSharedForCausalLM

[[autodoc]] GraniteMoeSharedForCausalLM
    - forward

@@ -22,6 +22,7 @@ rendered properly in your Markdown viewer.
The [Granite Vision](https://www.ibm.com/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more) model is a variant of [LLaVA-NeXT](llava_next), leveraging a [Granite](granite) language model alongside a [SigLIP](SigLIP) visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to [VipLlava](vipllava). It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios.

Tips:

- This model is loaded into Transformers as an instance of LlaVA-Next. The usage and tips from [LLaVA-NeXT](llava_next) apply to this model as well.

- You can apply the chat template on the tokenizer / processor in the same way as well. Example chat format:

@@ -89,4 +89,4 @@ print(f"The predicted class label is: {predicted_class_label}")
## HGNetV2ForImageClassification

[[autodoc]] HGNetV2ForImageClassification
    - forward

@@ -52,4 +52,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
## InformerForPrediction

[[autodoc]] InformerForPrediction
    - forward

@@ -77,4 +77,4 @@ The attributes can be obtained from model config, as `model.config.num_query_tok

[[autodoc]] InstructBlipForConditionalGeneration
    - forward
    - generate

@@ -19,6 +19,7 @@ rendered properly in your Markdown viewer.
## Overview

[Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai's lab has released two model checkpoints:

- [kyutai/stt-1b-en_fr](https://huggingface.co/kyutai/stt-1b-en_fr): a 1B-parameter model capable of transcribing both English and French
- [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en): a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

@@ -37,8 +37,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
## Usage tips

- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
    - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
    - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
  Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model (see the sketch after this list).
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
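
A minimal sketch of that combined preprocessing (the file name, words, and boxes are made up; the processor is built from its two parts with the built-in OCR turned off so the words and 0-1000 normalized boxes can be supplied directly):

```python
from PIL import Image
from transformers import LayoutLMv3ImageProcessor, LayoutLMv3Processor, LayoutLMv3TokenizerFast

image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)  # we provide words and boxes ourselves
tokenizer = LayoutLMv3TokenizerFast.from_pretrained("microsoft/layoutlmv3-base")
processor = LayoutLMv3Processor(image_processor, tokenizer)

image = Image.open("document.png").convert("RGB")
words = ["Invoice", "Total:", "42.00"]
boxes = [[10, 10, 120, 40], [10, 60, 90, 90], [100, 60, 180, 90]]  # one box per word

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, bbox, pixel_values
```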

@@ -73,6 +73,7 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
- [Question answering task guide](../tasks/question_answering)

**Document question answering**

- [Document question answering task guide](../tasks/document_question_answering)

## LayoutLMv3Config

@@ -82,4 +82,4 @@ print(tokenizer.decode(output[0], skip_special_tokens=False))
## Lfm2ForCausalLM

[[autodoc]] Lfm2ForCausalLM
    - forward

@@ -28,6 +28,7 @@ rendered properly in your Markdown viewer.
## Architecture

LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

* Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B
* Base (86M) for fast image processing for LFM2-VL-450M

@@ -143,13 +143,11 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size
## LightGlueImageProcessor

[[autodoc]] LightGlueImageProcessor
    - preprocess
    - post_process_keypoint_matching
    - visualize_keypoint_matching

## LightGlueForKeypointMatching

[[autodoc]] LightGlueForKeypointMatching
    - forward

@@ -62,6 +62,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
- Demo notebooks for LiLT can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT).

**Documentation resources**

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)

@@ -27,9 +27,11 @@ rendered properly in your Markdown viewer.

[Llama 4](https://ai.meta.com/blog/llama-4-multimodal-intelligence/), developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture.
This generation includes two models:

- The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
- The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.

Both models leverage early fusion for native multimodality, enabling them to process text and image inputs.
Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages
(with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

@@ -421,24 +423,24 @@ model = Llama4ForConditionalGeneration.from_pretrained(
## Llama4ForConditionalGeneration

[[autodoc]] Llama4ForConditionalGeneration
    - forward

## Llama4ForCausalLM

[[autodoc]] Llama4ForCausalLM
    - forward

## Llama4TextModel

[[autodoc]] Llama4TextModel
    - forward

## Llama4VisionModel

[[autodoc]] Llama4VisionModel
    - forward

@@ -57,6 +57,7 @@ The attributes can be obtained from model config, as `model.config.vision_config
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method.

**Important:**

- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

@@ -64,6 +64,7 @@ The attributes can be obtained from model config, as `model.config.vision_config
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method.

**Important:**

- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

@@ -59,6 +59,7 @@ Tips:
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method.

**Important:**

- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.

@@ -30,6 +30,7 @@ performance, similar to [LayoutLM](layoutlm).

The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains
state-of-the-art results on 2 important benchmarks:

- [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)
- [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset
  for information extraction from web pages (basically named-entity recognition on web pages)

@@ -86,4 +86,4 @@ The resource should ideally demonstrate something new instead of duplicating an
    - preprocess
    - post_process_semantic_segmentation
    - post_process_instance_segmentation
    - post_process_panoptic_segmentation

@@ -44,7 +44,7 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The

## Usage tips

- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters). A configuration sketch is shown after this list.
- If you want to train the model in a distributed environment across multiple nodes, then one should update the
  `get_num_masks` function inside the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be
  set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169).
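
For instance, a sketch of enabling the auxiliary losses mentioned above (all other configuration values are left at their defaults):

```python
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

# Enable the per-decoder-layer auxiliary losses described above.
config = MaskFormerConfig(use_auxiliary_loss=True)
model = MaskFormerForInstanceSegmentation(config)
```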

@@ -102,4 +102,4 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The
## MaskFormerForInstanceSegmentation

[[autodoc]] MaskFormerForInstanceSegmentation
    - forward

@@ -79,4 +79,4 @@ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, nu

MatCha is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct).

</Tip>

@@ -35,12 +35,12 @@ The architecture of MiniMax is briefly described as follows:
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: a softmax attention is positioned after every 7 lightning attention layers.
    - Number of attention heads: 64
    - Attention head dimension: 128
- Mixture of Experts:
    - Number of experts: 32
    - Expert hidden dimension: 9216
    - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064

@@ -83,4 +83,4 @@ The example below demonstrates how to use Ministral for text generation:
## MinistralForQuestionAnswering

[[autodoc]] MinistralForQuestionAnswering
    - forward

@@ -163,4 +163,4 @@ Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/bl
## MistralForQuestionAnswering

[[autodoc]] MistralForQuestionAnswering
    - forward
@ -42,6 +42,7 @@ Mixtral-8x7B is a decoder-only Transformer with the following architectural choi

- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts has to be loaded in RAM (a 70B-like RAM requirement), each token from the hidden states is dispatched twice (top-2 routing), and thus the compute (the operation required at each forward computation) is just 2 x sequence_length.

The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral):

- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
- GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.
@ -55,6 +56,7 @@ For more details refer to the [release blog post](https://mistral.ai/news/mixtra

## Usage tips

The Mistral team has released 2 checkpoints:

- a base model, [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), which has been pre-trained to predict the next token on internet-scale data.
- an instruction tuned model, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO).
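
For example, a minimal chat-generation sketch with the instruction-tuned checkpoint above (the dtype and device settings are illustrative and assume enough GPU memory):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```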
@ -85,10 +85,10 @@ print(f"The predicted class label is: {predicted_class_label}")

## Notes

- Checkpoint names follow the pattern `mobilenet_v1_{depth_multiplier}_{resolution}`, like `mobilenet_v1_1.0_224`. `1.0` is the depth multiplier and `224` is the image resolution.
- While trained on images of a specific size, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV1ImageProcessor`] handles the necessary preprocessing.
- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0).
- The original TensorFlow checkpoints determine the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV1Config`].

```python
from transformers import MobileNetV1Config
@ -96,11 +96,11 @@ print(f"The predicted class label is: {predicted_class_label}")

config = MobileNetV1Config.from_pretrained("google/mobilenet_v1_1.0_224", tf_padding=True)
```

- The Transformers implementation does not support the following features.
  - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel.
  - Does not support other `output_stride` values (fixed at 32). For smaller `output_stride` values, the original implementation uses dilated convolutions to prevent the spatial resolution from being reduced further.
  - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes.
  - Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights.

## MobileNetV1Config
@ -82,11 +82,11 @@ print(f"The predicted class label is: {predicted_class_label}")

## Notes

- Classification checkpoint names follow the pattern `mobilenet_v2_{depth_multiplier}_{resolution}`, like `mobilenet_v2_1.4_224`. `1.4` is the depth multiplier and `224` is the image resolution. Segmentation checkpoint names follow the pattern `deeplabv3_mobilenet_v2_{depth_multiplier}_{resolution}`.
- While trained on images of a specific size, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV2ImageProcessor`] handles the necessary preprocessing.
- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0).
- The segmentation models use a [DeepLabV3+](https://huggingface.co/papers/1802.02611) head which is often pretrained on datasets like [PASCAL VOC](https://huggingface.co/datasets/merve/pascal-voc).
- The original TensorFlow checkpoints determine the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV2Config`].

```python
from transformers import MobileNetV2Config
@ -94,11 +94,11 @@ print(f"The predicted class label is: {predicted_class_label}")

config = MobileNetV2Config.from_pretrained("google/mobilenet_v2_1.4_224", tf_padding=True)
```

- The Transformers implementation does not support the following features.
  - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel.
  - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes.
  - Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights.
  - For segmentation models, the final convolution layer of the backbone is computed even though the DeepLabV3+ head doesn't use it.

## MobileNetV2Config
@ -28,6 +28,7 @@ Unless required by applicable law or agreed to in writing, software distributed

You can find all the original MobileViT checkpoints under the [Apple](https://huggingface.co/apple/models?search=mobilevit) organization.

> [!TIP]
>
> - This model was contributed by [matthijs](https://huggingface.co/Matthijs).
>
> Click on the MobileViT models in the right sidebar for more examples of how to apply MobileViT to different vision tasks.
@ -38,6 +38,7 @@ The abstract from the paper is the following:

*We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.*

Moshi deals with 3 streams of information:

1. The user's audio
2. Moshi's audio
3. Moshi's textual output
@ -70,6 +71,7 @@ The original checkpoints can be converted using the conversion script `src/trans

### How to use the model:

This implementation has two main aims:

1. quickly test model generation by simplifying the original API
2. simplify training. A training guide will come soon, but user contributions are welcomed!
@ -84,6 +86,7 @@ It is designed for intermediate use. We strongly recommend using the original [i

Moshi is a streaming auto-regressive model with two streams of audio. To put it differently, one audio stream corresponds to what the model said/will say and the other audio stream corresponds to what the user said/will say.

[`MoshiForConditionalGeneration.generate`] thus needs 3 inputs:

1. `input_ids` - corresponding to the text token history
2. `moshi_input_values` or `moshi_audio_codes` - corresponding to the model audio history
3. `user_input_values` or `user_audio_codes` - corresponding to the user audio history
@ -91,6 +94,7 @@ Moshi is a streaming auto-regressive model with two streams of audio. To put it

These three inputs must be synchronized, meaning that their lengths must correspond to the same number of tokens.

You can dynamically use the 3 inputs depending on what you want to test:

1. Simply check the model response to a user prompt - in that case, `input_ids` can be filled with pad tokens and `user_input_values` can be a zero tensor of the same shape as the user prompt.
2. Test more complex behaviour - in that case, you must be careful about how the input tokens are synchronized with the audios.
@ -64,4 +64,4 @@ The original code can be found [here](https://github.com/mlpen/mra-attention).

## MraForQuestionAnswering

[[autodoc]] MraForQuestionAnswering
    - forward
@ -230,6 +230,7 @@ generation config.

## Model Structure

The MusicGen model can be de-composed into three distinct stages:

1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5
2. MusicGen decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations
3. Audio encoder/decoder: used to encode an audio prompt to use as prompt tokens, and recover the audio waveform from the audio tokens predicted by the decoder
@ -256,6 +257,7 @@ be combined with the frozen text encoder and audio encoder/decoders to recover t

model.

Tips:

* MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model.
* Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenForConditionalGeneration.generate`]
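
For reference, a short text-to-audio sketch using the small checkpoint and the sampling mode recommended above (the prompt and generation settings are illustrative):

```python
import scipy
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)

# do_sample=True (see the tip above) generally sounds better than greedy decoding.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```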
@ -40,6 +40,7 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o

## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)

There are two key differences with MusicGen:

1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen.
@ -224,6 +225,7 @@ Note that any arguments passed to the generate method will **supersede** those i

## Model Structure

The MusicGen model can be de-composed into three distinct stages:

1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5.
2. MusicGen Melody decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations
3. Audio decoder: used to recover the audio waveform from the audio tokens predicted by the decoder.
@ -253,6 +255,7 @@ python src/transformers/models/musicgen_melody/convert_musicgen_melody_transform

```

Tips:

* MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model.
* Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenMelodyForConditionalGeneration.generate`]
@ -68,6 +68,7 @@ The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, widt

`(batch_size, height, width, num_channels)`.

Notes:

- NAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention.
  You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten),
  or build on your system by running `pip install natten`.
@ -128,9 +128,9 @@ visualizer("UN Chief says there is no military solution in Syria")

>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
```

- For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword as shown below.
- See example below for a translation from Romanian to German.

```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
@ -39,7 +39,7 @@ This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3

## Usage tips

- OneFormer requires two inputs during inference: *image* and *task token* (see the example after these tips).
- During training, OneFormer only uses panoptic annotations.
- If you want to train the model in a distributed environment across multiple nodes, you should update the
  `get_num_masks` function inside the `OneFormerLoss` class of `modeling_oneformer.py`. When training on multiple nodes, this should be
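
A minimal inference sketch showing the two required inputs, assuming the `shi-labs/oneformer_ade20k_swin_tiny` checkpoint:

```python
import requests
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The two inputs mentioned above: the image and a task token ("semantic", "instance" or "panoptic").
inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
outputs = model(**inputs)

predicted_map = processor.post_process_semantic_segmentation(outputs, target_sizes=[(image.height, image.width)])[0]
```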
@ -84,22 +84,22 @@ echo -e "The future of AI is" | transformers run --task text-generation --model

## OpenAIGPTModel

[[autodoc]] OpenAIGPTModel
    - forward

## OpenAIGPTLMHeadModel

[[autodoc]] OpenAIGPTLMHeadModel
    - forward

## OpenAIGPTDoubleHeadsModel

[[autodoc]] OpenAIGPTDoubleHeadsModel
    - forward

## OpenAIGPTForSequenceClassification

[[autodoc]] OpenAIGPTForSequenceClassification
    - forward

## OpenAIGPTTokenizer
@ -27,12 +27,13 @@ rendered properly in your Markdown viewer.

Parakeet models, [introduced by NVIDIA NeMo](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/), combine a [Fast Conformer](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) encoder with a connectionist temporal classification (CTC), recurrent neural network transducer (RNNT) or token and duration transducer (TDT) decoder for automatic speech recognition.

**Model Architecture**

- **Fast Conformer Encoder**: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. This is a more efficient version of the Conformer Encoder found in [FastSpeech2Conformer](./fastspeech2_conformer.md) (see [`ParakeetEncoder`] for the encoder implementation and details).
- [**ParakeetForCTC**](#parakeetforctc): a Fast Conformer Encoder + a CTC decoder
  - **CTC Decoder**: Simple but effective decoder consisting of:
    - 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility).
    - CTC loss computation for training.
    - Greedy CTC decoding for inference.

The original implementation can be found in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
Model checkpoints can be found under [the NVIDIA organization](https://huggingface.co/nvidia/models?search=parakeet).
@ -189,7 +190,7 @@ outputs.loss.backward()

## ParakeetTokenizerFast

[[autodoc]] ParakeetTokenizerFast

## ParakeetFeatureExtractor
@ -205,11 +206,11 @@ outputs.loss.backward()

## ParakeetEncoderConfig

[[autodoc]] ParakeetEncoderConfig

## ParakeetCTCConfig

[[autodoc]] ParakeetCTCConfig

## ParakeetEncoder
@ -218,4 +219,3 @@ outputs.loss.backward()

## ParakeetForCTC

[[autodoc]] ParakeetForCTC
@ -89,4 +89,4 @@ The model can also be used for time series classification and time series regres

## PatchTSMixerForRegression

[[autodoc]] PatchTSMixerForRegression
    - forward
@ -45,6 +45,7 @@ The original code for PhiMoE can be found [here](https://huggingface.co/microsof

<Tip warning={true}>

Phi-3.5-MoE-instruct has been integrated in the development version (4.44.2.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing the following:

* When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function.

The current `transformers` version can be verified with: `pip list | grep transformers`.
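
A minimal loading sketch following the note above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is required as long as your installed transformers release
# does not yet ship the PhiMoE code natively (see the note above).
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```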
@ -79,4 +79,4 @@ The original code can be found [here](https://github.com/google-research/pix2str

## Pix2StructForConditionalGeneration

[[autodoc]] Pix2StructForConditionalGeneration
    - forward
@ -120,4 +120,4 @@ it's passed with the `text_target` keyword argument.

## PLBartForCausalLM

[[autodoc]] PLBartForCausalLM
    - forward
@ -59,6 +59,7 @@ pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy

```

Please note that you may need to restart your runtime after installation.

* Pop2Piano is an Encoder-Decoder based model like T5.
* Pop2Piano can be used to generate midi-audio files for a given audio sequence.
* Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to a variety of different results.
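
A sketch along the lines of the documented usage, assuming the `sweetcocoa/pop2piano` checkpoint and a local audio file (the file path and composer choice are placeholders):

```python
import librosa
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

# "song.mp3" is a placeholder for any pop recording you want to transcribe.
audio, sr = librosa.load("song.mp3", sr=44100)
inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")

# Try different composer tokens ("composer1", "composer2", ...) for different arrangements.
generated_ids = model.generate(input_features=inputs["input_features"], composer="composer1")
midi = processor.batch_decode(token_ids=generated_ids, feature_extractor_output=inputs)["pretty_midi_objects"][0]
midi.write("output.mid")
```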
@ -99,4 +99,4 @@ If you are interested in submitting a resource to be included here, please feel

[[autodoc]] PromptDepthAnythingImageProcessorFast
    - preprocess
    - post_process_depth_estimation
@ -19,6 +19,7 @@ rendered properly in your Markdown viewer.

The Qwen3-Next series represents our next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency.
The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:

- **Hybrid Attention**: Replaces standard attention with the combination of **Gated DeltaNet** and **Gated Attention**, enabling efficient context modeling.
- **High-Sparsity MoE**: Achieves an extremely low activation ratio of 1:50 in MoE layers, drastically reducing FLOPs per token while preserving model capacity.
- **Multi-Token Prediction (MTP)**: Boosts pretraining model performance and accelerates inference.
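
A minimal generation sketch, assuming an instruct checkpoint such as `Qwen/Qwen3-Next-80B-A3B-Instruct` and enough hardware to host it (the checkpoint name and settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed released instruct checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the Qwen3-Next architecture in two sentences."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```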
@ -152,4 +152,4 @@ $$D_{i} = e^{u + K_{i} - q} + e^{M_{i}} \tilde{D}_{i} \hbox{ where } q = \max(

which finally gives us

$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$
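
The role of the maximum subtraction in the exponentials above can be sanity-checked with a tiny, generic numerical sketch (this illustrates the general stabilization trick, not the model's actual recurrence):

```python
import numpy as np

scores = np.array([800.0, 802.0, 795.0])  # logits this large overflow a naive exp()
values = np.array([1.0, 2.0, 3.0])

# Naive softmax-weighted average: exp() overflows to inf and the ratio becomes nan.
naive = (np.exp(scores) * values).sum() / np.exp(scores).sum()

# Stable version: subtract the running maximum before exponentiating.
q = scores.max()
weights = np.exp(scores - q)              # stays in a safe numerical range
stable = (weights * values).sum() / weights.sum()

# The common factor exp(q) cancels between numerator and denominator,
# so the stable result equals the mathematically exact value.
print(naive, stable)
```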
@ -139,6 +139,7 @@ The architecture of this new version differs from the first in a few aspects:

#### Improvements on the second-pass model

The second seq2seq model, named text-to-unit model, is now non-autoregressive, meaning that it computes units in a **single forward pass**. This achievement is made possible by:

- the use of **character-level embeddings**, meaning that each character of the predicted translated text has its own embeddings, which are then used to predict the unit tokens.
- the use of an intermediate duration predictor, which predicts speech duration at the **character-level** on the predicted translated text.
- the use of a new text-to-unit decoder mixing convolutions and self-attention to handle longer context.
@ -146,6 +147,7 @@ The second seq2seq model, named text-to-unit model, is now non-auto regressive,

#### Difference in the speech encoder

The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms:

- the use of a chunked attention mask to prevent attention across chunks, ensuring that each position attends only to positions within its own chunk and a fixed number of previous chunks (a generic sketch of this masking pattern follows below).
- the use of relative position embeddings which only consider the distance between sequence elements rather than absolute positions. Please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://huggingface.co/papers/1803.02155) for more details.
- the use of a causal depth-wise convolution instead of a non-causal one.
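
A generic sketch of the chunked masking pattern described in the first bullet (illustrative only; the actual mask construction lives in the model code):

```python
import torch

def chunked_attention_mask(seq_len: int, chunk_size: int, num_left_chunks: int) -> torch.Tensor:
    # True means "may attend". Position i attends only to positions in its own chunk
    # and in at most `num_left_chunks` previous chunks, as described above.
    chunk_ids = torch.arange(seq_len) // chunk_size
    diff = chunk_ids[:, None] - chunk_ids[None, :]  # query chunk index minus key chunk index
    return (diff >= 0) & (diff <= num_left_chunks)

print(chunked_attention_mask(seq_len=8, chunk_size=2, num_left_chunks=1).int())
```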