From 374ded5ea40c9026cf36003c737e07c584dbd63d Mon Sep 17 00:00:00 2001 From: Yuanyuan Chen Date: Wed, 1 Oct 2025 00:41:03 +0800 Subject: [PATCH] Fix white space in documentation (#41157) * Fix white space Signed-off-by: Yuanyuan Chen * Revert changes Signed-off-by: Yuanyuan Chen * Fix autodoc Signed-off-by: Yuanyuan Chen --------- Signed-off-by: Yuanyuan Chen --- CONTRIBUTING.md | 10 ++++--- docs/source/en/attention_interface.md | 2 +- docs/source/en/auto_docstring.md | 6 ++-- docs/source/en/cache_explanation.md | 1 + docs/source/en/chat_extras.md | 2 +- docs/source/en/chat_templating.md | 6 ++-- docs/source/en/cursor.md | 1 + docs/source/en/generation_strategies.md | 5 ++++ docs/source/en/glossary.md | 6 ++-- docs/source/en/how_to_hack_models.md | 2 +- .../en/internal/model_debugging_utils.md | 7 +++-- docs/source/en/llm_tutorial.md | 1 + docs/source/en/main_classes/logging.md | 1 + docs/source/en/main_classes/processors.md | 1 + .../source/en/main_classes/text_generation.md | 16 +++++----- docs/source/en/model_doc/align.md | 1 + docs/source/en/model_doc/arcee.md | 2 +- docs/source/en/model_doc/beit.md | 1 + docs/source/en/model_doc/bert-generation.md | 2 +- docs/source/en/model_doc/bertweet.md | 1 + docs/source/en/model_doc/big_bird.md | 1 + docs/source/en/model_doc/bit.md | 3 +- docs/source/en/model_doc/bitnet.md | 24 +++++++-------- docs/source/en/model_doc/blip.md | 2 +- docs/source/en/model_doc/bloom.md | 3 ++ docs/source/en/model_doc/camembert.md | 6 ++-- docs/source/en/model_doc/chinese_clip.md | 2 +- docs/source/en/model_doc/clipseg.md | 2 +- docs/source/en/model_doc/cohere.md | 1 + docs/source/en/model_doc/cpmant.md | 2 +- docs/source/en/model_doc/data2vec.md | 3 ++ docs/source/en/model_doc/deberta.md | 1 + docs/source/en/model_doc/deepseek_v2.md | 2 +- docs/source/en/model_doc/deformable_detr.md | 6 ++-- docs/source/en/model_doc/deplot.md | 2 +- docs/source/en/model_doc/depth_anything.md | 2 +- docs/source/en/model_doc/depth_anything_v2.md | 2 +- docs/source/en/model_doc/depth_pro.md | 13 ++++---- docs/source/en/model_doc/detr.md | 6 ++-- docs/source/en/model_doc/dinat.md | 1 + .../en/model_doc/dinov2_with_registers.md | 3 +- docs/source/en/model_doc/doge.md | 2 +- docs/source/en/model_doc/dpr.md | 6 ++-- docs/source/en/model_doc/efficientloftr.md | 20 +++++-------- docs/source/en/model_doc/eomt.md | 2 +- docs/source/en/model_doc/exaone4.md | 2 +- docs/source/en/model_doc/falcon3.md | 1 + docs/source/en/model_doc/falcon_h1.md | 2 +- docs/source/en/model_doc/flaubert.md | 1 + docs/source/en/model_doc/florence2.md | 30 +++++++++---------- docs/source/en/model_doc/gemma3n.md | 12 ++++---- docs/source/en/model_doc/git.md | 2 +- docs/source/en/model_doc/glm4v_moe.md | 1 + docs/source/en/model_doc/gpt_bigcode.md | 1 + docs/source/en/model_doc/gptj.md | 1 + docs/source/en/model_doc/granite_speech.md | 1 + docs/source/en/model_doc/granitemoeshared.md | 2 +- docs/source/en/model_doc/granitevision.md | 1 + docs/source/en/model_doc/hgnet_v2.md | 2 +- docs/source/en/model_doc/informer.md | 2 +- docs/source/en/model_doc/instructblip.md | 2 +- .../en/model_doc/kyutai_speech_to_text.md | 1 + docs/source/en/model_doc/layoutlmv3.md | 5 ++-- docs/source/en/model_doc/lfm2.md | 2 +- docs/source/en/model_doc/lfm2_vl.md | 1 + docs/source/en/model_doc/lightglue.md | 10 +++---- docs/source/en/model_doc/lilt.md | 1 + docs/source/en/model_doc/llama4.md | 12 ++++---- docs/source/en/model_doc/llava.md | 1 + docs/source/en/model_doc/llava_next_video.md | 1 + docs/source/en/model_doc/llava_onevision.md | 1 + 
docs/source/en/model_doc/markuplm.md | 1 + docs/source/en/model_doc/mask2former.md | 2 +- docs/source/en/model_doc/maskformer.md | 4 +-- docs/source/en/model_doc/matcha.md | 2 +- docs/source/en/model_doc/minimax.md | 10 +++---- docs/source/en/model_doc/ministral.md | 2 +- docs/source/en/model_doc/mistral.md | 2 +- docs/source/en/model_doc/mixtral.md | 2 ++ docs/source/en/model_doc/mobilenet_v1.md | 18 +++++------ docs/source/en/model_doc/mobilenet_v2.md | 20 ++++++------- docs/source/en/model_doc/mobilevit.md | 1 + docs/source/en/model_doc/moshi.md | 4 +++ docs/source/en/model_doc/mra.md | 2 +- docs/source/en/model_doc/musicgen.md | 2 ++ docs/source/en/model_doc/musicgen_melody.md | 3 ++ docs/source/en/model_doc/nat.md | 1 + docs/source/en/model_doc/nllb.md | 4 +-- docs/source/en/model_doc/oneformer.md | 2 +- docs/source/en/model_doc/openai-gpt.md | 8 ++--- docs/source/en/model_doc/parakeet.md | 16 +++++----- docs/source/en/model_doc/patchtsmixer.md | 2 +- docs/source/en/model_doc/phimoe.md | 1 + docs/source/en/model_doc/pix2struct.md | 2 +- docs/source/en/model_doc/plbart.md | 2 +- docs/source/en/model_doc/pop2piano.md | 1 + .../en/model_doc/prompt_depth_anything.md | 2 +- docs/source/en/model_doc/qwen3_next.md | 1 + docs/source/en/model_doc/rwkv.md | 2 +- docs/source/en/model_doc/seamless_m4t_v2.md | 2 ++ docs/source/en/model_doc/seggpt.md | 1 + docs/source/en/model_doc/shieldgemma2.md | 6 ++-- docs/source/en/model_doc/superglue.md | 10 +++---- docs/source/en/model_doc/superpoint.md | 9 +++--- docs/source/en/model_doc/tapas.md | 1 + docs/source/en/model_doc/tapex.md | 1 + .../en/model_doc/time_series_transformer.md | 8 ++--- docs/source/en/model_doc/timesformer.md | 2 +- docs/source/en/model_doc/udop.md | 2 +- docs/source/en/model_doc/univnet.md | 2 +- docs/source/en/model_doc/upernet.md | 2 +- docs/source/en/model_doc/videomae.md | 1 + docs/source/en/model_doc/vit_mae.md | 1 + docs/source/en/model_doc/vitdet.md | 2 +- docs/source/en/model_doc/vitmatte.md | 2 +- docs/source/en/model_doc/vits.md | 6 ++-- docs/source/en/model_doc/voxtral.md | 1 + docs/source/en/model_doc/wav2vec2_phoneme.md | 8 ++--- docs/source/en/model_doc/xcodec.md | 3 +- docs/source/en/model_doc/xmod.md | 2 ++ docs/source/en/model_doc/yolos.md | 1 + docs/source/en/model_doc/yoso.md | 2 +- docs/source/en/model_doc/zamba.md | 1 + docs/source/en/model_doc/zamba2.md | 1 + docs/source/en/model_doc/zoedepth.md | 1 + docs/source/en/open_webui.md | 1 + docs/source/en/pad_truncation.md | 14 ++++----- docs/source/en/philosophy.md | 12 ++++---- docs/source/en/pipeline_gradio.md | 8 ++--- docs/source/en/pr_checks.md | 2 ++ docs/source/en/quantization/concept_guide.md | 8 ++--- .../source/en/quantization/finegrained_fp8.md | 2 +- docs/source/en/quantization/quanto.md | 2 +- docs/source/en/quantization/selecting.md | 16 +++++----- docs/source/en/serving.md | 1 + .../en/tasks/document_question_answering.md | 4 +++ docs/source/en/tasks/idefics.md | 1 + docs/source/en/tasks/image_text_to_text.md | 1 + docs/source/en/tasks/image_to_image.md | 1 + docs/source/en/tasks/mask_generation.md | 2 ++ .../en/tasks/masked_language_modeling.md | 1 + docs/source/en/tasks/object_detection.md | 8 +++-- docs/source/en/tasks/semantic_segmentation.md | 1 + docs/source/en/tasks/video_text_to_text.md | 1 + .../en/tasks/visual_question_answering.md | 3 ++ .../tasks/zero_shot_image_classification.md | 2 +- .../en/tasks/zero_shot_object_detection.md | 1 + docs/source/en/testing.md | 6 ++-- 148 files changed, 338 insertions(+), 246 deletions(-) diff --git 
a/CONTRIBUTING.md b/CONTRIBUTING.md index 7728546633b..ea62fd54588 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -278,13 +278,14 @@ are working on it).
useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.
☐ Make sure existing tests pass.
☐ If adding a new feature, also add tests for it.
- - If you are adding a new model, make sure you use + +- If you are adding a new model, make sure you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` to trigger the common tests. - - If you are adding new `@slow` tests, make sure they pass using +- If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`. - - If you are adding a new tokenizer, write tests and make sure +- If you are adding a new tokenizer, write tests and make sure `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` passes. - - CircleCI does not run the slow tests, but GitHub Actions does every night!
+- CircleCI does not run the slow tests, but GitHub Actions does every night!
☐ All public methods must have informative docstrings (see [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py) @@ -340,6 +341,7 @@ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/t ``` Like the slow tests, there are other environment variables available which are not enabled by default during testing: + - `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers. More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py). diff --git a/docs/source/en/attention_interface.md b/docs/source/en/attention_interface.md index 407a47a7d35..621aa7409da 100644 --- a/docs/source/en/attention_interface.md +++ b/docs/source/en/attention_interface.md @@ -193,4 +193,4 @@ def custom_attention_mask( It mostly works thanks to the `mask_function`, which is a `Callable` in the form of [torch's mask_mod functions](https://pytorch.org/blog/flexattention/), taking 4 indices as input and returning a boolean to indicate if this position should take part in the attention computation. -If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py). \ No newline at end of file +If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py). diff --git a/docs/source/en/auto_docstring.md b/docs/source/en/auto_docstring.md index 6445ee53014..e6c75341997 100644 --- a/docs/source/en/auto_docstring.md +++ b/docs/source/en/auto_docstring.md @@ -210,9 +210,9 @@ There are some rules for documenting different types of arguments and they're li This can span multiple lines. ``` - * Include `type` in backticks. - * Add *optional* if the argument is not required or has a default value. - * Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`. + * Include `type` in backticks. + * Add *optional* if the argument is not required or has a default value. + * Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`. These arguments can also be passed to `@auto_docstring` as a `custom_args` argument. It is used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file. diff --git a/docs/source/en/cache_explanation.md b/docs/source/en/cache_explanation.md index 1a3439bc792..6d6718b8cab 100644 --- a/docs/source/en/cache_explanation.md +++ b/docs/source/en/cache_explanation.md @@ -162,6 +162,7 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10) Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`]. The legacy format is essentially the same data structure but organized differently. + - It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer. - The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`. 
- The format is less flexible and doesn't support features like quantization or offloading. diff --git a/docs/source/en/chat_extras.md b/docs/source/en/chat_extras.md index 20d5cf22ce4..f5282515827 100644 --- a/docs/source/en/chat_extras.md +++ b/docs/source/en/chat_extras.md @@ -221,4 +221,4 @@ model_input = tokenizer.apply_chat_template( messages, tools = [current_time, multiply] ) -``` \ No newline at end of file +``` diff --git a/docs/source/en/chat_templating.md b/docs/source/en/chat_templating.md index b5f9bafa96e..1e83da188a0 100644 --- a/docs/source/en/chat_templating.md +++ b/docs/source/en/chat_templating.md @@ -77,9 +77,9 @@ Mistral-7B-Instruct uses `[INST]` and `[/INST]` tokens to indicate the start and The input to `apply_chat_template` should be structured as a list of dictionaries with `role` and `content` keys. The `role` key specifies the speaker, and the `content` key contains the message. The common roles are: - - `user` for messages from the user - - `assistant` for messages from the model - - `system` for directives on how the model should act (usually placed at the beginning of the chat) +- `user` for messages from the user +- `assistant` for messages from the model +- `system` for directives on how the model should act (usually placed at the beginning of the chat) [`apply_chat_template`] takes this list and returns a formatted sequence. Set `tokenize=True` if you want to tokenize the sequence. diff --git a/docs/source/en/cursor.md b/docs/source/en/cursor.md index 70e4a33f9ad..e56155a8e42 100644 --- a/docs/source/en/cursor.md +++ b/docs/source/en/cursor.md @@ -21,6 +21,7 @@ where `port` is the port used by `transformers serve` (`8000` by default). On th You're now ready to set things up on the app side! In Cursor, while you can't set a new provider, you can change the endpoint for OpenAI requests in the model selection settings. First, navigate to "Settings" > "Cursor Settings", "Models" tab, and expand the "API Keys" collapsible. To set your `transformers serve` endpoint, follow this order: + 1. Unselect ALL models in the list above (e.g. `gpt4`, ...); 2. Add and select the model you want to use (e.g. `Qwen/Qwen3-4B`) 3. Add some random text to OpenAI API Key. This field won't be used, but it can't be empty; diff --git a/docs/source/en/generation_strategies.md b/docs/source/en/generation_strategies.md index 7123896dd1a..d2d49e1f702 100644 --- a/docs/source/en/generation_strategies.md +++ b/docs/source/en/generation_strategies.md @@ -229,6 +229,7 @@ tokenizer.batch_decode(outputs, skip_special_tokens=True) ## Custom generation methods Custom generation methods enable specialized behavior such as: + - have the model continue thinking if it is uncertain; - roll back generation if the model gets stuck; - handle special tokens with custom logic; @@ -301,6 +302,7 @@ Updating your Python requirements accordingly will remove this error message. ### Creating a custom generation method To create a new generation method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it. + 1. The model you've designed your generation method with. 2. `custom_generate/generate.py`, which contains all the logic for your custom generation method. 3. `custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method. 
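For illustration, a minimal `custom_generate/generate.py` could be a plain greedy loop like the sketch below; it mirrors the `generate(model, input_ids, ...)` entry-point signature shown in the next hunk, and the greedy strategy and `max_new_tokens` default here are arbitrary choices rather than requirements.

```py
# custom_generate/generate.py -- greedy-decoding sketch (illustrative only)
import torch

def generate(model, input_ids, generation_config=None, left_padding=None, max_new_tokens=20, **kwargs):
    # generation_config and left_padding are accepted for interface compatibility but unused in this sketch
    for _ in range(max_new_tokens):
        logits = model(input_ids=input_ids).logits                   # forward pass over the current prefix
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```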
@@ -377,6 +379,7 @@ def generate(model, input_ids, generation_config=None, left_padding=None, **kwar ``` Follow the recommended practices below to ensure your custom generation method works as expected. + - Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`]. - Pin the `transformers` version in the requirements if you use any private method/attribute in `model`. - Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment. @@ -410,6 +413,7 @@ tags: ``` Recommended practices: + - Document input and output differences in [`~GenerationMixin.generate`]. - Add self-contained examples to enable quick experimentation. - Describe soft-requirements such as if the method only works well with a certain family of models. @@ -442,6 +446,7 @@ output = model.generate( ### Finding custom generation methods You can find all custom generation methods by [searching for their custom tag.](https://huggingface.co/models?other=custom_generate), `custom_generate`. In addition to the tag, we curate two collections of `custom_generate` methods: + - [Custom generation methods - Community](https://huggingface.co/collections/transformers-community/custom-generation-methods-community-6888fb1da0efbc592d3a8ab6) -- a collection of powerful methods contributed by the community; - [Custom generation methods - Tutorials](https://huggingface.co/collections/transformers-community/custom-generation-methods-tutorials-6823589657a94940ea02cfec) -- a collection of reference implementations for methods that previously were part of `transformers`, as well as tutorials for `custom_generate`. diff --git a/docs/source/en/glossary.md b/docs/source/en/glossary.md index 9e57c3fdc9f..1c8d8ebc214 100644 --- a/docs/source/en/glossary.md +++ b/docs/source/en/glossary.md @@ -185,9 +185,9 @@ See the [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/ The model head refers to the last layer of a neural network that accepts the raw hidden states and projects them onto a different dimension. There is a different model head for each task. For example: - * [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`]. - * [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`]. - * [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`]. +* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`]. +* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`]. +* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-ctc) on top of the base [`Wav2Vec2Model`]. 
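As a quick illustration of the base-model/head split described above (using `gpt2` only as an example checkpoint):

```py
from transformers import AutoModel, AutoModelForSequenceClassification

base = AutoModel.from_pretrained("gpt2")                                        # bare backbone, raw hidden states
clf = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)  # same backbone + linear head

print(type(base).__name__)  # GPT2Model
print(type(clf).__name__)   # GPT2ForSequenceClassification
```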
## I diff --git a/docs/source/en/how_to_hack_models.md b/docs/source/en/how_to_hack_models.md index 0a3c38a3e14..d5ce5bde790 100644 --- a/docs/source/en/how_to_hack_models.md +++ b/docs/source/en/how_to_hack_models.md @@ -149,4 +149,4 @@ Call [print_trainable_parameters](https://huggingface.co/docs/peft/package_refer ```py model.print_trainable_parameters() "trainable params: 589,824 || all params: 94,274,096 || trainable%: 0.6256" -``` \ No newline at end of file +``` diff --git a/docs/source/en/internal/model_debugging_utils.md b/docs/source/en/internal/model_debugging_utils.md index 8f0d0b15b63..b2bded0b895 100644 --- a/docs/source/en/internal/model_debugging_utils.md +++ b/docs/source/en/internal/model_debugging_utils.md @@ -218,9 +218,9 @@ path reference to the associated `.safetensors` file. Each tensor is written to the state dictionary. File names are constructed using the `module_path` as a prefix with a few possible postfixes that are built recursively. -* Module inputs are denoted with the `_inputs` and outputs by `_outputs`. -* `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`. -* `dict` instances will be postfixed with `_{key}`. +* Module inputs are denoted with the `_inputs` and outputs by `_outputs`. +* `list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`. +* `dict` instances will be postfixed with `_{key}`. ### Comparing between implementations @@ -255,6 +255,7 @@ how many tests are being skipped and for which models. When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip? This utility: + - scans all test_modeling_common methods - looks for times where a method is skipped - returns a summary json you can load as a DataFrame/inspect diff --git a/docs/source/en/llm_tutorial.md b/docs/source/en/llm_tutorial.md index 0cbbbc6ac04..bf7ac5e3145 100644 --- a/docs/source/en/llm_tutorial.md +++ b/docs/source/en/llm_tutorial.md @@ -94,6 +94,7 @@ model.generate(**inputs, num_beams=4, do_sample=True) ``` [`~GenerationMixin.generate`] can also be extended with external libraries or custom code: + 1. the `logits_processor` parameter accepts custom [`LogitsProcessor`] instances for manipulating the next token probability distribution; 2. the `stopping_criteria` parameters supports custom [`StoppingCriteria`] to stop text generation; 3. other custom generation methods can be loaded through the `custom_generate` flag ([docs](generation_strategies.md/#custom-decoding-methods)). diff --git a/docs/source/en/main_classes/logging.md b/docs/source/en/main_classes/logging.md index 34da2ac9d1b..330c68218bf 100644 --- a/docs/source/en/main_classes/logging.md +++ b/docs/source/en/main_classes/logging.md @@ -80,6 +80,7 @@ We use both in the `transformers` library. We leverage and adapt `logging`'s `ca management of these warning messages by the verbosity setters above. What does that mean for developers of the library? 
We should respect the following heuristics: + - `warnings` should be favored for developers of the library and libraries dependent on `transformers` - `logging` should be used for end-users of the library using it in every-day projects diff --git a/docs/source/en/main_classes/processors.md b/docs/source/en/main_classes/processors.md index 8863a632628..44a2bceeca6 100644 --- a/docs/source/en/main_classes/processors.md +++ b/docs/source/en/main_classes/processors.md @@ -17,6 +17,7 @@ rendered properly in your Markdown viewer. # Processors Processors can mean two different things in the Transformers library: + - the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) or [CLIP](../model_doc/clip) (text and vision) - deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD. diff --git a/docs/source/en/main_classes/text_generation.md b/docs/source/en/main_classes/text_generation.md index cb853f722e1..d879669bcab 100644 --- a/docs/source/en/main_classes/text_generation.md +++ b/docs/source/en/main_classes/text_generation.md @@ -30,15 +30,15 @@ like token streaming. ## GenerationConfig [[autodoc]] generation.GenerationConfig - - from_pretrained - - from_model_config - - save_pretrained - - update - - validate - - get_generation_mode + - from_pretrained + - from_model_config + - save_pretrained + - update + - validate + - get_generation_mode ## GenerationMixin [[autodoc]] GenerationMixin - - generate - - compute_transition_scores + - generate + - compute_transition_scores diff --git a/docs/source/en/model_doc/align.md b/docs/source/en/model_doc/align.md index 7379c84fc3a..275b510ccd5 100644 --- a/docs/source/en/model_doc/align.md +++ b/docs/source/en/model_doc/align.md @@ -148,6 +148,7 @@ for label, score in zip(candidate_labels, probs): ``` ## Resources + - Refer to the [Kakao Brain’s Open Source ViT, ALIGN, and the New COYO Text-Image Dataset](https://huggingface.co/blog/vit-align) blog post for more details. ## AlignConfig diff --git a/docs/source/en/model_doc/arcee.md b/docs/source/en/model_doc/arcee.md index a5335608edb..ebedd73a4a4 100644 --- a/docs/source/en/model_doc/arcee.md +++ b/docs/source/en/model_doc/arcee.md @@ -102,4 +102,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ## ArceeForTokenClassification [[autodoc]] ArceeForTokenClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/beit.md b/docs/source/en/model_doc/beit.md index 5158bafa395..ee516a935ed 100644 --- a/docs/source/en/model_doc/beit.md +++ b/docs/source/en/model_doc/beit.md @@ -123,6 +123,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - See also: [Image classification task guide](../tasks/image_classification) **Semantic segmentation** + - [Semantic segmentation task guide](../tasks/semantic_segmentation) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
diff --git a/docs/source/en/model_doc/bert-generation.md b/docs/source/en/model_doc/bert-generation.md index b5be3458db7..d57734b069b 100644 --- a/docs/source/en/model_doc/bert-generation.md +++ b/docs/source/en/model_doc/bert-generation.md @@ -156,4 +156,4 @@ print(tokenizer.decode(outputs[0])) ## BertGenerationDecoder [[autodoc]] BertGenerationDecoder - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/bertweet.md b/docs/source/en/model_doc/bertweet.md index 917adab47cc..6d34b88a561 100644 --- a/docs/source/en/model_doc/bertweet.md +++ b/docs/source/en/model_doc/bertweet.md @@ -88,6 +88,7 @@ echo -e "Plants create through a process known as photosynthesis." | tran ## Notes + - Use the [`AutoTokenizer`] or [`BertweetTokenizer`] because it's preloaded with a custom vocabulary adapted to tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Make sure to also install the [emoji](https://pypi.org/project/emoji/) library. - Inputs should be padded on the right (`padding="max_length"`) because BERT uses absolute position embeddings. diff --git a/docs/source/en/model_doc/big_bird.md b/docs/source/en/model_doc/big_bird.md index 877445a4ba5..c3137725814 100644 --- a/docs/source/en/model_doc/big_bird.md +++ b/docs/source/en/model_doc/big_bird.md @@ -87,6 +87,7 @@ print(f"The predicted token is: {predicted_token}") ## Notes + - Inputs should be padded on the right because BigBird uses absolute position embeddings. - BigBird supports `original_full` and `block_sparse` attention. If the input sequence length is less than 1024, it is recommended to use `original_full` since sparse patterns don't offer much benefit for smaller inputs. - The current implementation uses window size of 3 blocks and 2 global blocks, only supports the ITC-implementation, and doesn't support `num_random_blocks=0`. diff --git a/docs/source/en/model_doc/bit.md b/docs/source/en/model_doc/bit.md index 5a6630566fc..5ed3b8f816a 100644 --- a/docs/source/en/model_doc/bit.md +++ b/docs/source/en/model_doc/bit.md @@ -36,6 +36,7 @@ The original code can be found [here](https://github.com/google-research/big_tra ## Usage tips - BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://huggingface.co/papers/1803.08494), + 2) [weight standardization](https://huggingface.co/papers/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant impact on transfer learning. @@ -72,4 +73,4 @@ If you're interested in submitting a resource to be included here, please feel f ## BitForImageClassification [[autodoc]] BitForImageClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/bitnet.md b/docs/source/en/model_doc/bitnet.md index 69f9cb75131..c674f51fc30 100644 --- a/docs/source/en/model_doc/bitnet.md +++ b/docs/source/en/model_doc/bitnet.md @@ -38,22 +38,22 @@ Several versions of the model weights are available on Hugging Face: ### Model Details * **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework). - * Uses Rotary Position Embeddings (RoPE). - * Uses squared ReLU (ReLUΒ²) activation in FFN layers. - * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization. - * No bias terms in linear or normalization layers. + * Uses Rotary Position Embeddings (RoPE). 
+ * Uses squared ReLU (ReLUΒ²) activation in FFN layers. + * Employs [`subln`](https://proceedings.mlr.press/v202/wang23u.html) normalization. + * No bias terms in linear or normalization layers. * **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8). - * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass. - * Activations are quantized to 8-bit integers using absmax quantization (per-token). - * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.** + * Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass. + * Activations are quantized to 8-bit integers using absmax quantization (per-token). + * **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.** * **Parameters:** ~2 Billion * **Training Tokens:** 4 Trillion -* **Context Length:** Maximum sequence length of **4096 tokens**. - * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage. +* **Context Length:** Maximum sequence length of **4096 tokens**. + * *Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage. * **Training Stages:** - 1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule. - 2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning. - 3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs. + 1. **Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule. + 2. **Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning. + 3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs. * **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256). 
## Usage tips diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md index 5ef78728996..5e727050f6e 100644 --- a/docs/source/en/model_doc/blip.md +++ b/docs/source/en/model_doc/blip.md @@ -128,7 +128,7 @@ Refer to this [notebook](https://github.com/huggingface/notebooks/blob/main/exam ## BlipTextLMHeadModel [[autodoc]] BlipTextLMHeadModel -- forward + - forward ## BlipVisionModel diff --git a/docs/source/en/model_doc/bloom.md b/docs/source/en/model_doc/bloom.md index c78cb4447eb..51e2970c25f 100644 --- a/docs/source/en/model_doc/bloom.md +++ b/docs/source/en/model_doc/bloom.md @@ -43,16 +43,19 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). See also: + - [Causal language modeling task guide](../tasks/language_modeling) - [Text classification task guide](../tasks/sequence_classification) - [Token classification task guide](../tasks/token_classification) - [Question answering task guide](../tasks/question_answering) ⚑️ Inference + - A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization). - A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts). βš™οΈ Training + - A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed). ## BloomConfig diff --git a/docs/source/en/model_doc/camembert.md b/docs/source/en/model_doc/camembert.md index 971954ed52a..8affbd73a57 100644 --- a/docs/source/en/model_doc/camembert.md +++ b/docs/source/en/model_doc/camembert.md @@ -16,10 +16,10 @@ rendered properly in your Markdown viewer. *This model was released on 2019-11-10 and added to Hugging Face Transformers on 2020-11-16.*
[PyTorch and SDPA badge <img> markup: whitespace-only change]
# CamemBERT diff --git a/docs/source/en/model_doc/chinese_clip.md b/docs/source/en/model_doc/chinese_clip.md index 7ed4d503c00..96b094ccd91 100644 --- a/docs/source/en/model_doc/chinese_clip.md +++ b/docs/source/en/model_doc/chinese_clip.md @@ -119,4 +119,4 @@ Currently, following scales of pretrained Chinese-CLIP models are available on ## ChineseCLIPVisionModel [[autodoc]] ChineseCLIPVisionModel - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/clipseg.md b/docs/source/en/model_doc/clipseg.md index 7ca9b3926ac..099fd4fb1ba 100644 --- a/docs/source/en/model_doc/clipseg.md +++ b/docs/source/en/model_doc/clipseg.md @@ -106,4 +106,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h ## CLIPSegForImageSegmentation [[autodoc]] CLIPSegForImageSegmentation - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 1deca4d00ee..022a178b5cf 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -122,6 +122,7 @@ visualizer("Plants create energy through a process known as") ## Notes + - Don't use the dtype parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast). ## CohereConfig diff --git a/docs/source/en/model_doc/cpmant.md b/docs/source/en/model_doc/cpmant.md index 47eec6e79d6..bb70a369bb7 100644 --- a/docs/source/en/model_doc/cpmant.md +++ b/docs/source/en/model_doc/cpmant.md @@ -49,4 +49,4 @@ This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The ori ## CpmAntForCausalLM [[autodoc]] CpmAntForCausalLM - - all \ No newline at end of file + - all diff --git a/docs/source/en/model_doc/data2vec.md b/docs/source/en/model_doc/data2vec.md index a3845f3c0ff..66213b42ae7 100644 --- a/docs/source/en/model_doc/data2vec.md +++ b/docs/source/en/model_doc/data2vec.md @@ -103,6 +103,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`Data2VecVisionForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). 
**Data2VecText documentation resources** + - [Text classification task guide](../tasks/sequence_classification) - [Token classification task guide](../tasks/token_classification) - [Question answering task guide](../tasks/question_answering) @@ -111,10 +112,12 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [Multiple choice task guide](../tasks/multiple_choice) **Data2VecAudio documentation resources** + - [Audio classification task guide](../tasks/audio_classification) - [Automatic speech recognition task guide](../tasks/asr) **Data2VecVision documentation resources** + - [Image classification](../tasks/image_classification) - [Semantic segmentation](../tasks/semantic_segmentation) diff --git a/docs/source/en/model_doc/deberta.md b/docs/source/en/model_doc/deberta.md index 76fe8e1a3b6..08be80c19ff 100644 --- a/docs/source/en/model_doc/deberta.md +++ b/docs/source/en/model_doc/deberta.md @@ -92,6 +92,7 @@ echo -e '{"text": "A soccer game with multiple people playing.", "text_pair": "S ## Notes + - DeBERTa uses **relative position embeddings**, so it does not require **right-padding** like BERT. - For best results, use DeBERTa on sentence-level or sentence-pair classification tasks like MNLI, RTE, or SST-2. - If you're using DeBERTa for token-level tasks like masked language modeling, make sure to load a checkpoint specifically pretrained or fine-tuned for token-level tasks. diff --git a/docs/source/en/model_doc/deepseek_v2.md b/docs/source/en/model_doc/deepseek_v2.md index bcdf65fbe8c..fcff8521c07 100644 --- a/docs/source/en/model_doc/deepseek_v2.md +++ b/docs/source/en/model_doc/deepseek_v2.md @@ -47,4 +47,4 @@ The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures f ## DeepseekV2ForSequenceClassification [[autodoc]] DeepseekV2ForSequenceClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/deformable_detr.md b/docs/source/en/model_doc/deformable_detr.md index da03770bcbe..c83dede7808 100644 --- a/docs/source/en/model_doc/deformable_detr.md +++ b/docs/source/en/model_doc/deformable_detr.md @@ -16,9 +16,9 @@ rendered properly in your Markdown viewer. *This model was released on 2020-10-08 and added to Hugging Face Transformers on 2022-09-14.*
[PyTorch badge <img> markup: whitespace-only change]
# Deformable DETR diff --git a/docs/source/en/model_doc/deplot.md b/docs/source/en/model_doc/deplot.md index 0eb3975530a..5a7d4d12dcd 100644 --- a/docs/source/en/model_doc/deplot.md +++ b/docs/source/en/model_doc/deplot.md @@ -68,4 +68,4 @@ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, nu DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct). - \ No newline at end of file + diff --git a/docs/source/en/model_doc/depth_anything.md b/docs/source/en/model_doc/depth_anything.md index 5ac7007595f..44774c961ea 100644 --- a/docs/source/en/model_doc/depth_anything.md +++ b/docs/source/en/model_doc/depth_anything.md @@ -86,4 +86,4 @@ Image.fromarray(depth.astype("uint8")) ## DepthAnythingForDepthEstimation [[autodoc]] DepthAnythingForDepthEstimation - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/depth_anything_v2.md b/docs/source/en/model_doc/depth_anything_v2.md index e8637ba6192..fbcf2248f65 100644 --- a/docs/source/en/model_doc/depth_anything_v2.md +++ b/docs/source/en/model_doc/depth_anything_v2.md @@ -110,4 +110,4 @@ If you're interested in submitting a resource to be included here, please feel f ## DepthAnythingForDepthEstimation [[autodoc]] DepthAnythingForDepthEstimation - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/depth_pro.md b/docs/source/en/model_doc/depth_pro.md index 6872fca5138..c19703cdccc 100644 --- a/docs/source/en/model_doc/depth_pro.md +++ b/docs/source/en/model_doc/depth_pro.md @@ -84,12 +84,13 @@ alt="drawing" width="600"/> The `DepthProForDepthEstimation` model uses a `DepthProEncoder`, for encoding the input image and a `FeatureFusionStage` for fusing the output features from encoder. The `DepthProEncoder` further uses two encoders: + - `patch_encoder` - - Input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration. - - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`. - - These patches are processed by the **`patch_encoder`** + - Input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration. + - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`. + - These patches are processed by the **`patch_encoder`** - `image_encoder` - - Input image is also rescaled to `patch_size` and processed by the **`image_encoder`** + - Input image is also rescaled to `patch_size` and processed by the **`image_encoder`** Both these encoders can be configured via `patch_model_config` and `image_model_config` respectively, both of which are separate `Dinov2Model` by default. 
@@ -159,8 +160,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - Official Implementation: [apple/ml-depth-pro](https://github.com/apple/ml-depth-pro) - DepthPro Inference Notebook: [DepthPro Inference](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DepthPro_inference.ipynb) - DepthPro for Super Resolution and Image Segmentation - - Read blog on Medium: [Depth Pro: Beyond Depth](https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba) - - Code on Github: [geetu040/depthpro-beyond-depth](https://github.com/geetu040/depthpro-beyond-depth) + - Read blog on Medium: [Depth Pro: Beyond Depth](https://medium.com/@raoarmaghanshakir040/depth-pro-beyond-depth-9d822fc557ba) + - Code on Github: [geetu040/depthpro-beyond-depth](https://github.com/geetu040/depthpro-beyond-depth) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. diff --git a/docs/source/en/model_doc/detr.md b/docs/source/en/model_doc/detr.md index 6d7792803c5..46c9d3dadce 100644 --- a/docs/source/en/model_doc/detr.md +++ b/docs/source/en/model_doc/detr.md @@ -16,9 +16,9 @@ rendered properly in your Markdown viewer. *This model was released on 2020-05-26 and added to Hugging Face Transformers on 2021-06-09.*
[PyTorch badge <img> markup: whitespace-only change]
# DETR diff --git a/docs/source/en/model_doc/dinat.md b/docs/source/en/model_doc/dinat.md index e6d3385003c..89f0f5cb657 100644 --- a/docs/source/en/model_doc/dinat.md +++ b/docs/source/en/model_doc/dinat.md @@ -65,6 +65,7 @@ DiNAT can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`. Notes: + - DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention. You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`. Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet. diff --git a/docs/source/en/model_doc/dinov2_with_registers.md b/docs/source/en/model_doc/dinov2_with_registers.md index 503291eeb5f..d6b9c08f2f8 100644 --- a/docs/source/en/model_doc/dinov2_with_registers.md +++ b/docs/source/en/model_doc/dinov2_with_registers.md @@ -25,6 +25,7 @@ The [Vision Transformer](vit) (ViT) is a transformer encoder model (BERT-like) o Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include [DINOv2](dinov2) and [MAE](vit_mae). The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It's due to the model using some image patches as β€œregisters”. The authors propose a fix: just add some new tokens (called "register" tokens), which you only use during pre-training (and throw away afterwards). This results in: + - no artifacts - interpretable attention maps - and improved performances. @@ -57,4 +58,4 @@ The original code can be found [here](https://github.com/facebookresearch/dinov2 ## Dinov2WithRegistersForImageClassification [[autodoc]] Dinov2WithRegistersForImageClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/doge.md b/docs/source/en/model_doc/doge.md index ffa9ced7913..b2e44356ddc 100644 --- a/docs/source/en/model_doc/doge.md +++ b/docs/source/en/model_doc/doge.md @@ -101,4 +101,4 @@ outputs = model.generate( ## DogeForSequenceClassification [[autodoc]] DogeForSequenceClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/dpr.md b/docs/source/en/model_doc/dpr.md index 5fe48bc47e7..18b060cb111 100644 --- a/docs/source/en/model_doc/dpr.md +++ b/docs/source/en/model_doc/dpr.md @@ -44,9 +44,9 @@ This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The ori - DPR consists in three models: - * Question encoder: encode questions as vectors - * Context encoder: encode contexts as vectors - * Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question). + * Question encoder: encode questions as vectors + * Context encoder: encode contexts as vectors + * Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question). 
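For a concrete feel of how the question and context encoders work together, here is a minimal retrieval-scoring sketch; the `facebook/dpr-*-single-nq-base` checkpoints are assumed to be the standard public ones.

```py
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Questions and contexts are embedded in the same vector space; relevance is a dot product.
q_emb = q_encoder(**q_tokenizer("What is the capital of France?", return_tensors="pt")).pooler_output
ctx_emb = ctx_encoder(**ctx_tokenizer("Paris is the capital of France.", return_tensors="pt")).pooler_output
score = (q_emb @ ctx_emb.T).item()  # higher means the context is more relevant to the question
```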
## DPRConfig diff --git a/docs/source/en/model_doc/efficientloftr.md b/docs/source/en/model_doc/efficientloftr.md index faf71f4bac0..da28a68074a 100644 --- a/docs/source/en/model_doc/efficientloftr.md +++ b/docs/source/en/model_doc/efficientloftr.md @@ -144,27 +144,23 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size ## EfficientLoFTRImageProcessor [[autodoc]] EfficientLoFTRImageProcessor - -- preprocess -- post_process_keypoint_matching -- visualize_keypoint_matching + - preprocess + - post_process_keypoint_matching + - visualize_keypoint_matching ## EfficientLoFTRImageProcessorFast [[autodoc]] EfficientLoFTRImageProcessorFast - -- preprocess -- post_process_keypoint_matching -- visualize_keypoint_matching + - preprocess + - post_process_keypoint_matching + - visualize_keypoint_matching ## EfficientLoFTRModel [[autodoc]] EfficientLoFTRModel - -- forward + - forward ## EfficientLoFTRForKeypointMatching [[autodoc]] EfficientLoFTRForKeypointMatching - -- forward + - forward diff --git a/docs/source/en/model_doc/eomt.md b/docs/source/en/model_doc/eomt.md index 199d87dc794..7ff1419b381 100644 --- a/docs/source/en/model_doc/eomt.md +++ b/docs/source/en/model_doc/eomt.md @@ -207,4 +207,4 @@ plt.show() ## EomtForUniversalSegmentation [[autodoc]] EomtForUniversalSegmentation - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/exaone4.md b/docs/source/en/model_doc/exaone4.md index 93ca33babd3..9482f5be2c0 100644 --- a/docs/source/en/model_doc/exaone4.md +++ b/docs/source/en/model_doc/exaone4.md @@ -204,4 +204,4 @@ print(tokenizer.decode(output[0])) ## Exaone4ForQuestionAnswering [[autodoc]] Exaone4ForQuestionAnswering - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/falcon3.md b/docs/source/en/model_doc/falcon3.md index 368a5457ab6..3d79a4e225d 100644 --- a/docs/source/en/model_doc/falcon3.md +++ b/docs/source/en/model_doc/falcon3.md @@ -30,5 +30,6 @@ Depth up-scaling for improved reasoning: Building on recent studies on the effec Knowledge distillation for better tiny models: To provide compact and efficient alternatives, we developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency. ## Resources + - [Blog post](https://huggingface.co/blog/falcon3) - [Models on Huggingface](https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026) diff --git a/docs/source/en/model_doc/falcon_h1.md b/docs/source/en/model_doc/falcon_h1.md index c17ecea1cc0..48a647cd379 100644 --- a/docs/source/en/model_doc/falcon_h1.md +++ b/docs/source/en/model_doc/falcon_h1.md @@ -60,4 +60,4 @@ print(tokenizer.batch_decode(response, skip_special_tokens=True)[0]) [[autodoc]] FalconH1ForCausalLM - forward -This HF implementation is contributed by [younesbelkada](https://github.com/younesbelkada) and [DhiaEddineRhaiem](https://github.com/dhiaEddineRhaiem). \ No newline at end of file +This HF implementation is contributed by [younesbelkada](https://github.com/younesbelkada) and [DhiaEddineRhaiem](https://github.com/dhiaEddineRhaiem). 
diff --git a/docs/source/en/model_doc/flaubert.md b/docs/source/en/model_doc/flaubert.md index dd3ce34336d..fe5b96d00c5 100644 --- a/docs/source/en/model_doc/flaubert.md +++ b/docs/source/en/model_doc/flaubert.md @@ -44,6 +44,7 @@ community for further reproducible experiments in French NLP.* This model was contributed by [formiel](https://huggingface.co/formiel). The original code can be found [here](https://github.com/getalp/Flaubert). Tips: + - Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective). ## Resources diff --git a/docs/source/en/model_doc/florence2.md b/docs/source/en/model_doc/florence2.md index 77e8de10c31..b7171e1faab 100644 --- a/docs/source/en/model_doc/florence2.md +++ b/docs/source/en/model_doc/florence2.md @@ -138,21 +138,21 @@ print(parsed_answer) ## Notes - Florence-2 is a prompt-based model. You need to provide a task prompt to tell the model what to do. Supported tasks are: - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` - - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` + - `` - The raw output of the model is a string that needs to be parsed. The [`Florence2Processor`] has a [`~Florence2Processor.post_process_generation`] method that can parse the string into a more usable format, like bounding boxes and labels for object detection. ## Resources diff --git a/docs/source/en/model_doc/gemma3n.md b/docs/source/en/model_doc/gemma3n.md index 7c2e3ecc926..8012ed675a2 100644 --- a/docs/source/en/model_doc/gemma3n.md +++ b/docs/source/en/model_doc/gemma3n.md @@ -121,9 +121,9 @@ echo -e "Plants create energy through a process known as" | transformers run --t ## Notes -- Use [`Gemma3nForConditionalGeneration`] for image-audio-and-text, image-and-text, image-and-audio, audio-and-text, +- Use [`Gemma3nForConditionalGeneration`] for image-audio-and-text, image-and-text, image-and-audio, audio-and-text, image-only and audio-only inputs. -- Gemma 3n supports multiple images per input, but make sure the images are correctly batched before passing them to +- Gemma 3n supports multiple images per input, but make sure the images are correctly batched before passing them to the processor. Each batch should be a list of one or more images. ```py @@ -148,11 +148,11 @@ echo -e "Plants create energy through a process known as" | transformers run --t ] ``` -- Text passed to the processor should have a `` token wherever an image should be inserted. -- Gemma 3n accept at most one target audio clip per input, though multiple audio clips can be provided in few-shot +- Text passed to the processor should have a `` token wherever an image should be inserted. +- Gemma 3n accept at most one target audio clip per input, though multiple audio clips can be provided in few-shot prompts, for example. -- Text passed to the processor should have a `` token wherever an audio clip should be inserted. -- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs. +- Text passed to the processor should have a `` token wherever an audio clip should be inserted. +- The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs. 
## Gemma3nAudioFeatureExtractor diff --git a/docs/source/en/model_doc/git.md b/docs/source/en/model_doc/git.md index a2aa0901b21..06a65a6dd89 100644 --- a/docs/source/en/model_doc/git.md +++ b/docs/source/en/model_doc/git.md @@ -81,4 +81,4 @@ The resource should ideally demonstrate something new instead of duplicating an ## GitForCausalLM [[autodoc]] GitForCausalLM - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/glm4v_moe.md b/docs/source/en/model_doc/glm4v_moe.md index 0388cc9eb61..c814fdb5bec 100644 --- a/docs/source/en/model_doc/glm4v_moe.md +++ b/docs/source/en/model_doc/glm4v_moe.md @@ -35,6 +35,7 @@ Through our open-source work, we aim to explore the technological frontier toget ![bench_45](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg) Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including: + - **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition) - **Video understanding** (long video segmentation and event recognition) - **GUI tasks** (screen reading, icon recognition, desktop operation assistance) diff --git a/docs/source/en/model_doc/gpt_bigcode.md b/docs/source/en/model_doc/gpt_bigcode.md index fec23ad0f14..9f051c347f9 100644 --- a/docs/source/en/model_doc/gpt_bigcode.md +++ b/docs/source/en/model_doc/gpt_bigcode.md @@ -36,6 +36,7 @@ The model is an optimized [GPT2 model](https://huggingface.co/docs/transformers/ ## Implementation details The main differences compared to GPT2. + - Added support for Multi-Query Attention. - Use `gelu_pytorch_tanh` instead of classic `gelu`. - Avoid unnecessary synchronizations (this has since been added to GPT2 in #20061, but wasn't in the reference codebase). diff --git a/docs/source/en/model_doc/gptj.md b/docs/source/en/model_doc/gptj.md index 59e84daea5c..7b81ee12d27 100644 --- a/docs/source/en/model_doc/gptj.md +++ b/docs/source/en/model_doc/gptj.md @@ -133,6 +133,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`GPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling), [text generation example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation), and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). **Documentation resources** + - [Text classification task guide](../tasks/sequence_classification) - [Question answering task guide](../tasks/question_answering) - [Causal language modeling task guide](../tasks/language_modeling) diff --git a/docs/source/en/model_doc/granite_speech.md b/docs/source/en/model_doc/granite_speech.md index 680dba3a473..1d05ee346b6 100644 --- a/docs/source/en/model_doc/granite_speech.md +++ b/docs/source/en/model_doc/granite_speech.md @@ -37,6 +37,7 @@ Note that most of the aforementioned components are implemented generically to e This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9944), [Avihu Dekel](https://huggingface.co/Avihu), and [George Saon](https://huggingface.co/gsaon). ## Usage tips + - This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. 
Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied! diff --git a/docs/source/en/model_doc/granitemoeshared.md b/docs/source/en/model_doc/granitemoeshared.md index 8b256de647f..9db702c9f70 100644 --- a/docs/source/en/model_doc/granitemoeshared.md +++ b/docs/source/en/model_doc/granitemoeshared.md @@ -62,4 +62,4 @@ This HF implementation is contributed by [Mayank Mishra](https://huggingface.co/ ## GraniteMoeSharedForCausalLM [[autodoc]] GraniteMoeSharedForCausalLM - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/granitevision.md b/docs/source/en/model_doc/granitevision.md index f5a6316a22c..b95982ee81f 100644 --- a/docs/source/en/model_doc/granitevision.md +++ b/docs/source/en/model_doc/granitevision.md @@ -22,6 +22,7 @@ rendered properly in your Markdown viewer. The [Granite Vision](https://www.ibm.com/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more) model is a variant of [LLaVA-NeXT](llava_next), leveraging a [Granite](granite) language model alongside a [SigLIP](SigLIP) visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to [VipLlava](vipllava). It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios. Tips: + - This model is loaded into Transformers as an instance of LlaVA-Next. The usage and tips from [LLaVA-NeXT](llava_next) apply to this model as well. - You can apply the chat template on the tokenizer / processor in the same way as well. Example chat format: diff --git a/docs/source/en/model_doc/hgnet_v2.md b/docs/source/en/model_doc/hgnet_v2.md index e5da5a0582d..8e7791ce71e 100644 --- a/docs/source/en/model_doc/hgnet_v2.md +++ b/docs/source/en/model_doc/hgnet_v2.md @@ -89,4 +89,4 @@ print(f"The predicted class label is: {predicted_class_label}") ## HGNetV2ForImageClassification [[autodoc]] HGNetV2ForImageClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/informer.md b/docs/source/en/model_doc/informer.md index 7e79399cbc5..a9cea0f09ca 100644 --- a/docs/source/en/model_doc/informer.md +++ b/docs/source/en/model_doc/informer.md @@ -52,4 +52,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h ## InformerForPrediction [[autodoc]] InformerForPrediction - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/instructblip.md b/docs/source/en/model_doc/instructblip.md index d22d8df0d39..ac84a71d887 100644 --- a/docs/source/en/model_doc/instructblip.md +++ b/docs/source/en/model_doc/instructblip.md @@ -77,4 +77,4 @@ The attributes can be obtained from model config, as `model.config.num_query_tok [[autodoc]] InstructBlipForConditionalGeneration - forward - - generate \ No newline at end of file + - generate diff --git a/docs/source/en/model_doc/kyutai_speech_to_text.md b/docs/source/en/model_doc/kyutai_speech_to_text.md index cdd4aec7302..f3482c37ae0 100644 --- a/docs/source/en/model_doc/kyutai_speech_to_text.md +++ b/docs/source/en/model_doc/kyutai_speech_to_text.md @@ -19,6 +19,7 @@ rendered properly in your Markdown viewer. 
## Overview [Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai's lab has released two model checkpoints: + - [kyutai/stt-1b-en_fr](https://huggingface.co/kyutai/stt-1b-en_fr): a 1B-parameter model capable of transcribing both English and French - [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en): a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy diff --git a/docs/source/en/model_doc/layoutlmv3.md b/docs/source/en/model_doc/layoutlmv3.md index 9bb75e7772b..b9964fa3f86 100644 --- a/docs/source/en/model_doc/layoutlmv3.md +++ b/docs/source/en/model_doc/layoutlmv3.md @@ -37,8 +37,8 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi ## Usage tips - In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that: - - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format. - - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece. + - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format. + - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece. Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model. - Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor. @@ -73,6 +73,7 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 - [Question answering task guide](../tasks/question_answering) **Document question answering** + - [Document question answering task guide](../tasks/document_question_answering) ## LayoutLMv3Config diff --git a/docs/source/en/model_doc/lfm2.md b/docs/source/en/model_doc/lfm2.md index 0e78f9935f9..58f1d754588 100644 --- a/docs/source/en/model_doc/lfm2.md +++ b/docs/source/en/model_doc/lfm2.md @@ -82,4 +82,4 @@ print(tokenizer.decode(output[0], skip_special_tokens=False)) ## Lfm2ForCausalLM [[autodoc]] Lfm2ForCausalLM - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/lfm2_vl.md b/docs/source/en/model_doc/lfm2_vl.md index 2e25d94e883..fb6b2ad8a4e 100644 --- a/docs/source/en/model_doc/lfm2_vl.md +++ b/docs/source/en/model_doc/lfm2_vl.md @@ -28,6 +28,7 @@ rendered properly in your Markdown viewer. ## Architecture LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. 
Two variants are implemented: + * Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B * Base (86M) for fast image processing for LFM2-VL-450M diff --git a/docs/source/en/model_doc/lightglue.md b/docs/source/en/model_doc/lightglue.md index 16827345ef0..878bd1982ed 100644 --- a/docs/source/en/model_doc/lightglue.md +++ b/docs/source/en/model_doc/lightglue.md @@ -143,13 +143,11 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size ## LightGlueImageProcessor [[autodoc]] LightGlueImageProcessor - -- preprocess -- post_process_keypoint_matching -- visualize_keypoint_matching + - preprocess + - post_process_keypoint_matching + - visualize_keypoint_matching ## LightGlueForKeypointMatching [[autodoc]] LightGlueForKeypointMatching - -- forward + - forward diff --git a/docs/source/en/model_doc/lilt.md b/docs/source/en/model_doc/lilt.md index 54475e7cb3b..407e4aad3c4 100644 --- a/docs/source/en/model_doc/lilt.md +++ b/docs/source/en/model_doc/lilt.md @@ -62,6 +62,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - Demo notebooks for LiLT can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT). **Documentation resources** + - [Text classification task guide](../tasks/sequence_classification) - [Token classification task guide](../tasks/token_classification) - [Question answering task guide](../tasks/question_answering) diff --git a/docs/source/en/model_doc/llama4.md b/docs/source/en/model_doc/llama4.md index 84812a41997..ee7f2e2a54f 100644 --- a/docs/source/en/model_doc/llama4.md +++ b/docs/source/en/model_doc/llama4.md @@ -27,9 +27,11 @@ rendered properly in your Markdown viewer. [Llama 4](https://ai.meta.com/blog/llama-4-multimodal-intelligence/), developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models: + - The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts. - The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts. - + Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi). @@ -421,24 +423,24 @@ model = Llama4ForConditionalGeneration.from_pretrained( ## Llama4ForConditionalGeneration [[autodoc]] Llama4ForConditionalGeneration -- forward + - forward ## Llama4ForCausalLM [[autodoc]] Llama4ForCausalLM -- forward + - forward ## Llama4TextModel [[autodoc]] Llama4TextModel -- forward + - forward ## Llama4ForCausalLM [[autodoc]] Llama4ForCausalLM -- forward + - forward ## Llama4VisionModel [[autodoc]] Llama4VisionModel -- forward + - forward diff --git a/docs/source/en/model_doc/llava.md b/docs/source/en/model_doc/llava.md index ea402071451..e387fb4b54c 100644 --- a/docs/source/en/model_doc/llava.md +++ b/docs/source/en/model_doc/llava.md @@ -57,6 +57,7 @@ The attributes can be obtained from model config, as `model.config.vision_config Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method. **Important:** + - You must construct a conversation history β€” passing a plain string won't work. 
- Each message should be a dictionary with `"role"` and `"content"` keys. - The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`. diff --git a/docs/source/en/model_doc/llava_next_video.md b/docs/source/en/model_doc/llava_next_video.md index cbbe1361d3b..61aa7e1ffc5 100644 --- a/docs/source/en/model_doc/llava_next_video.md +++ b/docs/source/en/model_doc/llava_next_video.md @@ -64,6 +64,7 @@ The attributes can be obtained from model config, as `model.config.vision_config Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor's `apply_chat_template` method. **Important:** + - You must construct a conversation history β€” passing a plain string won't work. - Each message should be a dictionary with `"role"` and `"content"` keys. - The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`. diff --git a/docs/source/en/model_doc/llava_onevision.md b/docs/source/en/model_doc/llava_onevision.md index 48fa769835f..08bc075495b 100644 --- a/docs/source/en/model_doc/llava_onevision.md +++ b/docs/source/en/model_doc/llava_onevision.md @@ -59,6 +59,7 @@ Tips: Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method. **Important:** + - You must construct a conversation history β€” passing a plain string won't work. - Each message should be a dictionary with `"role"` and `"content"` keys. - The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`. diff --git a/docs/source/en/model_doc/markuplm.md b/docs/source/en/model_doc/markuplm.md index c7608f397f6..504817a1499 100644 --- a/docs/source/en/model_doc/markuplm.md +++ b/docs/source/en/model_doc/markuplm.md @@ -30,6 +30,7 @@ performance, similar to [LayoutLM](layoutlm). The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains state-of-the-art results on 2 important benchmarks: + - [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages) - [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset for information extraction from web pages (basically named-entity recognition on web pages) diff --git a/docs/source/en/model_doc/mask2former.md b/docs/source/en/model_doc/mask2former.md index fc4b87f836d..91a02cf6f71 100644 --- a/docs/source/en/model_doc/mask2former.md +++ b/docs/source/en/model_doc/mask2former.md @@ -86,4 +86,4 @@ The resource should ideally demonstrate something new instead of duplicating an - preprocess - post_process_semantic_segmentation - post_process_instance_segmentation - - post_process_panoptic_segmentation \ No newline at end of file + - post_process_panoptic_segmentation diff --git a/docs/source/en/model_doc/maskformer.md b/docs/source/en/model_doc/maskformer.md index 17ef4c876e0..aed2dcfa6c4 100644 --- a/docs/source/en/model_doc/maskformer.md +++ b/docs/source/en/model_doc/maskformer.md @@ -44,7 +44,7 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The ## Usage tips -- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). 
During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters). +- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters). - If you want to train the model in a distributed environment across multiple nodes, then one should update the `get_num_masks` function inside in the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169). @@ -102,4 +102,4 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The ## MaskFormerForInstanceSegmentation [[autodoc]] MaskFormerForInstanceSegmentation - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/matcha.md b/docs/source/en/model_doc/matcha.md index 9180d765c2b..a5b2689dcb5 100644 --- a/docs/source/en/model_doc/matcha.md +++ b/docs/source/en/model_doc/matcha.md @@ -79,4 +79,4 @@ scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, nu MatCha is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct). - \ No newline at end of file + diff --git a/docs/source/en/model_doc/minimax.md b/docs/source/en/model_doc/minimax.md index a27d45089ce..d1fe109c243 100644 --- a/docs/source/en/model_doc/minimax.md +++ b/docs/source/en/model_doc/minimax.md @@ -35,12 +35,12 @@ The architecture of MiniMax is briefly described as follows: - Activated Parameters per Token: 45.9B - Number Layers: 80 - Hybrid Attention: a softmax attention is positioned after every 7 lightning attention. 
- - Number of attention heads: 64 - - Attention head dimension: 128 + - Number of attention heads: 64 + - Attention head dimension: 128 - Mixture of Experts: - - Number of experts: 32 - - Expert hidden dimension: 9216 - - Top-2 routing strategy + - Number of experts: 32 + - Expert hidden dimension: 9216 + - Top-2 routing strategy - Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000 - Hidden Size: 6144 - Vocab Size: 200,064 diff --git a/docs/source/en/model_doc/ministral.md b/docs/source/en/model_doc/ministral.md index c2128512586..117547934f3 100644 --- a/docs/source/en/model_doc/ministral.md +++ b/docs/source/en/model_doc/ministral.md @@ -83,4 +83,4 @@ The example below demonstrates how to use Ministral for text generation: ## MinistralForQuestionAnswering [[autodoc]] MinistralForQuestionAnswering -- forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/mistral.md b/docs/source/en/model_doc/mistral.md index 865ee414532..4c598fc79a7 100644 --- a/docs/source/en/model_doc/mistral.md +++ b/docs/source/en/model_doc/mistral.md @@ -163,4 +163,4 @@ Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/bl ## MistralForQuestionAnswering [[autodoc]] MistralForQuestionAnswering -- forward + - forward diff --git a/docs/source/en/model_doc/mixtral.md b/docs/source/en/model_doc/mixtral.md index 7665b5901a6..1e9574145aa 100644 --- a/docs/source/en/model_doc/mixtral.md +++ b/docs/source/en/model_doc/mixtral.md @@ -42,6 +42,7 @@ Mixtral-8x7B is a decoder-only Transformer with the following architectural choi - Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length. The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral): + - Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens - GQA (Grouped Query Attention) - allowing faster inference and lower cache size. - Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens. @@ -55,6 +56,7 @@ For more details refer to the [release blog post](https://mistral.ai/news/mixtra ## Usage tips The Mistral team has released 2 checkpoints: + - a base model, [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), which has been pre-trained to predict the next token on internet-scale data. - an instruction tuned model, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), which is the base model optimized for chat purposes using supervised fine-tuning (SFT) and direct preference optimization (DPO). diff --git a/docs/source/en/model_doc/mobilenet_v1.md b/docs/source/en/model_doc/mobilenet_v1.md index 809be7f652a..eea159bdd73 100644 --- a/docs/source/en/model_doc/mobilenet_v1.md +++ b/docs/source/en/model_doc/mobilenet_v1.md @@ -85,10 +85,10 @@ print(f"The predicted class label is: {predicted_class_label}") ## Notes -- Checkpoint names follow the pattern `mobilenet_v1_{depth_multiplier}_{resolution}`, like `mobilenet_v1_1.0_224`. 
`1.0` is the depth multiplier and `224` is the image resolution. -- While trained on images of a specific sizes, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV1ImageProcessor`] handles the necessary preprocessing. -- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0). -- The original TensorFlow checkpoints determines the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV1Config`]. +- Checkpoint names follow the pattern `mobilenet_v1_{depth_multiplier}_{resolution}`, like `mobilenet_v1_1.0_224`. `1.0` is the depth multiplier and `224` is the image resolution. +- While trained on images of a specific sizes, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV1ImageProcessor`] handles the necessary preprocessing. +- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0). +- The original TensorFlow checkpoints determines the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV1Config`]. ```python from transformers import MobileNetV1Config @@ -96,11 +96,11 @@ print(f"The predicted class label is: {predicted_class_label}") config = MobileNetV1Config.from_pretrained("google/mobilenet_v1_1.0_224", tf_padding=True) ``` -- The Transformers implementation does not support the following features. - - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel. - - Does not support other `output_stride` values (fixed at 32). For smaller `output_strides`, the original implementation uses dilated convolution to prevent spatial resolution from being reduced further. (which would require dilated convolutions). - - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes. - - Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights. +- The Transformers implementation does not support the following features. + - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel. + - Does not support other `output_stride` values (fixed at 32). For smaller `output_strides`, the original implementation uses dilated convolution to prevent spatial resolution from being reduced further. (which would require dilated convolutions). + - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes. + - Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights. 
## MobileNetV1Config diff --git a/docs/source/en/model_doc/mobilenet_v2.md b/docs/source/en/model_doc/mobilenet_v2.md index 2039f9e4413..bf94454e438 100644 --- a/docs/source/en/model_doc/mobilenet_v2.md +++ b/docs/source/en/model_doc/mobilenet_v2.md @@ -82,11 +82,11 @@ print(f"The predicted class label is: {predicted_class_label}") ## Notes -- Classification checkpoint names follow the pattern `mobilenet_v2_{depth_multiplier}_{resolution}`, like `mobilenet_v2_1.4_224`. `1.4` is the depth multiplier and `224` is the image resolution. Segmentation checkpoint names follow the pattern `deeplabv3_mobilenet_v2_{depth_multiplier}_{resolution}`. -- While trained on images of a specific sizes, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV2ImageProcessor`] handles the necessary preprocessing. -- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0). -- The segmentation models use a [DeepLabV3+](https://huggingface.co/papers/1802.02611) head which is often pretrained on datasets like [PASCAL VOC](https://huggingface.co/datasets/merve/pascal-voc). -- The original TensorFlow checkpoints determines the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV2Config`]. +- Classification checkpoint names follow the pattern `mobilenet_v2_{depth_multiplier}_{resolution}`, like `mobilenet_v2_1.4_224`. `1.4` is the depth multiplier and `224` is the image resolution. Segmentation checkpoint names follow the pattern `deeplabv3_mobilenet_v2_{depth_multiplier}_{resolution}`. +- While trained on images of a specific sizes, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV2ImageProcessor`] handles the necessary preprocessing. +- MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0). +- The segmentation models use a [DeepLabV3+](https://huggingface.co/papers/1802.02611) head which is often pretrained on datasets like [PASCAL VOC](https://huggingface.co/datasets/merve/pascal-voc). +- The original TensorFlow checkpoints determines the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV2Config`]. ```python from transformers import MobileNetV2Config @@ -94,11 +94,11 @@ print(f"The predicted class label is: {predicted_class_label}") config = MobileNetV2Config.from_pretrained("google/mobilenet_v2_1.4_224", tf_padding=True) ``` -- The Transformers implementation does not support the following features. - - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel. - - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes. - - Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights. 
- - For segmentation models, the final convolution layer of the backbone is computed even though the DeepLabV3+ head doesn't use it. +- The Transformers implementation does not support the following features. + - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel. + - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes. + - Does not include the quantized models from the original checkpoints because they include "FakeQuantization" operations to unquantize the weights. + - For segmentation models, the final convolution layer of the backbone is computed even though the DeepLabV3+ head doesn't use it. ## MobileNetV2Config diff --git a/docs/source/en/model_doc/mobilevit.md b/docs/source/en/model_doc/mobilevit.md index 9975cf68155..ca0a35f6ece 100644 --- a/docs/source/en/model_doc/mobilevit.md +++ b/docs/source/en/model_doc/mobilevit.md @@ -28,6 +28,7 @@ Unless required by applicable law or agreed to in writing, software distributed You can find all the original MobileViT checkpoints under the [Apple](https://huggingface.co/apple/models?search=mobilevit) organization. > [!TIP] +> > - This model was contributed by [matthijs](https://huggingface.co/Matthijs). > > Click on the MobileViT models in the right sidebar for more examples of how to apply MobileViT to different vision tasks. diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index ff7b4bc8a15..885623b26e5 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -38,6 +38,7 @@ The abstract from the paper is the following: *We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaningβ€” such as emotion or non-speech soundsβ€” is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this β€œInner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. 
Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.* Moshi deals with 3 streams of information: + 1. The user's audio 2. Moshi's audio 3. Moshi's textual output @@ -70,6 +71,7 @@ The original checkpoints can be converted using the conversion script `src/trans ### How to use the model: This implementation has two main aims: + 1. quickly test model generation by simplifying the original API 2. simplify training. A training guide will come soon, but user contributions are welcomed! @@ -84,6 +86,7 @@ It is designed for intermediate use. We strongly recommend using the original [i Moshi is a streaming auto-regressive model with two streams of audio. To put it differently, one audio stream corresponds to what the model said/will say and the other audio stream corresponds to what the user said/will say. [`MoshiForConditionalGeneration.generate`] thus needs 3 inputs: + 1. `input_ids` - corresponding to the text token history 2. `moshi_input_values` or `moshi_audio_codes`- corresponding to the model audio history 3. `user_input_values` or `user_audio_codes` - corresponding to the user audio history @@ -91,6 +94,7 @@ Moshi is a streaming auto-regressive model with two streams of audio. To put it These three inputs must be synchronized. Meaning that their lengths must correspond to the same number of tokens. You can dynamically use the 3 inputs depending on what you want to test: + 1. Simply check the model response to an user prompt - in that case, `input_ids` can be filled with pad tokens and `user_input_values` can be a zero tensor of the same shape than the user prompt. 2. Test more complex behaviour - in that case, you must be careful about how the input tokens are synchronized with the audios. diff --git a/docs/source/en/model_doc/mra.md b/docs/source/en/model_doc/mra.md index ed11d1d9e04..422ed3cec51 100644 --- a/docs/source/en/model_doc/mra.md +++ b/docs/source/en/model_doc/mra.md @@ -64,4 +64,4 @@ The original code can be found [here](https://github.com/mlpen/mra-attention). ## MraForQuestionAnswering [[autodoc]] MraForQuestionAnswering - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/musicgen.md b/docs/source/en/model_doc/musicgen.md index c7c5efbc6e0..cda41e0df2a 100644 --- a/docs/source/en/model_doc/musicgen.md +++ b/docs/source/en/model_doc/musicgen.md @@ -230,6 +230,7 @@ generation config. ## Model Structure The MusicGen model can be de-composed into three distinct stages: + 1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5 2. MusicGen decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations 3. Audio encoder/decoder: used to encode an audio prompt to use as prompt tokens, and recover the audio waveform from the audio tokens predicted by the decoder @@ -256,6 +257,7 @@ be combined with the frozen text encoder and audio encoder/decoders to recover t model. Tips: + * MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model. 
* Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenForConditionalGeneration.generate`] diff --git a/docs/source/en/model_doc/musicgen_melody.md b/docs/source/en/model_doc/musicgen_melody.md index ff670ef8529..caf8cdd739d 100644 --- a/docs/source/en/model_doc/musicgen_melody.md +++ b/docs/source/en/model_doc/musicgen_melody.md @@ -40,6 +40,7 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o ## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen) There are two key differences with MusicGen: + 1. The audio prompt is used here as a conditional signal for the generated audio sample, whereas it's used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen). 2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen. @@ -224,6 +225,7 @@ Note that any arguments passed to the generate method will **supersede** those i ## Model Structure The MusicGen model can be de-composed into three distinct stages: + 1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5. 2. MusicGen Melody decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations 3. Audio decoder: used to recover the audio waveform from the audio tokens predicted by the decoder. @@ -253,6 +255,7 @@ python src/transformers/models/musicgen_melody/convert_musicgen_melody_transform ``` Tips: + * MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model. * Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenMelodyForConditionalGeneration.generate`] diff --git a/docs/source/en/model_doc/nat.md b/docs/source/en/model_doc/nat.md index dadcae6f17f..36662173f2f 100644 --- a/docs/source/en/model_doc/nat.md +++ b/docs/source/en/model_doc/nat.md @@ -68,6 +68,7 @@ The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, widt `(batch_size, height, width, num_channels)`. Notes: + - NAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention. You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`. diff --git a/docs/source/en/model_doc/nllb.md b/docs/source/en/model_doc/nllb.md index 77fffafde67..f44c03dcfdd 100644 --- a/docs/source/en/model_doc/nllb.md +++ b/docs/source/en/model_doc/nllb.md @@ -128,9 +128,9 @@ visualizer("UN Chief says there is no military solution in Syria") >>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True) ``` - - For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword as shown below. +- For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword as shown below. 
- - See example below for a translation from Romanian to German. +- See example below for a translation from Romanian to German. ```python >>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer diff --git a/docs/source/en/model_doc/oneformer.md b/docs/source/en/model_doc/oneformer.md index c4b3bd142fe..7f5d32bc55a 100644 --- a/docs/source/en/model_doc/oneformer.md +++ b/docs/source/en/model_doc/oneformer.md @@ -39,7 +39,7 @@ This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3 ## Usage tips -- OneFormer requires two inputs during inference: *image* and *task token*. +- OneFormer requires two inputs during inference: *image* and *task token*. - During training, OneFormer only uses panoptic annotations. - If you want to train the model in a distributed environment across multiple nodes, then one should update the `get_num_masks` function inside in the `OneFormerLoss` class of `modeling_oneformer.py`. When training on multiple nodes, this should be diff --git a/docs/source/en/model_doc/openai-gpt.md b/docs/source/en/model_doc/openai-gpt.md index fba08ceca00..04d37d89cc4 100644 --- a/docs/source/en/model_doc/openai-gpt.md +++ b/docs/source/en/model_doc/openai-gpt.md @@ -84,22 +84,22 @@ echo -e "The future of AI is" | transformers run --task text-generation --model ## OpenAIGPTModel [[autodoc]] OpenAIGPTModel -- forward + - forward ## OpenAIGPTLMHeadModel [[autodoc]] OpenAIGPTLMHeadModel -- forward + - forward ## OpenAIGPTDoubleHeadsModel [[autodoc]] OpenAIGPTDoubleHeadsModel -- forward + - forward ## OpenAIGPTForSequenceClassification [[autodoc]] OpenAIGPTForSequenceClassification -- forward + - forward ## OpenAIGPTTokenizer diff --git a/docs/source/en/model_doc/parakeet.md b/docs/source/en/model_doc/parakeet.md index c99473ac1a7..4cb72e7e458 100644 --- a/docs/source/en/model_doc/parakeet.md +++ b/docs/source/en/model_doc/parakeet.md @@ -27,12 +27,13 @@ rendered properly in your Markdown viewer. Parakeet models, [introduced by NVIDIA NeMo](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/), are models that combine a [Fast Conformer](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) encoder with connectionist temporal classification (CTC), recurrent neural network transducer (RNNT) or token and duration transducer (TDT) decoder for automatic speech recognition. **Model Architecture** + - **Fast Conformer Encoder**: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. This is more efficient version of the Conformer Encoder found in [FastSpeech2Conformer](./fastspeech2_conformer.md) (see [`ParakeetEncoder`] for the encoder implementation and details). - [**ParakeetForCTC**](#parakeetforctc): a Fast Conformer Encoder + a CTC decoder - - **CTC Decoder**: Simple but effective decoder consisting of: - - 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility). - - CTC loss computation for training. - - Greedy CTC decoding for inference. + - **CTC Decoder**: Simple but effective decoder consisting of: + - 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility). + - CTC loss computation for training. + - Greedy CTC decoding for inference. The original implementation can be found in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). 
Model checkpoints are to be found under [the NVIDIA organization](https://huggingface.co/nvidia/models?search=parakeet). @@ -189,7 +190,7 @@ outputs.loss.backward() ## ParakeetTokenizerFast -[[autodoc]] ParakeetTokenizerFast +[[autodoc]] ParakeetTokenizerFast ## ParakeetFeatureExtractor @@ -205,11 +206,11 @@ outputs.loss.backward() ## ParakeetEncoderConfig -[[autodoc]] ParakeetEncoderConfig +[[autodoc]] ParakeetEncoderConfig ## ParakeetCTCConfig -[[autodoc]] ParakeetCTCConfig +[[autodoc]] ParakeetCTCConfig ## ParakeetEncoder @@ -218,4 +219,3 @@ outputs.loss.backward() ## ParakeetForCTC [[autodoc]] ParakeetForCTC - diff --git a/docs/source/en/model_doc/patchtsmixer.md b/docs/source/en/model_doc/patchtsmixer.md index 23ebb89b6ad..4a9ddef4641 100644 --- a/docs/source/en/model_doc/patchtsmixer.md +++ b/docs/source/en/model_doc/patchtsmixer.md @@ -89,4 +89,4 @@ The model can also be used for time series classification and time series regres ## PatchTSMixerForRegression [[autodoc]] PatchTSMixerForRegression - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/phimoe.md b/docs/source/en/model_doc/phimoe.md index 609f56c488b..7394e26b5b9 100644 --- a/docs/source/en/model_doc/phimoe.md +++ b/docs/source/en/model_doc/phimoe.md @@ -45,6 +45,7 @@ The original code for PhiMoE can be found [here](https://huggingface.co/microsof Phi-3.5-MoE-instruct has been integrated in the development version (4.44.2.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing the following: + * When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function. The current `transformers` version can be verified with: `pip list | grep transformers`. diff --git a/docs/source/en/model_doc/pix2struct.md b/docs/source/en/model_doc/pix2struct.md index c43c9b3b92e..412d2c2fef9 100644 --- a/docs/source/en/model_doc/pix2struct.md +++ b/docs/source/en/model_doc/pix2struct.md @@ -79,4 +79,4 @@ The original code can be found [here](https://github.com/google-research/pix2str ## Pix2StructForConditionalGeneration [[autodoc]] Pix2StructForConditionalGeneration - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/plbart.md b/docs/source/en/model_doc/plbart.md index d8ce330cb0f..b3459299437 100644 --- a/docs/source/en/model_doc/plbart.md +++ b/docs/source/en/model_doc/plbart.md @@ -120,4 +120,4 @@ it's passed with the `text_target` keyword argument. ## PLBartForCausalLM [[autodoc]] PLBartForCausalLM - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/pop2piano.md b/docs/source/en/model_doc/pop2piano.md index 90e0cd3f063..c934d878903 100644 --- a/docs/source/en/model_doc/pop2piano.md +++ b/docs/source/en/model_doc/pop2piano.md @@ -59,6 +59,7 @@ pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy ``` Please note that you may need to restart your runtime after installation. + * Pop2Piano is an Encoder-Decoder based model like T5. * Pop2Piano can be used to generate midi-audio files for a given audio sequence. * Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to variety of different results. 
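As a quick illustration of the Pop2Piano tips above, here is a minimal sketch of turning a pop audio clip into a piano MIDI arrangement with a chosen composer token. The `sweetcocoa/pop2piano` checkpoint, the placeholder `song.mp3` path, and the processor's MIDI decoding helper are assumptions for illustration, not a definitive recipe.

```python
# Minimal sketch: audio in, pretty_midi arrangement out (checkpoint and file names are assumptions).
import librosa
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

# Load any pop song as a raw waveform; the feature extractor needs the sampling rate as well.
audio, sr = librosa.load("song.mp3", sr=44100)
inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")

# Different composer tokens (e.g. "composer1", "composer2", ...) lead to different arrangements,
# as noted in the tips above.
generated_ids = model.generate(input_features=inputs["input_features"], composer="composer1")

# Decode the generated token ids back into a pretty_midi object and save it to disk.
midi = processor.batch_decode(token_ids=generated_ids, feature_extractor_output=inputs)["pretty_midi_objects"][0]
midi.write("arrangement.mid")
```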
diff --git a/docs/source/en/model_doc/prompt_depth_anything.md b/docs/source/en/model_doc/prompt_depth_anything.md index 0ac26609b4d..d4b6f4cc259 100644 --- a/docs/source/en/model_doc/prompt_depth_anything.md +++ b/docs/source/en/model_doc/prompt_depth_anything.md @@ -99,4 +99,4 @@ If you are interested in submitting a resource to be included here, please feel [[autodoc]] PromptDepthAnythingImageProcessorFast - preprocess - - post_process_depth_estimation \ No newline at end of file + - post_process_depth_estimation diff --git a/docs/source/en/model_doc/qwen3_next.md b/docs/source/en/model_doc/qwen3_next.md index 73793413609..62b52e3d6d5 100644 --- a/docs/source/en/model_doc/qwen3_next.md +++ b/docs/source/en/model_doc/qwen3_next.md @@ -19,6 +19,7 @@ rendered properly in your Markdown viewer. The Qwen3-Next series represents our next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency. The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost: + - **Hybrid Attention**: Replaces standard attention with the combination of **Gated DeltaNet** and **Gated Attention**, enabling efficient context modeling. - **High-Sparsity MoE**: Achieves an extreme low activation ratio as 1:50 in MoE layers β€” drastically reducing FLOPs per token while preserving model capacity. - **Multi-Token Prediction(MTP)**: Boosts pretraining model performance, and accelerates inference. diff --git a/docs/source/en/model_doc/rwkv.md b/docs/source/en/model_doc/rwkv.md index c0bd1273f61..9b5d64fedbb 100644 --- a/docs/source/en/model_doc/rwkv.md +++ b/docs/source/en/model_doc/rwkv.md @@ -152,4 +152,4 @@ $$D_{i} = e^{u + K_{i} - q} + e^{M_{i}} \tilde{D}_{i} \hbox{ where } q = \max( which finally gives us -$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$ \ No newline at end of file +$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$ diff --git a/docs/source/en/model_doc/seamless_m4t_v2.md b/docs/source/en/model_doc/seamless_m4t_v2.md index 7b2111e62f2..4a32199243a 100644 --- a/docs/source/en/model_doc/seamless_m4t_v2.md +++ b/docs/source/en/model_doc/seamless_m4t_v2.md @@ -139,6 +139,7 @@ The architecture of this new version differs from the first in a few aspects: #### Improvements on the second-pass model The second seq2seq model, named text-to-unit model, is now non-auto regressive, meaning that it computes units in a **single forward pass**. This achievement is made possible by: + - the use of **character-level embeddings**, meaning that each character of the predicted translated text has its own embeddings, which are then used to predict the unit tokens. - the use of an intermediate duration predictor, that predicts speech duration at the **character-level** on the predicted translated text. - the use of a new text-to-unit decoder mixing convolutions and self-attention to handle longer context. @@ -146,6 +147,7 @@ The second seq2seq model, named text-to-unit model, is now non-auto regressive, #### Difference in the speech encoder The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms: + - the use of chunked attention mask to prevent attention across chunks, ensuring that each position attends only to positions within its own chunk and a fixed number of previous chunks. 
- the use of relative position embeddings which only considers distance between sequence elements rather than absolute positions. Please refer to [Self-Attentionwith Relative Position Representations (Shaw et al.)](https://huggingface.co/papers/1803.02155) for more details. - the use of a causal depth-wise convolution instead of a non-causal one. diff --git a/docs/source/en/model_doc/seggpt.md b/docs/source/en/model_doc/seggpt.md index a5568d5c80e..356b0f7abcf 100644 --- a/docs/source/en/model_doc/seggpt.md +++ b/docs/source/en/model_doc/seggpt.md @@ -30,6 +30,7 @@ The abstract from the paper is the following: *We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. Our results show strong capabilities in segmenting in-domain and out-of* Tips: + - One can use [`SegGptImageProcessor`] to prepare image input, prompt and mask to the model. - One can either use segmentation maps or RGB images as prompt masks. If using the latter make sure to set `do_convert_rgb=False` in the `preprocess` method. - It's highly advisable to pass `num_labels` when using `segmentation_maps` (not considering background) during preprocessing and postprocessing with [`SegGptImageProcessor`] for your use case. diff --git a/docs/source/en/model_doc/shieldgemma2.md b/docs/source/en/model_doc/shieldgemma2.md index 871cdd31db7..6a67c2d61b5 100644 --- a/docs/source/en/model_doc/shieldgemma2.md +++ b/docs/source/en/model_doc/shieldgemma2.md @@ -22,9 +22,9 @@ rendered properly in your Markdown viewer. The ShieldGemma 2 model was proposed in a [technical report](https://huggingface.co/papers/2504.01081) by Google. ShieldGemma 2, built on [Gemma 3](https://ai.google.dev/gemma/docs/core/model_card_3), is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across key areas of harm as defined below: -- No Sexually Explicit content: The image shall not contain content that depicts explicit or graphic sexual acts (e.g., pornography, erotic nudity, depictions of rape or sexual assault). -- No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide). -- No Violence/Gore content: The image shall not contain content that depicts shocking, sensational, or gratuitous violence (e.g., excessive blood and gore, gratuitous violence against animals, extreme injury or moment of death). 
+- No Sexually Explicit content: The image shall not contain content that depicts explicit or graphic sexual acts (e.g., pornography, erotic nudity, depictions of rape or sexual assault). +- No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide). +- No Violence/Gore content: The image shall not contain content that depicts shocking, sensational, or gratuitous violence (e.g., excessive blood and gore, gratuitous violence against animals, extreme injury or moment of death). We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance. diff --git a/docs/source/en/model_doc/superglue.md b/docs/source/en/model_doc/superglue.md index d25ca822e4c..2b65da80def 100644 --- a/docs/source/en/model_doc/superglue.md +++ b/docs/source/en/model_doc/superglue.md @@ -143,13 +143,11 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size ## SuperGlueImageProcessor [[autodoc]] SuperGlueImageProcessor - -- preprocess -- post_process_keypoint_matching -- visualize_keypoint_matching + - preprocess + - post_process_keypoint_matching + - visualize_keypoint_matching ## SuperGlueForKeypointMatching [[autodoc]] SuperGlueForKeypointMatching - -- forward + - forward diff --git a/docs/source/en/model_doc/superpoint.md b/docs/source/en/model_doc/superpoint.md index 26ffb2c8b4b..3efd5ecf90f 100644 --- a/docs/source/en/model_doc/superpoint.md +++ b/docs/source/en/model_doc/superpoint.md @@ -129,16 +129,15 @@ processed_outputs = processor.post_process_keypoint_detection(outputs, [image_si ## SuperPointImageProcessor [[autodoc]] SuperPointImageProcessor - -- preprocess + - preprocess ## SuperPointImageProcessorFast [[autodoc]] SuperPointImageProcessorFast -- preprocess -- post_process_keypoint_detection + - preprocess + - post_process_keypoint_detection ## SuperPointForKeypointDetection [[autodoc]] SuperPointForKeypointDetection -- forward + - forward diff --git a/docs/source/en/model_doc/tapas.md b/docs/source/en/model_doc/tapas.md index c5144121df6..09c624c7fb7 100644 --- a/docs/source/en/model_doc/tapas.md +++ b/docs/source/en/model_doc/tapas.md @@ -30,6 +30,7 @@ token types that encode tabular structure. TAPAS is pre-trained on the masked la millions of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads on top: a cell selection head and an aggregation head, for (optionally) performing aggregations (such as counting or summing) among selected cells. TAPAS has been fine-tuned on several datasets: + - [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253) (Sequential Question Answering by Microsoft) - [WTQ](https://github.com/ppasupat/WikiTableQuestions) (Wiki Table Questions by Stanford University) - [WikiSQL](https://github.com/salesforce/WikiSQL) (by Salesforce). diff --git a/docs/source/en/model_doc/tapex.md b/docs/source/en/model_doc/tapex.md index 0a10826ee1a..606d8940c4e 100644 --- a/docs/source/en/model_doc/tapex.md +++ b/docs/source/en/model_doc/tapex.md @@ -37,6 +37,7 @@ Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. 
TAPE which it can be fine-tuned to answer natural language questions related to tabular data, as well as performing table fact checking. TAPEX has been fine-tuned on several datasets: + - [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253) (Sequential Question Answering by Microsoft) - [WTQ](https://github.com/ppasupat/WikiTableQuestions) (Wiki Table Questions by Stanford University) - [WikiSQL](https://github.com/salesforce/WikiSQL) (by Salesforce) diff --git a/docs/source/en/model_doc/time_series_transformer.md b/docs/source/en/model_doc/time_series_transformer.md index 921b7e01d4b..36a68af80ca 100644 --- a/docs/source/en/model_doc/time_series_transformer.md +++ b/docs/source/en/model_doc/time_series_transformer.md @@ -35,16 +35,16 @@ point forecasting model. This means that the model learns a distribution, from w and a decoder, which predicts a `prediction_length` of time series values into the future (called `future_values`). During training, one needs to provide pairs of (`past_values` and `future_values`) to the model. - In addition to the raw (`past_values` and `future_values`), one typically provides additional features to the model. These can be the following: - - `past_time_features`: temporal features which the model will add to `past_values`. These serve as "positional encodings" for the Transformer encoder. + - `past_time_features`: temporal features which the model will add to `past_values`. These serve as "positional encodings" for the Transformer encoder. Examples are "day of the month", "month of the year", etc. as scalar values (and then stacked together as a vector). e.g. if a given time-series value was obtained on the 11th of August, then one could have [11, 8] as time feature vector (11 being "day of the month", 8 being "month of the year"). - - `future_time_features`: temporal features which the model will add to `future_values`. These serve as "positional encodings" for the Transformer decoder. + - `future_time_features`: temporal features which the model will add to `future_values`. These serve as "positional encodings" for the Transformer decoder. Examples are "day of the month", "month of the year", etc. as scalar values (and then stacked together as a vector). e.g. if a given time-series value was obtained on the 11th of August, then one could have [11, 8] as time feature vector (11 being "day of the month", 8 being "month of the year"). - - `static_categorical_features`: categorical features which are static over time (i.e., have the same value for all `past_values` and `future_values`). + - `static_categorical_features`: categorical features which are static over time (i.e., have the same value for all `past_values` and `future_values`). An example here is the store ID or region ID that identifies a given time-series. Note that these features need to be known for ALL data points (also those in the future). - - `static_real_features`: real-valued features which are static over time (i.e., have the same value for all `past_values` and `future_values`). + - `static_real_features`: real-valued features which are static over time (i.e., have the same value for all `past_values` and `future_values`). An example here is the image representation of the product for which you have the time-series values (like the [ResNet](resnet) embedding of a "shoe" picture, if your time-series is about the sales of shoes). Note that these features need to be known for ALL data points (also those in the future). 
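To make the input categories listed above concrete, here is a hedged sketch of a training-style forward pass through [`TimeSeriesTransformerForPrediction`], using a pre-batched sample; the `huggingface/time-series-transformer-tourism-monthly` checkpoint and the `hf-internal-testing/tourism-monthly-batch` file are assumed public example artifacts.

```python
# Hedged sketch: one forward pass with past_*, future_*, and static_* inputs
# (checkpoint and dataset names below are assumptions based on public example data).
import torch
from huggingface_hub import hf_hub_download
from transformers import TimeSeriesTransformerForPrediction

file = hf_hub_download(
    repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
)
batch = torch.load(file)

model = TimeSeriesTransformerForPrediction.from_pretrained(
    "huggingface/time-series-transformer-tourism-monthly"
)

# past_* tensors describe the observed context window, future_* tensors the prediction window,
# and the static features are constant per time series, matching the list above.
outputs = model(
    past_values=batch["past_values"],
    past_time_features=batch["past_time_features"],
    past_observed_mask=batch["past_observed_mask"],
    static_categorical_features=batch["static_categorical_features"],
    static_real_features=batch["static_real_features"],
    future_values=batch["future_values"],
    future_time_features=batch["future_time_features"],
)
loss = outputs.loss  # negative log-likelihood of the predicted distribution over future_values
```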
diff --git a/docs/source/en/model_doc/timesformer.md b/docs/source/en/model_doc/timesformer.md index 59e9ee71817..1d87158d72e 100644 --- a/docs/source/en/model_doc/timesformer.md +++ b/docs/source/en/model_doc/timesformer.md @@ -54,4 +54,4 @@ the number of input frames per clip changes based on the model size so you shoul ## TimesformerForVideoClassification [[autodoc]] TimesformerForVideoClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/udop.md b/docs/source/en/model_doc/udop.md index eb400cc39d5..cc370accf3e 100644 --- a/docs/source/en/model_doc/udop.md +++ b/docs/source/en/model_doc/udop.md @@ -115,4 +115,4 @@ to fine-tune UDOP on a custom dataset as well as inference. 🌎 ## UdopEncoderModel [[autodoc]] UdopEncoderModel - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/univnet.md b/docs/source/en/model_doc/univnet.md index 7a580692833..4329846ab7f 100644 --- a/docs/source/en/model_doc/univnet.md +++ b/docs/source/en/model_doc/univnet.md @@ -81,4 +81,4 @@ To the best of my knowledge, there is no official code release, but an unofficia ## UnivNetModel [[autodoc]] UnivNetModel - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/upernet.md b/docs/source/en/model_doc/upernet.md index 2c2e50fc560..900b5635fc1 100644 --- a/docs/source/en/model_doc/upernet.md +++ b/docs/source/en/model_doc/upernet.md @@ -81,4 +81,4 @@ If you're interested in submitting a resource to be included here, please feel f ## UperNetForSemanticSegmentation [[autodoc]] UperNetForSemanticSegmentation - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/videomae.md b/docs/source/en/model_doc/videomae.md index 590011c7345..eb02fc48bb4 100644 --- a/docs/source/en/model_doc/videomae.md +++ b/docs/source/en/model_doc/videomae.md @@ -75,6 +75,7 @@ you're interested in submitting a resource to be included here, please feel free review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. **Video classification** + - [A notebook](https://github.com/huggingface/notebooks/blob/main/examples/video_classification.ipynb) that shows how to fine-tune a VideoMAE model on a custom dataset. - [Video classification task guide](../tasks/video_classification) diff --git a/docs/source/en/model_doc/vit_mae.md b/docs/source/en/model_doc/vit_mae.md index 1099019a842..0547594ae11 100644 --- a/docs/source/en/model_doc/vit_mae.md +++ b/docs/source/en/model_doc/vit_mae.md @@ -66,6 +66,7 @@ reconstruction = outputs.logits ## Notes + - ViTMAE is typically used in two stages. Self-supervised pretraining with [`ViTMAEForPreTraining`], and then discarding the decoder and fine-tuning the encoder. After fine-tuning, the weights can be plugged into a model like [`ViTForImageClassification`]. - Use [`ViTImageProcessor`] for input preparation. 
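The two-stage ViTMAE recipe noted above can be sketched as follows. This is a hedged sketch: the `facebook/vit-mae-base` checkpoint and the `num_labels` value are illustrative assumptions, and only the shared encoder weights carry over, so expect warnings about the unused MAE decoder and a freshly initialized classification head.

```python
# Hedged sketch of the two-stage ViTMAE workflow (checkpoint name and label count are assumptions).
from transformers import ViTMAEForPreTraining, ViTForImageClassification

# Stage 1: self-supervised masked-autoencoder pretraining, or simply start from a released checkpoint.
pretrained = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

# Stage 2: reuse the encoder weights in a classification model; the MAE decoder is discarded and a
# new classifier head is initialized, which is then fine-tuned on labeled data.
classifier = ViTForImageClassification.from_pretrained("facebook/vit-mae-base", num_labels=10)
```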
diff --git a/docs/source/en/model_doc/vitdet.md b/docs/source/en/model_doc/vitdet.md index 539ae5e376c..a1250f1bb90 100644 --- a/docs/source/en/model_doc/vitdet.md +++ b/docs/source/en/model_doc/vitdet.md @@ -40,4 +40,4 @@ Tips: ## VitDetModel [[autodoc]] VitDetModel - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/vitmatte.md b/docs/source/en/model_doc/vitmatte.md index 519a2dd74d6..0584df8e67a 100644 --- a/docs/source/en/model_doc/vitmatte.md +++ b/docs/source/en/model_doc/vitmatte.md @@ -62,4 +62,4 @@ The model expects both the image and trimap (concatenated) as input. Use [`ViTMa ## VitMatteForImageMatting [[autodoc]] VitMatteForImageMatting - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/vits.md b/docs/source/en/model_doc/vits.md index 664edcb92ae..96dc9389247 100644 --- a/docs/source/en/model_doc/vits.md +++ b/docs/source/en/model_doc/vits.md @@ -149,10 +149,10 @@ Audio(waveform, rate=model.config.sampling_rate) ## VitsTokenizer [[autodoc]] VitsTokenizer -- __call__ -- save_vocabulary + - __call__ + - save_vocabulary ## VitsModel [[autodoc]] VitsModel -- forward + - forward diff --git a/docs/source/en/model_doc/voxtral.md b/docs/source/en/model_doc/voxtral.md index 56fc84d30d0..3dd2fc9e0d3 100644 --- a/docs/source/en/model_doc/voxtral.md +++ b/docs/source/en/model_doc/voxtral.md @@ -22,6 +22,7 @@ Voxtral is an upgrade of [Ministral 3B and Mistral Small 3B](https://mistral.ai/ You can read more in Mistral's [realease blog post](https://mistral.ai/news/voxtral). The model is available in two checkpoints: + - 3B: [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) - 24B: [mistralai/Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) diff --git a/docs/source/en/model_doc/wav2vec2_phoneme.md b/docs/source/en/model_doc/wav2vec2_phoneme.md index c2621f8924c..206ea048c02 100644 --- a/docs/source/en/model_doc/wav2vec2_phoneme.md +++ b/docs/source/en/model_doc/wav2vec2_phoneme.md @@ -63,7 +63,7 @@ except for the tokenizer. ## Wav2Vec2PhonemeCTCTokenizer [[autodoc]] Wav2Vec2PhonemeCTCTokenizer - - __call__ - - batch_decode - - decode - - phonemize + - __call__ + - batch_decode + - decode + - phonemize diff --git a/docs/source/en/model_doc/xcodec.md b/docs/source/en/model_doc/xcodec.md index ca6d6e473fc..957a7409348 100644 --- a/docs/source/en/model_doc/xcodec.md +++ b/docs/source/en/model_doc/xcodec.md @@ -36,6 +36,7 @@ The abstract of the paper states the following: *Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. 
To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.* Model cards: + - [xcodec-hubert-librispeech](https://huggingface.co/hf-audio/xcodec-hubert-librispeech) (for speech) - [xcodec-wavlm-mls](https://huggingface.co/hf-audio/xcodec-wavlm-mls) (for speech) - [xcodec-wavlm-more-data](https://huggingface.co/hf-audio/xcodec-wavlm-more-data) (for speech) @@ -97,4 +98,4 @@ sf.write("reconstruction.wav", reconstruction.T, sampling_rate) [[autodoc]] XcodecModel - decode - encode - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/xmod.md b/docs/source/en/model_doc/xmod.md index 0593e9940bd..624b7ebb2d2 100644 --- a/docs/source/en/model_doc/xmod.md +++ b/docs/source/en/model_doc/xmod.md @@ -36,6 +36,7 @@ The original code can be found [here](https://github.com/facebookresearch/fairse ## Usage tips Tips: + - X-MOD is similar to [XLM-R](xlm-roberta), but a difference is that the input language needs to be specified so that the correct language adapter can be activated. - The main models – base and large – have adapters for 81 languages. @@ -44,6 +45,7 @@ Tips: ### Input language There are two ways to specify the input language: + 1. By setting a default language before using the model: ```python diff --git a/docs/source/en/model_doc/yolos.md b/docs/source/en/model_doc/yolos.md index 666f9674332..4a75b2ed020 100644 --- a/docs/source/en/model_doc/yolos.md +++ b/docs/source/en/model_doc/yolos.md @@ -97,6 +97,7 @@ for score, label, box in zip(filtered_scores, filtered_labels, pixel_boxes): ## Notes + - Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](./detr), YOLOS doesn't require a `pixel_mask`. 
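A minimal inference sketch, assuming the `hustvl/yolos-tiny` checkpoint, showing that the processor returns only `pixel_values` and no `pixel_mask`:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

processor = AutoImageProcessor.from_pretrained("hustvl/yolos-tiny")
model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-tiny")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
print(inputs.keys())  # only pixel_values, no pixel_mask is needed for YOLOS

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes to absolute coordinates on the original image
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=torch.tensor([image.size[::-1]])
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```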
## Resources diff --git a/docs/source/en/model_doc/yoso.md b/docs/source/en/model_doc/yoso.md index 8e121dd88cd..211b0dcf809 100644 --- a/docs/source/en/model_doc/yoso.md +++ b/docs/source/en/model_doc/yoso.md @@ -99,4 +99,4 @@ alt="drawing" width="600"/> ## YosoForQuestionAnswering [[autodoc]] YosoForQuestionAnswering - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/zamba.md b/docs/source/en/model_doc/zamba.md index 635bc76fb0c..847f0532e2a 100644 --- a/docs/source/en/model_doc/zamba.md +++ b/docs/source/en/model_doc/zamba.md @@ -69,6 +69,7 @@ print(tokenizer.decode(outputs[0])) ## Model card The model cards can be found at: + * [Zamba-7B](https://huggingface.co/Zyphra/Zamba-7B-v1) ## Issues diff --git a/docs/source/en/model_doc/zamba2.md b/docs/source/en/model_doc/zamba2.md index 7296ef1b250..c9d3d3d1de7 100644 --- a/docs/source/en/model_doc/zamba2.md +++ b/docs/source/en/model_doc/zamba2.md @@ -61,6 +61,7 @@ print(tokenizer.decode(outputs[0])) ## Model card The model cards can be found at: + * [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) * [Zamba2-2.7B](https://huggingface.co/Zyphra/Zamba2-2.7B) * [Zamba2-7B](https://huggingface.co/Zyphra/Zamba2-7B) diff --git a/docs/source/en/model_doc/zoedepth.md b/docs/source/en/model_doc/zoedepth.md index 5252d2b4d36..92840a77046 100644 --- a/docs/source/en/model_doc/zoedepth.md +++ b/docs/source/en/model_doc/zoedepth.md @@ -109,6 +109,7 @@ Image.fromarray(depth.astype("uint8")) ``` ## Resources + - Refer to this [notebook](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ZoeDepth) for an inference example. ## ZoeDepthConfig diff --git a/docs/source/en/open_webui.md b/docs/source/en/open_webui.md index 9042131631e..2946fc95f14 100644 --- a/docs/source/en/open_webui.md +++ b/docs/source/en/open_webui.md @@ -9,6 +9,7 @@ transformers serve --enable-cors ``` Before you can speak into Open WebUI, you need to update its settings to use your server for speech to text (STT) tasks. Launch Open WebUI, and navigate to the audio tab inside the admin settings. If you're using Open WebUI with the default ports, [this link (default)](http://localhost:3000/admin/settings/audio) or [this link (python deployment)](http://localhost:8080/admin/settings/audio) will take you there. Do the following changes there: + 1. Change the type of "Speech-to-Text Engine" to "OpenAI"; 2. Update the address to your server's address -- `http://localhost:8000/v1` by default; 3. Type your model of choice into the "STT Model" field, e.g. `openai/whisper-large-v3` ([available models](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending)). diff --git a/docs/source/en/pad_truncation.md b/docs/source/en/pad_truncation.md index 345f86283d1..45b2509e86d 100644 --- a/docs/source/en/pad_truncation.md +++ b/docs/source/en/pad_truncation.md @@ -22,25 +22,25 @@ In most cases, padding your batch to the length of the longest sequence and trun The `padding` argument controls padding. It can be a boolean or a string: - - `True` or `'longest'`: pad to the longest sequence in the batch (no padding is applied if you only provide +- `True` or `'longest'`: pad to the longest sequence in the batch (no padding is applied if you only provide a single sequence). 
- - `'max_length'`: pad to a length specified by the `max_length` argument or the maximum length accepted +- `'max_length'`: pad to a length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). Padding will still be applied if you only provide a single sequence. - - `False` or `'do_not_pad'`: no padding is applied. This is the default behavior. +- `False` or `'do_not_pad'`: no padding is applied. This is the default behavior. The `truncation` argument controls truncation. It can be a boolean or a string: - - `True` or `'longest_first'`: truncate to a maximum length specified by the `max_length` argument or +- `True` or `'longest_first'`: truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached. - - `'only_second'`: truncate to a maximum length specified by the `max_length` argument or the maximum +- `'only_second'`: truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided. - - `'only_first'`: truncate to a maximum length specified by the `max_length` argument or the maximum +- `'only_first'`: truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided. - - `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior. +- `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior. The `max_length` argument controls the length of the padding and truncation. It can be an integer or `None`, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation or padding to `max_length` is deactivated. diff --git a/docs/source/en/philosophy.md b/docs/source/en/philosophy.md index 7cfa46458b7..e98b1fa57bd 100644 --- a/docs/source/en/philosophy.md +++ b/docs/source/en/philosophy.md @@ -26,24 +26,24 @@ The library was designed with two strong goals in mind: 1. Be as easy and fast to use as possible: - - We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, +- We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, just three standard classes required to use each model: [configuration](main_classes/configuration), [models](main_classes/model), and a preprocessing class ([tokenizer](main_classes/tokenizer) for NLP, [image processor](main_classes/image_processor) for vision, [feature extractor](main_classes/feature_extractor) for audio, and [processor](main_classes/processors) for multimodal inputs). 
- - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common +- All of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` method which downloads (if needed), caches and loads the related class instance and associated data (configurations' hyperparameters, tokenizers' vocabulary, and models' weights) from a pretrained checkpoint provided on [Hugging Face Hub](https://huggingface.co/models) or your own saved checkpoint. - - On top of those three base classes, the library provides two APIs: [`pipeline`] for quickly +- On top of those three base classes, the library provides two APIs: [`pipeline`] for quickly using a model for inference on a given task and [`Trainer`] to quickly train or fine-tune a PyTorch model. - - As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to +- As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend or build upon the library, just use regular Python or PyTorch and inherit from the base classes of the library to reuse functionalities like model loading and saving. If you'd like to learn more about our coding philosophy for models, check out our [Repeat Yourself](https://huggingface.co/blog/transformers-design-philosophy) blog post. 2. Provide state-of-the-art models with performances as close as possible to the original models: - - We provide at least one example for each architecture which reproduces a result provided by the official authors +- We provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture. - - The code is usually as close to the original code base as possible which means some PyTorch code may be not as +- The code is usually as close to the original code base as possible which means some PyTorch code may be not as *pytorchic* as it could be as a result of being converted from other Deep Learning frameworks. A few other goals: diff --git a/docs/source/en/pipeline_gradio.md b/docs/source/en/pipeline_gradio.md index 0cd65665d33..b53bcc8bd18 100644 --- a/docs/source/en/pipeline_gradio.md +++ b/docs/source/en/pipeline_gradio.md @@ -45,8 +45,8 @@ gr.Interface.from_pipeline(pipeline).launch(share=True) The Space below is created with the code above and hosted on Spaces. diff --git a/docs/source/en/pr_checks.md b/docs/source/en/pr_checks.md index 7056adf2149..5fdbbbab05b 100644 --- a/docs/source/en/pr_checks.md +++ b/docs/source/en/pr_checks.md @@ -21,6 +21,7 @@ rendered properly in your Markdown viewer. # Checks on a Pull Request When you open a pull request on πŸ€— Transformers, a fair number of checks will be run to make sure the patch you are adding is not breaking anything existing. 
Those checks are of four types: + - regular tests - documentation build - code and documentation style @@ -194,6 +195,7 @@ Another way when the patterns are just different casings of the same replacement ``` In this case, the code is copied from `BertForSequenceClassification` by replacing: + - `Bert` by `MobileBert` (for instance when using `MobileBertModel` in the init) - `bert` by `mobilebert` (for instance when defining `self.mobilebert`) - `BERT` by `MOBILEBERT` (in the constant `MOBILEBERT_INPUTS_DOCSTRING`) diff --git a/docs/source/en/quantization/concept_guide.md b/docs/source/en/quantization/concept_guide.md index e9d3b451484..df3a2bdc6f2 100644 --- a/docs/source/en/quantization/concept_guide.md +++ b/docs/source/en/quantization/concept_guide.md @@ -20,9 +20,9 @@ Quantization reduces the memory footprint and computational cost of large machin Reducing a model's precision offers several significant benefits: -- Smaller model size: Lower-precision data types require less storage space. An int8 model, for example, is roughly 4 times smaller than its float32 counterpart. -- Faster inference: Operations on lower-precision data types, especially integers, can be significantly faster on compatible hardware (CPUs and GPUs often have specialized instructions for int8 operations). This leads to lower latency. -- Reduced energy consumption: Faster computations and smaller memory transfers often translate to lower power usage. +- Smaller model size: Lower-precision data types require less storage space. An int8 model, for example, is roughly 4 times smaller than its float32 counterpart. +- Faster inference: Operations on lower-precision data types, especially integers, can be significantly faster on compatible hardware (CPUs and GPUs often have specialized instructions for int8 operations). This leads to lower latency. +- Reduced energy consumption: Faster computations and smaller memory transfers often translate to lower power usage. The primary trade-off in quantization is *efficiency* vs. *accuracy*. Reducing precision saves resources but inevitably introduces small errors (quantization noise). The goal is to minimize this error using appropriate schemes (affine/symmetric), granularity (per-tensor/channel), and techniques (PTQ/QAT) so that the model's performance on its target task degrades as little as possible. 
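As a small numerical illustration of the affine (asymmetric) per-tensor scheme and the quantization noise it introduces, consider quantizing a random float32 tensor to int8 and dequantizing it back:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)  # a toy float32 tensor standing in for a weight matrix

# Affine (asymmetric) per-tensor scheme: map [min, max] onto the int8 range [-128, 127].
qmin, qmax = -128, 127
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = qmin - torch.round(x.min() / scale)

x_int8 = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
x_dequant = (x_int8.to(torch.float32) - zero_point) * scale

# The difference between x and x_dequant is the quantization noise discussed above.
print("max abs error:", (x - x_dequant).abs().max().item())
print("memory ratio:", x.element_size() / x_int8.element_size())  # 4x smaller per element
```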
@@ -171,4 +171,4 @@ To explore quantization and related performance optimization concepts more deepl - [Introduction to Quantization cooked in πŸ€— with πŸ’—πŸ§‘β€πŸ³](https://huggingface.co/blog/merve/quantization) - [EfficientML.ai Lecture 5 - Quantization Part I](https://www.youtube.com/watch?v=RP23-dRVDWM) - [Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html) -- [Accelerating Generative AI with PyTorch Part 2: LLM Optimizations](https://pytorch.org/blog/accelerating-generative-ai-2/) \ No newline at end of file +- [Accelerating Generative AI with PyTorch Part 2: LLM Optimizations](https://pytorch.org/blog/accelerating-generative-ai-2/) diff --git a/docs/source/en/quantization/finegrained_fp8.md b/docs/source/en/quantization/finegrained_fp8.md index bbf273d8d93..1afd1505029 100644 --- a/docs/source/en/quantization/finegrained_fp8.md +++ b/docs/source/en/quantization/finegrained_fp8.md @@ -59,4 +59,4 @@ Use [`~PreTrainedModel.save_pretrained`] to save the quantized model and reload quant_path = "/path/to/save/quantized/model" model.save_pretrained(quant_path) model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto") -``` \ No newline at end of file +``` diff --git a/docs/source/en/quantization/quanto.md b/docs/source/en/quantization/quanto.md index b3cf58b5b6a..f58f93025f4 100644 --- a/docs/source/en/quantization/quanto.md +++ b/docs/source/en/quantization/quanto.md @@ -66,4 +66,4 @@ model = torch.compile(model) Read the [Quanto: a PyTorch quantization backend for Optimum](https://huggingface.co/blog/quanto-introduction) blog post to learn more about the library design and benchmarks. -For more hands-on examples, take a look at the Quanto [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing). \ No newline at end of file +For more hands-on examples, take a look at the Quanto [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing). diff --git a/docs/source/en/quantization/selecting.md b/docs/source/en/quantization/selecting.md index 69b989bca88..49502c15b6c 100644 --- a/docs/source/en/quantization/selecting.md +++ b/docs/source/en/quantization/selecting.md @@ -112,9 +112,9 @@ Consider the quantization method below during fine-tuning to save memory. ### bitsandbytes[[training]] -* **Description:** The standard method for QLoRA fine-tuning via PEFT. -* **Pros:** Enables fine-tuning large models on consumer GPUs; widely supported and documented for PEFT. -* **Cons:** Primarily for NVIDIA GPUs. +* **Description:** The standard method for QLoRA fine-tuning via PEFT. +* **Pros:** Enables fine-tuning large models on consumer GPUs; widely supported and documented for PEFT. +* **Cons:** Primarily for NVIDIA GPUs. Other methods offer PEFT compatibility, though bitsandbytes is the most established and straightforward path for QLoRA. @@ -124,10 +124,10 @@ See the [bitsandbytes documentation](./bitsandbytes#qlora) and [PEFT Docs](https Methods like [AQLM](./aqlm), [SpQR](./spqr), [VPTQ](./vptq), [HIGGS](./higgs), etc., push the boundaries of compression (< 2-bit) or explore novel techniques. -* Consider these if: - * You need extreme compression (sub-4-bit). - * You are conducting research or require state-of-the-art results from their respective papers. - * You have significant compute resources available for potentially complex quantization procedures. +* Consider these if: + * You need extreme compression (sub-4-bit). 
+ * You are conducting research or require state-of-the-art results from their respective papers. + * You have significant compute resources available for potentially complex quantization procedures. We recommend consulting each methods documentation and associated papers carefully before choosing one for use in production. ## Benchmark Comparison @@ -154,4 +154,4 @@ The key takeaways are: | **Sub-4-bit** (VPTQ, AQLM, 2-bit GPTQ) | Extreme (>4x) | Noticeable drop, especially at 2-bit | Quantization times can be very long (AQLM, VPTQ). Performance varies. | > [!TIP] -> Always benchmark the performance (accuracy and speed) of the quantized model on your specific task and hardware to ensure it meets your requirements. Refer to the individual documentation pages linked above for detailed usage instructions. \ No newline at end of file +> Always benchmark the performance (accuracy and speed) of the quantized model on your specific task and hardware to ensure it meets your requirements. Refer to the individual documentation pages linked above for detailed usage instructions. diff --git a/docs/source/en/serving.md b/docs/source/en/serving.md index 9b3fffbd548..4287c5d2d5e 100644 --- a/docs/source/en/serving.md +++ b/docs/source/en/serving.md @@ -37,6 +37,7 @@ In this document, we dive into the different supported endpoints and modalities; You can serve models of diverse modalities supported by `transformers` with the `transformers serve` CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the _de facto_ standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the `transformers chat` CLI ([docs](conversations#chat-cli)). The server supports the following REST APIs: + - `/v1/chat/completions` - `/v1/responses` - `/v1/audio/transcriptions` diff --git a/docs/source/en/tasks/document_question_answering.md b/docs/source/en/tasks/document_question_answering.md index 902a948307f..2c729f76adc 100644 --- a/docs/source/en/tasks/document_question_answering.md +++ b/docs/source/en/tasks/document_question_answering.md @@ -104,6 +104,7 @@ yourself with the features. ``` Here's what the individual fields represent: + * `id`: the example's id * `image`: a PIL.Image.Image object containing the document image * `query`: the question string - natural language asked question, in several languages @@ -257,6 +258,7 @@ Once examples are encoded, however, they will look like this: ``` We'll need to find the position of the answer in the encoded input. + * `token_type_ids` tells us which tokens are part of the question, and which ones are part of the document's words. * `tokenizer.cls_token_id` will help find the special token at the beginning of the input. * `word_ids` will help match the answer found in the original `words` to the same answer in the full encoded input and determine @@ -365,6 +367,7 @@ of the Hugging Face course for inspiration. Congratulations! You've successfully navigated the toughest part of this guide and now you are ready to train your own model. Training involves the following steps: + * Load the model with [`AutoModelForDocumentQuestionAnswering`] using the same checkpoint as in the preprocessing. * Define your training hyperparameters in [`TrainingArguments`]. 
* Define a function to batch examples together, here the [`DefaultDataCollator`] will do just fine @@ -465,6 +468,7 @@ document question answering with your model, and pass the image + question combi ``` You can also manually replicate the results of the pipeline if you'd like: + 1. Take an image and a question, prepare them for the model using the processor from your model. 2. Forward the result or preprocessing through the model. 3. The model returns `start_logits` and `end_logits`, which indicate which token is at the start of the answer and diff --git a/docs/source/en/tasks/idefics.md b/docs/source/en/tasks/idefics.md index 5fef5953d5b..b03c7bccd9c 100644 --- a/docs/source/en/tasks/idefics.md +++ b/docs/source/en/tasks/idefics.md @@ -36,6 +36,7 @@ being a large model means it requires significant computational resources and in this approach suits your use case better than fine-tuning specialized models for each individual task. In this guide, you'll learn how to: + - [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model) - Use IDEFICS for: - [Image captioning](#image-captioning) diff --git a/docs/source/en/tasks/image_text_to_text.md b/docs/source/en/tasks/image_text_to_text.md index 5412882b59f..8820a534030 100644 --- a/docs/source/en/tasks/image_text_to_text.md +++ b/docs/source/en/tasks/image_text_to_text.md @@ -23,6 +23,7 @@ Image-text-to-text models, also known as vision language models (VLMs), are lang In this guide, we provide a brief overview of VLMs and show how to use them with Transformers for inference. To begin with, there are multiple types of VLMs: + - base models used for fine-tuning - chat fine-tuned models for conversation - instruction fine-tuned models diff --git a/docs/source/en/tasks/image_to_image.md b/docs/source/en/tasks/image_to_image.md index 645496d671b..55380e9b0d1 100644 --- a/docs/source/en/tasks/image_to_image.md +++ b/docs/source/en/tasks/image_to_image.md @@ -21,6 +21,7 @@ rendered properly in your Markdown viewer. Image-to-Image task is the task where an application receives an image and outputs another image. This has various subtasks, including image enhancement (super resolution, low light enhancement, deraining and so on), image inpainting, and more. This guide will show you how to: + - Use an image-to-image pipeline for super resolution task, - Run image-to-image models for same task without a pipeline. diff --git a/docs/source/en/tasks/mask_generation.md b/docs/source/en/tasks/mask_generation.md index 06ba26ea123..817cb9819e7 100644 --- a/docs/source/en/tasks/mask_generation.md +++ b/docs/source/en/tasks/mask_generation.md @@ -20,6 +20,7 @@ Mask generation is the task of generating semantically meaningful masks for an i This task is very similar to [image segmentation](semantic_segmentation), but many differences exist. Image segmentation models are trained on labeled datasets and are limited to the classes they have seen during training; they return a set of masks and corresponding classes, given an image. Mask generation models are trained on large amounts of data and operate in two modes. + - Prompting mode: In this mode, the model takes in an image and a prompt, where a prompt can be a 2D point location (XY coordinates) in the image within an object or a bounding box surrounding an object. In prompting mode, the model only returns the mask over the object that the prompt is pointing out. - Segment Everything mode: In segment everything, given an image, the model generates every mask in the image. 
To do so, a grid of points is generated and overlaid on the image for inference. @@ -34,6 +35,7 @@ SAM serves as a powerful foundation model for segmentation as it has large data [SA-1B](https://ai.meta.com/datasets/segment-anything/), a dataset with 1 million images and 1.1 billion masks. In this guide, you will learn how to: + - Infer in segment everything mode with batching, - Infer in point prompting mode, - Infer in box prompting mode. diff --git a/docs/source/en/tasks/masked_language_modeling.md b/docs/source/en/tasks/masked_language_modeling.md index 3c024739d73..619374f91da 100644 --- a/docs/source/en/tasks/masked_language_modeling.md +++ b/docs/source/en/tasks/masked_language_modeling.md @@ -150,6 +150,7 @@ To apply this preprocessing function over the entire dataset, use the πŸ€— Datas This dataset contains the token sequences, but some of these are longer than the maximum input length for the model. You can now use a second preprocessing function to + - concatenate all the sequences - split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM. diff --git a/docs/source/en/tasks/object_detection.md b/docs/source/en/tasks/object_detection.md index 093644b662f..ef2a86190bb 100644 --- a/docs/source/en/tasks/object_detection.md +++ b/docs/source/en/tasks/object_detection.md @@ -121,6 +121,7 @@ To get familiar with the data, explore what the examples look like. ``` The examples in the dataset have the following fields: + - `image_id`: the example image id - `image`: a `PIL.Image.Image` object containing the image - `width`: width of the image @@ -216,6 +217,7 @@ Instantiate the image processor from the same checkpoint as the model you want t ``` Before passing the images to the `image_processor`, apply two preprocessing transformations to the dataset: + - Augmenting images - Reformatting annotations to meet DETR expectations @@ -505,6 +507,7 @@ The images in this dataset are still quite large, even after resizing. This mean require at least one GPU. Training involves the following steps: + 1. Load the model with [`AutoModelForObjectDetection`] using the same checkpoint as in the preprocessing. 2. Define your training hyperparameters in [`TrainingArguments`]. 3. Pass the training arguments to [`Trainer`] along with the model, dataset, image processor, and data collator. @@ -527,9 +530,10 @@ and `id2label` maps that you created earlier from the dataset's metadata. Additi In the [`TrainingArguments`] use `output_dir` to specify where to save your model, then configure hyperparameters as you see fit. For `num_train_epochs=30` training will take about 35 minutes in Google Colab T4 GPU, increase the number of epoch to get better results. Important notes: - - Do not remove unused columns because this will drop the image column. Without the image column, you + +- Do not remove unused columns because this will drop the image column. Without the image column, you can't create `pixel_values`. For this reason, set `remove_unused_columns` to `False`. - - Set `eval_do_concat_batches=False` to get proper evaluation results. Images have different number of target boxes, if batches are concatenated we will not be able to determine which boxes belongs to particular image. +- Set `eval_do_concat_batches=False` to get proper evaluation results. Images have different number of target boxes, if batches are concatenated we will not be able to determine which boxes belongs to particular image. 
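Put together, those notes translate into training arguments along these lines; every value below other than the two flags is an illustrative placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="detr-finetuned-cppe5",    # placeholder output directory
    num_train_epochs=30,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    remove_unused_columns=False,   # keep the image column so pixel_values can be built
    eval_do_concat_batches=False,  # keep per-image box predictions separate during evaluation
    push_to_hub=False,
)
```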
If you wish to share your model by pushing to the Hub, set `push_to_hub` to `True` (you must be signed in to Hugging Face to upload your model). diff --git a/docs/source/en/tasks/semantic_segmentation.md b/docs/source/en/tasks/semantic_segmentation.md index 08d68047dc6..de88a0af686 100644 --- a/docs/source/en/tasks/semantic_segmentation.md +++ b/docs/source/en/tasks/semantic_segmentation.md @@ -23,6 +23,7 @@ rendered properly in your Markdown viewer. Image segmentation models separate areas corresponding to different areas of interest in an image. These models work by assigning a label to each pixel. There are several types of segmentation: semantic segmentation, instance segmentation, and panoptic segmentation. In this guide, we will: + 1. [Take a look at different types of segmentation](#types-of-segmentation). 2. [Have an end-to-end fine-tuning example for semantic segmentation](#fine-tuning-a-model-for-segmentation). diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md index b0f698f039e..58ca97e9a56 100644 --- a/docs/source/en/tasks/video_text_to_text.md +++ b/docs/source/en/tasks/video_text_to_text.md @@ -25,6 +25,7 @@ These models have nearly the same architecture as [image-text-to-text](../image_ In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference. To begin with, there are multiple types of video LMs: + - base models used for fine-tuning - chat fine-tuned models for conversation - instruction fine-tuned models diff --git a/docs/source/en/tasks/visual_question_answering.md b/docs/source/en/tasks/visual_question_answering.md index e06283c9ceb..e0f7873760e 100644 --- a/docs/source/en/tasks/visual_question_answering.md +++ b/docs/source/en/tasks/visual_question_answering.md @@ -23,6 +23,7 @@ The input to models supporting this task is typically a combination of an image answer expressed in natural language. Some noteworthy use case examples for VQA include: + * Accessibility applications for visually impaired individuals. * Education: posing questions about visual materials presented in lectures or textbooks. VQA can also be utilized in interactive museum exhibits or historical sites. * Customer service and e-commerce: VQA can enhance user experience by letting users ask questions about products. @@ -105,6 +106,7 @@ Let's take a look at an example to understand the dataset's features: ``` The features relevant to the task include: + * `question`: the question to be answered from the image * `image_id`: the path to the image the question refers to * `label`: the annotations @@ -325,6 +327,7 @@ learned something from the data and take the first example from the dataset to i Even though not very confident, the model indeed has learned something. With more examples and longer training, you'll get far better results! You can also manually replicate the results of the pipeline if you'd like: + 1. Take an image and a question, prepare them for the model using the processor from your model. 2. Forward the result or preprocessing through the model. 3. From the logits, get the most likely answer's id, and find the actual answer in the `id2label`. 
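A minimal sketch of those three steps, assuming a ViLT checkpoint fine-tuned for VQA and a local image file as placeholders:

```python
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

checkpoint = "dandelin/vilt-b32-finetuned-vqa"  # swap in your own fine-tuned model
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForQuestionAnswering.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # placeholder path
question = "How many cats are there?"

# 1. Prepare the image + question pair for the model
inputs = processor(image, question, return_tensors="pt")

# 2. Forward the preprocessed inputs through the model
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Pick the most likely answer id and look it up in id2label
answer_id = logits.argmax(-1).item()
print(model.config.id2label[answer_id])
```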
diff --git a/docs/source/en/tasks/zero_shot_image_classification.md b/docs/source/en/tasks/zero_shot_image_classification.md index d923ca44b40..b4ea0529b21 100644 --- a/docs/source/en/tasks/zero_shot_image_classification.md +++ b/docs/source/en/tasks/zero_shot_image_classification.md @@ -146,4 +146,4 @@ Pass the inputs through the model, and post-process the results: {'score': 0.0010570387, 'label': 'bike'}, {'score': 0.0003393686, 'label': 'tree'}, {'score': 3.1572064e-05, 'label': 'cat'}] -``` \ No newline at end of file +``` diff --git a/docs/source/en/tasks/zero_shot_object_detection.md b/docs/source/en/tasks/zero_shot_object_detection.md index 265bf52d4ed..434eca36e33 100644 --- a/docs/source/en/tasks/zero_shot_object_detection.md +++ b/docs/source/en/tasks/zero_shot_object_detection.md @@ -29,6 +29,7 @@ as a list of candidate classes, and output the bounding boxes and labels where t > Hugging Face houses many such [open vocabulary zero shot object detectors](https://huggingface.co/models?pipeline_tag=zero-shot-object-detection). In this guide, you will learn how to use such models: + - to detect objects based on text prompts - for batch object detection - for image-guided object detection diff --git a/docs/source/en/testing.md b/docs/source/en/testing.md index b5e79100317..01658aa2beb 100644 --- a/docs/source/en/testing.md +++ b/docs/source/en/testing.md @@ -845,11 +845,11 @@ commit it to the main repository we need make sure it's skipped during `make tes Methods: -- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip +- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping tests that depend on an external resource which is not available at the moment (for example a database). -- A **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet +- A **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with pytest.mark.xfail), it's an xpass and will be reported in the test summary. @@ -908,7 +908,7 @@ def test_feature_x(): docutils = pytest.importorskip("docutils", minversion="0.3") ``` -- Skip a test based on a condition: +- Skip a test based on a condition: ```python no-style @pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher")