diff --git a/ISSUES.md b/ISSUES.md index 9c96162647b..c87bd9fc2c3 100644 --- a/ISSUES.md +++ b/ISSUES.md @@ -38,7 +38,6 @@ In particular all "Please explain" questions or objectively very user-specific f * "How to train T5 on De->En translation?" - ## The GitHub Issues Everything which hints at a bug should be opened as an [issue](https://github.com/huggingface/transformers/issues). @@ -247,7 +246,6 @@ You are not required to read the following guidelines before opening an issue. H Try not use italics and bold text too much as these often make the text more difficult to read. - 12. If you are cross-referencing a specific comment in a given thread or another issue, always link to that specific comment, rather than using the issue link. If you do the latter it could be quite impossible to find which specific comment you're referring to. To get the link to the specific comment do not copy the url from the location bar of your browser, but instead, click the `...` icon in the upper right corner of the comment and then select "Copy Link". @@ -257,7 +255,6 @@ You are not required to read the following guidelines before opening an issue. H 1. https://github.com/huggingface/transformers/issues/9257 2. https://github.com/huggingface/transformers/issues/9257#issuecomment-749945162 - 13. If you are replying to a last comment, it's totally fine to make your reply with just your comment in it. The readers can follow the information flow here. But if you're replying to a comment that happened some comments back it's always a good practice to quote just the relevant lines you're replying it. The `>` is used for quoting, or you can always use the menu to do so. For example your editor box will look like: diff --git a/README.md b/README.md index 850b76f5c4f..8b09a84f29e 100644 --- a/README.md +++ b/README.md @@ -63,12 +63,11 @@ limitations under the License. +Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer +vision, audio, video, and multimodal model, for both inference and training. -Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer -vision, audio, video, and multimodal model, for both inference and training. - -It centralizes the model definition so that this definition is agreed upon across the ecosystem. `transformers` is the -pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training +It centralizes the model definition so that this definition is agreed upon across the ecosystem. `transformers` is the +pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch-Lightning, ...), inference engines (vLLM, SGLang, TGI, ...), and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from `transformers`. @@ -194,7 +193,6 @@ pipeline("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.pn
Visual question answering -

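As a quick illustration of the visual question answering pipeline referenced in the README hunk above, a minimal sketch might look like the following; the checkpoint and the question are assumptions for demonstration and are not part of this patch.

```python
from transformers import pipeline

# Minimal sketch (assumed checkpoint): a visual question answering pipeline
# applied to the same parrots image used in the README pipeline examples.
vqa = pipeline(task="visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

result = vqa(
    image="https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png",
    question="How many birds are in the picture?",
)
print(result)
```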
diff --git a/awesome-transformers.md b/awesome-transformers.md index adc84f101ea..d0398e7bde6 100644 --- a/awesome-transformers.md +++ b/awesome-transformers.md @@ -6,7 +6,7 @@ developers, researchers, students, professors, engineers, and anyone else to bui In this list, we showcase incredibly impactful and novel projects that have pushed the field forward. We celebrate 100 of these projects as we reach the milestone of 100k stars as a community; but we're very open to pull requests -adding other projects to the list. If you believe a project should be here and it's not, then please, open a PR +adding other projects to the list. If you believe a project should be here and it's not, then please, open a PR to add it. ## [gpt4all](https://github.com/nomic-ai/gpt4all) @@ -49,7 +49,7 @@ Keywords: LLMs, Large Language Models, Agents, Chains [LlamaIndex](https://github.com/run-llama/llama_index) is a project that provides a central interface to connect your LLM's with external data. It provides various kinds of indices and retrieval mechanisms to perform different LLM tasks and obtain knowledge-augmented results. -Keywords: LLMs, Large Language Models, Data Retrieval, Indices, Knowledge Augmentation +Keywords: LLMs, Large Language Models, Data Retrieval, Indices, Knowledge Augmentation ## [ParlAI](https://github.com/facebookresearch/ParlAI) @@ -257,7 +257,7 @@ Stable-Dreamfusion is a pytorch implementation of the text-to-3D model Dreamfusi Keywords: Text-to-3D, Stable Diffusion ## [txtai](https://github.com/neuml/txtai) - + [txtai](https://github.com/neuml/txtai) is an open-source platform for semantic search and workflows powered by language models. txtai builds embeddings databases, which are a union of vector indexes and relational databases enabling similarity search with SQL. Semantic workflows connect language models together into unified applications. Keywords: Semantic search, LLM @@ -309,8 +309,8 @@ Keywords: OCR, LaTeX, Math formula OpenCLIP is an open source implementation of OpenAI's CLIP. -The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. -The starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. +The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. +The starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. Specifically, a ResNet-50 model trained with this codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet. @@ -596,7 +596,7 @@ Keywords: Data-Centric AI, Data Quality, Noisy Labels, Outlier Detection, Active ## [BentoML](https://github.com/bentoml/BentoML) -[BentoML](https://github.com/bentoml) is the unified framework for building, shipping, and scaling production-ready AI applications incorporating traditional ML, pre-trained AI models, Generative and Large Language Models. +[BentoML](https://github.com/bentoml) is the unified framework for building, shipping, and scaling production-ready AI applications incorporating traditional ML, pre-trained AI models, Generative and Large Language Models. 
All Hugging Face models and pipelines can be seamlessly integrated into BentoML applications, enabling the running of models on the most suitable hardware and independent scaling based on usage. Keywords: BentoML, Framework, Deployment, AI Applications @@ -606,4 +606,3 @@ Keywords: BentoML, Framework, Deployment, AI Applications [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) offers a user-friendly fine-tuning framework that incorporates PEFT. The repository includes training(fine-tuning) and inference examples for LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, and other LLMs. A ChatGLM version is also available in [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning). Keywords: PEFT, fine-tuning, LLaMA-2, ChatGLM, Qwen - diff --git a/docs/source/en/accelerator_selection.md b/docs/source/en/accelerator_selection.md index 5d5bbc2675f..3cd809cba6a 100644 --- a/docs/source/en/accelerator_selection.md +++ b/docs/source/en/accelerator_selection.md @@ -69,7 +69,6 @@ CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ... Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively. To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`): - ```bash CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ... ``` @@ -108,7 +107,6 @@ To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`): ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ... ``` - You can also control the order of Intel XPUs with: ```bash @@ -120,7 +118,5 @@ For more information about device enumeration and sorting on Intel XPU, please r - - > [!WARNING] > Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line. diff --git a/docs/source/en/auto_docstring.md b/docs/source/en/auto_docstring.md index 5fc4ed061ce..6445ee53014 100644 --- a/docs/source/en/auto_docstring.md +++ b/docs/source/en/auto_docstring.md @@ -145,7 +145,6 @@ Arguments can also be passed directly to `@auto_docstring` for more control. Use The `Returns` and `Examples` parts of the docstring can also be manually specified. - ```python MODEL_COMMON_CUSTOM_ARGS = r""" common_arg_1 (`torch.Tensor`, *optional*, defaults to `default_value`): @@ -202,7 +201,6 @@ There are some rules for documenting different types of arguments and they're li If a standard argument behaves differently in your model, then you can override it locally in a `r""" """` block. This local definition has a higher priority. For example, the `labels` argument is often customized per model and typically requires overriding. - - New or custom arguments should be documented within an `r""" """` block after the signature if it is a function or in the `__init__` method's docstring if it is a class. ```py diff --git a/docs/source/en/cache_explanation.md b/docs/source/en/cache_explanation.md index 0e192fd47f4..77fc2c9c328 100644 --- a/docs/source/en/cache_explanation.md +++ b/docs/source/en/cache_explanation.md @@ -59,11 +59,9 @@ Refer to the table below to compare how caching improves efficiency. 
| without caching | with caching | |---|---| -| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` +| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` | attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) | - - ## Cache class A basic KV cache interface takes a key and value tensor for the current token and returns the updated `K` and `V` tensors. This is internally managed by a model's `forward` method. @@ -143,7 +141,6 @@ Cache position is used internally for two purposes: The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots. - ```py import torch from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache, infer_device @@ -160,7 +157,6 @@ generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=10) ``` - ## Legacy cache format Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`]. diff --git a/docs/source/en/chat_extras.md b/docs/source/en/chat_extras.md index dc933dd6815..20d5cf22ce4 100644 --- a/docs/source/en/chat_extras.md +++ b/docs/source/en/chat_extras.md @@ -29,7 +29,6 @@ the arguments, argument types, and function docstring are parsed in order to gen Although passing Python functions is very convenient, the parser can only handle [Google-style](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) docstrings. Refer to the examples below for how to format a tool-ready function. - ```py def get_current_temperature(location: str, unit: str): """ @@ -103,7 +102,6 @@ Hold the call in the `tool_calls` key of an `assistant` message. This is the rec > [!WARNING] > Although `tool_calls` is similar to the OpenAI API, the OpenAI API uses a JSON string as its `tool_calls` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict. - ```py tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}} messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}) @@ -131,7 +129,6 @@ The temperature in Paris, France right now is 22°C.<|im_end|> > Although the key in the assistant message is called `tool_calls`, in most cases, models only emit a single tool call at a time. Some older models emit multiple tool calls at the same time, but this is a > significantly more complex process, as you need to handle multiple tool responses at once and disambiguate them, often using tool call IDs. Please refer to the model card to see exactly what format a model expects for tool calls. - ## JSON schemas Another way to define tools is by passing a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step). diff --git a/docs/source/en/chat_templating.md b/docs/source/en/chat_templating.md index b32fa8ec43f..b1e8428afaa 100644 --- a/docs/source/en/chat_templating.md +++ b/docs/source/en/chat_templating.md @@ -16,13 +16,13 @@ rendered properly in your Markdown viewer. 
# Chat templates -The [chat basics](./conversations) guide covers how to store chat histories and generate text from chat models using [`TextGenerationPipeline`]. +The [chat basics](./conversations) guide covers how to store chat histories and generate text from chat models using [`TextGenerationPipeline`]. This guide is intended for more advanced users, and covers the underlying classes and methods, as well as the key concepts for understanding what's actually going on when you chat with a model. The critical insight needed to understand chat models is this: All causal LMs, whether chat-trained or not, continue a sequence of tokens. When causal LMs are trained, the training usually begins with "pre-training" on a huge corpus of text, which creates a "base" model. These base models are then often "fine-tuned" for chat, which means training them on data that is formatted as a sequence of messages. The chat is still just a sequence of tokens, though! The list of `role` and `content` dictionaries that you pass -to a chat model get converted to a token sequence, often with control tokens like `<|user|>` or `<|assistant|>` or `<|end_of_message|>`, which allow the model to see the chat structure. +to a chat model get converted to a token sequence, often with control tokens like `<|user|>` or `<|assistant|>` or `<|end_of_message|>`, which allow the model to see the chat structure. There are many possible chat formats, and different models may use different formats or control tokens, even if they were fine-tuned from the same base model! Don't panic, though - you don't need to memorize every possible chat format in order to use chat models. Chat models come with **chat templates**, which indicate how they expect chats to be formatted. @@ -43,6 +43,7 @@ chat = [ tokenizer.apply_chat_template(chat, tokenize=False) ``` + ```md [INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST] ``` @@ -62,6 +63,7 @@ chat = [ tokenizer.apply_chat_template(chat, tokenize=False) ``` + ```md <|user|>\nHello, how are you?\n<|assistant|>\nI'm doing great. How can I help you today?\n<|user|>\nI'd like to show off how chat templating works!\n ``` @@ -110,6 +112,7 @@ Pass the tokenized chat to [`~GenerationMixin.generate`] to generate a response. outputs = model.generate(tokenized_chat, max_new_tokens=128) print(tokenizer.decode(outputs[0])) ``` + ```md <|system|> You are a friendly chatbot who always responds in the style of a pirate @@ -125,9 +128,9 @@ Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopte ### add_generation_prompt -You may have noticed the [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) argument in the above examples. +You may have noticed the [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) argument in the above examples. This argument adds tokens to the end of the chat that indicate the start of an `assistant` response. Remember: Beneath all the chat abstractions, chat models are still just language models that continue a sequence of tokens! 
-If you include tokens that tell it that it's now in an `assistant` response, it will correctly write a response, but if you don't include these tokens, the model may get confused and do something strange, like **continuing** the user's message instead of replying to it! +If you include tokens that tell it that it's now in an `assistant` response, it will correctly write a response, but if you don't include these tokens, the model may get confused and do something strange, like **continuing** the user's message instead of replying to it! Let's see an example to understand what `add_generation_prompt` is actually doing. First, let's format a chat without `add_generation_prompt`: @@ -135,6 +138,7 @@ Let's see an example to understand what `add_generation_prompt` is actually doin tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) tokenized_chat ``` + ```md <|im_start|>user Hi there!<|im_end|> @@ -150,6 +154,7 @@ Now, let's format the same chat with `add_generation_prompt=True`: tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) tokenized_chat ``` + ```md <|im_start|>user Hi there!<|im_end|> @@ -186,7 +191,6 @@ model.generate(**formatted_chat) [`TextGenerationPipeline`] sets [add_generation_prompt](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.add_generation_prompt) to `True` by default to start a new message. However, if the final message in the chat has the `assistant` role, it assumes the message is a prefill and switches to `continue_final_message=True`. This is because most models don’t support multiple consecutive assistant messages. To override this behavior, explicitly pass the [continue_final_message](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template.continue_final_message) argument to the pipeline. - ## Model training Training a model with a chat template is a good way to ensure the template matches the tokens the model was trained on. Apply the chat template as a preprocessing step to your dataset. Set `add_generation_prompt=False` because the additional tokens to prompt an assistant response aren't helpful during training. @@ -212,6 +216,7 @@ dataset = Dataset.from_dict({"chat": [chat1, chat2]}) dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)}) print(dataset['formatted_chat'][0]) ``` + ```md <|user|> Which is bigger, the moon or the sun? diff --git a/docs/source/en/chat_templating_multimodal.md b/docs/source/en/chat_templating_multimodal.md index f28c09e96b6..e469fde86b5 100644 --- a/docs/source/en/chat_templating_multimodal.md +++ b/docs/source/en/chat_templating_multimodal.md @@ -18,8 +18,7 @@ rendered properly in your Markdown viewer. Multimodal chat models accept inputs like images, audio or video, in addition to text. The `content` key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models whose `content` key is a single string. 
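To make the difference concrete, here is a minimal sketch of what a multimodal chat history can look like, with `content` as a list of typed items; the image URL and the wording are assumptions for illustration only.

```python
# Illustrative only: `content` is a list of typed entries rather than a single
# string, mixing an image reference with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }
]
```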
- -In the same way the [Tokenizer](./fast_tokenizer) class handles chat templates and tokenization for text-only models, +In the same way the [Tokenizer](./fast_tokenizer) class handles chat templates and tokenization for text-only models, the [Processor](./processors) class handles preprocessing, tokenization and chat templates for multimodal models. Their [`~ProcessorMixin.apply_chat_template`] methods are almost identical. This guide will show you how to chat with multimodal models with the high-level [`ImageTextToTextPipeline`] and at a lower level using the [`~ProcessorMixin.apply_chat_template`] and [`~GenerationMixin.generate`] methods. @@ -57,7 +56,6 @@ out = pipe(text=messages, max_new_tokens=128) print(out[0]['generated_text'][-1]['content']) ``` - ``` Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they? ``` @@ -66,10 +64,9 @@ Aside from the gradual descent from pirate-speak into modern American English (i ## Using `apply_chat_template` -Like [text-only models](./chat_templating), use the [`~ProcessorMixin.apply_chat_template`] method to prepare the chat messages for multimodal models. +Like [text-only models](./chat_templating), use the [`~ProcessorMixin.apply_chat_template`] method to prepare the chat messages for multimodal models. This method handles the tokenization and formatting of the chat messages, including images and other media types. The resulting inputs are passed to the model for generation. - ```python from transformers import AutoProcessor, AutoModelForImageTextToText @@ -99,7 +96,6 @@ processed_chat = processor.apply_chat_template(messages, add_generation_prompt=T print(list(processed_chat.keys())) ``` - ``` ['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw'] ``` @@ -113,7 +109,6 @@ print(processor.decode(out[0])) The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user. - ## Video inputs Some vision models also support video inputs. The message format is very similar to the format for [image inputs](#image-inputs). @@ -148,6 +143,7 @@ messages = [ ``` ### Example: Passing decoded video objects + ```python import numpy as np @@ -167,7 +163,9 @@ messages = [ }, ] ``` + You can also use existing (`"load_video()"`) function to load a video, edit the video in memory and pass it in the messages. + ```python # Make sure a video backend library (pyav, decord, or torchvision) is available. @@ -200,7 +198,6 @@ Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input The `num_frames` parameter controls how many frames to uniformly sample from the video. Each checkpoint has a maximum frame count it was pretrained with and exceeding this count can significantly lower generation quality. It's important to choose a frame count that fits both the model capacity and your hardware resources. If `num_frames` isn't specified, the entire video is loaded without any frame sampling. 
- ```python processed_chat = processor.apply_chat_template( messages, @@ -265,4 +262,3 @@ print(processed_chat.keys()) - diff --git a/docs/source/en/chat_templating_writing.md b/docs/source/en/chat_templating_writing.md index f4f3b1201e3..936ce2a2c7f 100644 --- a/docs/source/en/chat_templating_writing.md +++ b/docs/source/en/chat_templating_writing.md @@ -18,7 +18,6 @@ rendered properly in your Markdown viewer. A chat template is a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template stored in the tokenizer's [chat_template](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.chat_template) attribute. Jinja is a templating language that allows you to write Python-like code and syntax. - ```jinja {%- for message in messages %} {{- '<|' + message['role'] + |>\n' }} @@ -108,7 +107,6 @@ We strongly recommend using `-` to ensure only the intended content is printed. ### Special variables and callables - The only constants in a template are the `messages` variable and the `add_generation_prompt` boolean. However, you have access to **any other keyword arguments that are passed** to the [`~PreTrainedTokenizerBase.apply_chat_template`] method. diff --git a/docs/source/en/conversations.md b/docs/source/en/conversations.md index 0fed56c632d..a36be2203a5 100644 --- a/docs/source/en/conversations.md +++ b/docs/source/en/conversations.md @@ -48,7 +48,6 @@ transformers chat -h The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving.md#serve-cli)). - ## TextGenerationPipeline [`TextGenerationPipeline`] is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format). @@ -109,7 +108,7 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True) pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config}) ``` -In general, model size and performance are directly correlated. Larger models are slower in addition to requiring more memory because each active parameter must be read from memory for every generated token. +In general, model size and performance are directly correlated. Larger models are slower in addition to requiring more memory because each active parameter must be read from memory for every generated token. This is a bottleneck for LLM text generation and the main options for improving generation speed are to either quantize a model or use hardware with higher memory bandwidth. Adding more compute power doesn't meaningfully help. You can also try techniques like [speculative decoding](./generation_strategies#speculative-decoding), where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token at a time. This significantly alleviates the bandwidth bottleneck and improves generation speed. diff --git a/docs/source/en/cursor.md b/docs/source/en/cursor.md index 18ebe803edf..799e1715b3b 100644 --- a/docs/source/en/cursor.md +++ b/docs/source/en/cursor.md @@ -38,5 +38,3 @@ You are now ready to use your local model in Cursor! For instance, if you toggle

- - diff --git a/docs/source/en/generation_strategies.md b/docs/source/en/generation_strategies.md index 63b70899af4..3c277fa7df0 100644 --- a/docs/source/en/generation_strategies.md +++ b/docs/source/en/generation_strategies.md @@ -389,7 +389,6 @@ from .utils import some_function Only relative imports from the same-level `custom_generate` folder are supported. Parent/sibling folder imports are not valid. The `custom_generate` argument also works locally with any directory that contains a `custom_generate` structure. This is the recommended workflow for developing your custom generation method. - #### requirements.txt You can optionally specify additional Python requirements in a `requirements.txt` file inside the `custom_generate` folder. These are checked at runtime and an exception will be thrown if they're missing, nudging users to update their environment accordingly. diff --git a/docs/source/en/index.md b/docs/source/en/index.md index ab0677b5a54..e9738f6ccfa 100644 --- a/docs/source/en/index.md +++ b/docs/source/en/index.md @@ -19,7 +19,6 @@ rendered properly in your Markdown viewer. - Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer vision, audio, video, and multimodal model, for both inference and training. diff --git a/docs/source/en/internal/file_utils.md b/docs/source/en/internal/file_utils.md index 31fbc5b8811..63db5756a62 100644 --- a/docs/source/en/internal/file_utils.md +++ b/docs/source/en/internal/file_utils.md @@ -20,7 +20,6 @@ This page lists all of Transformers general utility functions that are found in Most of those are only useful if you are studying the general code in the library. - ## Enums and namedtuples [[autodoc]] utils.ExplicitEnum diff --git a/docs/source/en/internal/generation_utils.md b/docs/source/en/internal/generation_utils.md index d47eba82d8c..87b0111ff05 100644 --- a/docs/source/en/internal/generation_utils.md +++ b/docs/source/en/internal/generation_utils.md @@ -65,7 +65,6 @@ values. Here, for instance, it has two keys that are `sequences` and `scores`. We document here all output types. - [[autodoc]] generation.GenerateDecoderOnlyOutput [[autodoc]] generation.GenerateEncoderDecoderOutput @@ -74,13 +73,11 @@ We document here all output types. [[autodoc]] generation.GenerateBeamEncoderDecoderOutput - ## LogitsProcessor A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for generation. - [[autodoc]] AlternatingCodebooksLogitsProcessor - __call__ @@ -174,8 +171,6 @@ generation. [[autodoc]] WatermarkLogitsProcessor - __call__ - - ## StoppingCriteria A [`StoppingCriteria`] can be used to change when to stop generation (other than EOS token). Please note that this is exclusively available to our PyTorch implementations. @@ -300,7 +295,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens - to_legacy_cache - from_legacy_cache - ## Watermark Utils [[autodoc]] WatermarkingConfig diff --git a/docs/source/en/internal/import_utils.md b/docs/source/en/internal/import_utils.md index 77554c85b02..15325819817 100644 --- a/docs/source/en/internal/import_utils.md +++ b/docs/source/en/internal/import_utils.md @@ -22,8 +22,8 @@ worked around. We don't want for all users of `transformers` to have to install we therefore mark those as soft dependencies rather than hard dependencies. 
The transformers toolkit is not made to error-out on import of a model that has a specific dependency; instead, an -object for which you are lacking a dependency will error-out when calling any method on it. As an example, if -`torchvision` isn't installed, the fast image processors will not be available. +object for which you are lacking a dependency will error-out when calling any method on it. As an example, if +`torchvision` isn't installed, the fast image processors will not be available. This object is still importable: @@ -55,7 +55,7 @@ All objects under a given filename have an automatic dependency to the tool link **Tokenizers**: All files starting with `tokenization_` and ending with `_fast` have an automatic `tokenizers` dependency -**Vision**: All files starting with `image_processing_` have an automatic dependency to the `vision` dependency group; +**Vision**: All files starting with `image_processing_` have an automatic dependency to the `vision` dependency group; at the time of writing, this only contains the `pillow` dependency. **Vision + Torch + Torchvision**: All files starting with `image_processing_` and ending with `_fast` have an automatic @@ -66,7 +66,7 @@ All of these automatic dependencies are added on top of the explicit dependencie ### Explicit Object Dependencies We add a method called `requires` that is used to explicitly specify the dependencies of a given object. As an -example, the `Trainer` class has two hard dependencies: `torch` and `accelerate`. Here is how we specify these +example, the `Trainer` class has two hard dependencies: `torch` and `accelerate`. Here is how we specify these required dependencies: ```python diff --git a/docs/source/en/internal/model_debugging_utils.md b/docs/source/en/internal/model_debugging_utils.md index cf2c0353fc7..aa5371cd38e 100644 --- a/docs/source/en/internal/model_debugging_utils.md +++ b/docs/source/en/internal/model_debugging_utils.md @@ -21,10 +21,8 @@ provides for it. Most of those are only useful if you are adding new models in the library. - ## Model addition debuggers - ### Model addition debugger - context manager for model adders This context manager is a power user tool intended for model adders. It tracks all forward calls within a model forward @@ -72,7 +70,6 @@ with model_addition_debugger_context( ``` - ### Reading results The debugger generates two files from the forward call, both with the same base name, but ending either with @@ -231,10 +228,8 @@ Once the forward passes of two models have been traced by the debugger, one can below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong. - ![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/files_difference_debugging.png) - ### Limitations and scope This feature will only work for torch-based models. Models relying heavily on external kernel calls may work, but trace will @@ -253,7 +248,7 @@ layers. This small util is a power user tool intended for model adders and maintainers. It lists all test methods existing in `test_modeling_common.py`, inherited by all model tester classes, and scans the repository to measure -how many tests are being skipped and for which models. +how many tests are being skipped and for which models. 
### Rationale @@ -268,8 +263,7 @@ This utility: ![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/f7f671f69b88ce4967e19179172c248958d35742/transformers/tests_skipped_visualisation.png) - -### Usage +### Usage You can run the skipped test analyzer in two ways: diff --git a/docs/source/en/internal/pipelines_utils.md b/docs/source/en/internal/pipelines_utils.md index 6ea6de9a61b..23856e5639c 100644 --- a/docs/source/en/internal/pipelines_utils.md +++ b/docs/source/en/internal/pipelines_utils.md @@ -20,7 +20,6 @@ This page lists all the utility functions the library provides for pipelines. Most of those are only useful if you are studying the code of the models in the library. - ## Argument handling [[autodoc]] pipelines.ArgumentHandler diff --git a/docs/source/en/kv_cache.md b/docs/source/en/kv_cache.md index f0a781cba4f..a7c39a6a8d2 100644 --- a/docs/source/en/kv_cache.md +++ b/docs/source/en/kv_cache.md @@ -67,7 +67,7 @@ out = model.generate(**inputs, do_sample=False, max_new_tokens=20, past_key_valu ## Fixed-size cache -The default [`DynamicCache`] prevents you from taking advantage of most just-in-time (JIT) optimizations because the cache size isn't fixed. JIT optimizations enable you to maximize latency at the expense of memory usage. All of the following cache types are compatible with JIT optimizations like [torch.compile](./llm_optims#static-kv-cache-and-torchcompile) to accelerate generation. +The default [`DynamicCache`] prevents you from taking advantage of most just-in-time (JIT) optimizations because the cache size isn't fixed. JIT optimizations enable you to maximize latency at the expense of memory usage. All of the following cache types are compatible with JIT optimizations like [torch.compile](./llm_optims#static-kv-cache-and-torchcompile) to accelerate generation. A fixed-size cache ([`StaticCache`]) pre-allocates a specific maximum cache size for the kv pairs. You can generate up to the maximum cache size without needing to modify it. However, having a fixed (usually large) size for the key/value states means that while generating, a lot of tokens will actually be masked as they should not take part in the attention. So this trick allows to easily `compile` the decoding stage, but it incurs a waste of tokens in the attention computation. As all things, it's then a trade-off which should be very good if you generate with several sequence of more or less the same lengths, but may be sub-optimal if you have for example 1 very large sequence, and then only short sequences (as the fix cache size would be large, a lot would be wasted for the short sequences). Make sure you understand the impact if you use it! diff --git a/docs/source/en/llm_tutorial.md b/docs/source/en/llm_tutorial.md index 0f4f91d30a6..0cbbbc6ac04 100644 --- a/docs/source/en/llm_tutorial.md +++ b/docs/source/en/llm_tutorial.md @@ -24,6 +24,7 @@ In Transformers, the [`~GenerationMixin.generate`] API handles text generation, > [!TIP] > You can also chat with a model directly from the command line. ([reference](./conversations.md#transformers)) +> > ```shell > transformers chat Qwen/Qwen2.5-0.5B-Instruct > ``` @@ -35,6 +36,7 @@ Before you begin, it's helpful to install [bitsandbytes](https://hf.co/docs/bits ```bash !pip install -U transformers bitsandbytes ``` + Bitsandbytes supports multiple backends in addition to CUDA-based GPUs. Refer to the multi-backend installation [guide](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend) to learn more. 
Load a LLM with [`~PreTrainedModel.from_pretrained`] and add the following two parameters to reduce the memory requirements. @@ -154,7 +156,6 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) | `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. | | `eos_token_id` | `list[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. | - ## Pitfalls The section below covers some common issues you may encounter during text generation and how to solve them. diff --git a/docs/source/en/llm_tutorial_optimization.md b/docs/source/en/llm_tutorial_optimization.md index 63d9308a84f..04a61dd82cb 100644 --- a/docs/source/en/llm_tutorial_optimization.md +++ b/docs/source/en/llm_tutorial_optimization.md @@ -66,6 +66,7 @@ If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows ```bash !pip install transformers accelerate bitsandbytes optimum ``` + ```python from transformers import AutoModelForCausalLM @@ -98,6 +99,7 @@ result ``` **Output**: + ``` Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single ``` @@ -116,6 +118,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) ``` **Output**: + ```bash 29.0260648727417 ``` @@ -127,7 +130,6 @@ Note that if we had tried to run the model in full float32 precision, a whopping If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., dtype=...)` except when the original type is float32 in which case one can use both `float16` or `bfloat16` for inference. - Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory. ```python @@ -148,6 +150,7 @@ Let's call it now for the next experiment. ```python flush() ``` + From the Accelerate library, you can also use a device-agnostic utility method called [release_memory](https://github.com/huggingface/accelerate/blob/29be4788629b772a3b722076e433b5b3b5c85da3/src/accelerate/utils/memory.py#L63), which takes various hardware backends like XPU, MLU, NPU, MPS, and more into account. ```python @@ -204,6 +207,7 @@ result ``` **Output**: + ``` Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single ``` @@ -215,6 +219,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) ``` **Output**: + ``` 15.219234466552734 ``` @@ -222,8 +227,8 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090. We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference. - We delete the models and flush the memory again. 
+ ```python del model del pipe @@ -245,6 +250,7 @@ result ``` **Output**: + ``` Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument ``` @@ -256,6 +262,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) ``` **Output**: + ``` 9.543574333190918 ``` @@ -270,6 +277,7 @@ Also note that inference here was again a bit slower compared to 8-bit quantizat del model del pipe ``` + ```python flush() ``` @@ -384,6 +392,7 @@ def alternating(list1, list2): ----- """ ``` + For demonstration purposes, we duplicate the system prompt by ten so that the input length is long enough to observe Flash Attention's memory savings. We append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"` @@ -413,6 +422,7 @@ result ``` **Output**: + ``` Generated in 10.96854019165039 seconds. Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef @@ -429,6 +439,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) ``` **Output**: + ```bash 37.668193340301514 ``` @@ -460,6 +471,7 @@ result ``` **Output**: + ``` Generated in 3.0211617946624756 seconds. Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef @@ -474,6 +486,7 @@ bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) ``` **Output**: + ``` 32.617331981658936 ``` @@ -604,6 +617,7 @@ generated_text ``` **Output**: + ``` shape of input_ids torch.Size([1, 21]) shape of input_ids torch.Size([1, 22]) @@ -641,6 +655,7 @@ generated_text ``` **Output**: + ``` shape of input_ids torch.Size([1, 1]) length of key-value cache 20 @@ -712,6 +727,7 @@ tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):] ``` **Output**: + ``` is a modified version of the function that returns Mega bytes instead. @@ -733,6 +749,7 @@ config = model.config ``` **Output**: + ``` 7864320000 ``` @@ -773,7 +790,6 @@ The most notable application of GQA is [Llama-v2](https://huggingface.co/meta-ll > As a conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences as is the case for example for chat. - ## Conclusion The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. As an example, one such promising research direction is [speculative decoding](https://huggingface.co/papers/2211.17192) where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. Going into more detail is out of the scope of this notebook, but can be read upon in this [nice blog post](https://huggingface.co/blog/assisted-generation). diff --git a/docs/source/en/main_classes/callback.md b/docs/source/en/main_classes/callback.md index b29c9e7264e..bc1413a9474 100644 --- a/docs/source/en/main_classes/callback.md +++ b/docs/source/en/main_classes/callback.md @@ -54,7 +54,6 @@ The main class that implements callbacks is [`TrainerCallback`]. It gets the Trainer's internal state via [`TrainerState`], and can take some actions on the training loop via [`TrainerControl`]. 
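As a hedged sketch of the pattern described in the callback.md hunk above (not part of this patch), a custom callback typically subclasses [`TrainerCallback`], reads [`TrainerState`], and optionally adjusts [`TrainerControl`]; the step threshold below is an assumed example value.

```python
from transformers import TrainerCallback

class StopAfterStepCallback(TrainerCallback):
    """Illustrative callback: inspect TrainerState and use TrainerControl to stop early."""

    def __init__(self, stop_step=1000):
        self.stop_step = stop_step

    def on_step_end(self, args, state, control, **kwargs):
        # `state.global_step` comes from TrainerState; setting this flag on
        # `control` (a TrainerControl) asks the Trainer to stop training.
        if state.global_step >= self.stop_step:
            control.should_training_stop = True
        return control
```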
- ## Available Callbacks Here is the list of the available [`TrainerCallback`] in the library: diff --git a/docs/source/en/main_classes/configuration.md b/docs/source/en/main_classes/configuration.md index 0cfef06d3ce..933621f6a14 100644 --- a/docs/source/en/main_classes/configuration.md +++ b/docs/source/en/main_classes/configuration.md @@ -24,7 +24,6 @@ Each derived config class implements model specific attributes. Common attribute `hidden_size`, `num_attention_heads`, and `num_hidden_layers`. Text models further implement: `vocab_size`. - ## PretrainedConfig [[autodoc]] PretrainedConfig diff --git a/docs/source/en/main_classes/data_collator.md b/docs/source/en/main_classes/data_collator.md index 2941338375b..33d156ec93f 100644 --- a/docs/source/en/main_classes/data_collator.md +++ b/docs/source/en/main_classes/data_collator.md @@ -25,7 +25,6 @@ on the formed batch. Examples of use can be found in the [example scripts](../examples) or [example notebooks](../notebooks). - ## Default data collator [[autodoc]] data.data_collator.default_data_collator diff --git a/docs/source/en/main_classes/deepspeed.md b/docs/source/en/main_classes/deepspeed.md index 0b9e28656c0..b04949229da 100644 --- a/docs/source/en/main_classes/deepspeed.md +++ b/docs/source/en/main_classes/deepspeed.md @@ -16,7 +16,7 @@ rendered properly in your Markdown viewer. # DeepSpeed -[DeepSpeed](https://github.com/deepspeedai/DeepSpeed), powered by Zero Redundancy Optimizer (ZeRO), is an optimization library for training and fitting very large models onto a GPU. It is available in several ZeRO stages, where each stage progressively saves more GPU memory by partitioning the optimizer state, gradients, parameters, and enabling offloading to a CPU or NVMe. DeepSpeed is integrated with the [`Trainer`] class and most of the setup is automatically taken care of for you. +[DeepSpeed](https://github.com/deepspeedai/DeepSpeed), powered by Zero Redundancy Optimizer (ZeRO), is an optimization library for training and fitting very large models onto a GPU. It is available in several ZeRO stages, where each stage progressively saves more GPU memory by partitioning the optimizer state, gradients, parameters, and enabling offloading to a CPU or NVMe. DeepSpeed is integrated with the [`Trainer`] class and most of the setup is automatically taken care of for you. However, if you want to use DeepSpeed without the [`Trainer`], Transformers provides a [`HfDeepSpeedConfig`] class. diff --git a/docs/source/en/main_classes/executorch.md b/docs/source/en/main_classes/executorch.md index 3178085c913..3406309aa32 100644 --- a/docs/source/en/main_classes/executorch.md +++ b/docs/source/en/main_classes/executorch.md @@ -15,14 +15,12 @@ rendered properly in your Markdown viewer. --> - # ExecuTorch [`ExecuTorch`](https://github.com/pytorch/executorch) is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance. ExecuTorch introduces well defined entry points to perform model, device, and/or use-case specific optimizations such as backend delegation, user-defined compiler transformations, memory planning, and more. The first step in preparing a PyTorch model for execution on an edge device using ExecuTorch is to export the model. 
This is achieved through the use of a PyTorch API called [`torch.export`](https://pytorch.org/docs/stable/export.html). - ## ExecuTorch Integration An integration point is being developed to ensure that 🤗 Transformers can be exported using `torch.export`. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in `ExecuTorch`, particularly for mobile and edge use cases. diff --git a/docs/source/en/main_classes/feature_extractor.md b/docs/source/en/main_classes/feature_extractor.md index fd451a35481..294ecad6309 100644 --- a/docs/source/en/main_classes/feature_extractor.md +++ b/docs/source/en/main_classes/feature_extractor.md @@ -18,7 +18,6 @@ rendered properly in your Markdown viewer. A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy and PyTorch tensors. - ## FeatureExtractionMixin [[autodoc]] feature_extraction_utils.FeatureExtractionMixin diff --git a/docs/source/en/main_classes/image_processor.md b/docs/source/en/main_classes/image_processor.md index 7dc9de60571..61be0306630 100644 --- a/docs/source/en/main_classes/image_processor.md +++ b/docs/source/en/main_classes/image_processor.md @@ -26,6 +26,7 @@ from transformers import AutoImageProcessor processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True) ``` + Note that `use_fast` will be set to `True` by default in a future release. When using a fast image processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise. @@ -57,7 +58,6 @@ Here are some speed comparisons between the base and fast image processors for t These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon.com/ec2/instance-types/g5/), utilizing an NVIDIA A10G Tensor Core GPU. - ## ImageProcessingMixin [[autodoc]] image_processing_utils.ImageProcessingMixin @@ -72,7 +72,6 @@ These benchmarks were run on an [AWS EC2 g5.2xlarge instance](https://aws.amazon [[autodoc]] image_processing_utils.BaseImageProcessor - ## BaseImageProcessorFast [[autodoc]] image_processing_utils_fast.BaseImageProcessorFast diff --git a/docs/source/en/main_classes/logging.md b/docs/source/en/main_classes/logging.md index 5cbdf9ae27e..34da2ac9d1b 100644 --- a/docs/source/en/main_classes/logging.md +++ b/docs/source/en/main_classes/logging.md @@ -55,7 +55,6 @@ logger.info("INFO") logger.warning("WARN") ``` - All the methods of this logging module are documented below, the main ones are [`logging.get_verbosity`] to get the current level of verbosity in the logger and [`logging.set_verbosity`] to set the verbosity to the level of your choice. In order (from the least diff --git a/docs/source/en/main_classes/model.md b/docs/source/en/main_classes/model.md index d7768a905ce..e3e77a8e2e1 100644 --- a/docs/source/en/main_classes/model.md +++ b/docs/source/en/main_classes/model.md @@ -26,7 +26,6 @@ file or directory, or from a pretrained model configuration provided by the libr The other methods that are common to each model are defined in [`~modeling_utils.ModuleUtilsMixin`] and [`~generation.GenerationMixin`]. 
- ## PreTrainedModel [[autodoc]] PreTrainedModel diff --git a/docs/source/en/main_classes/onnx.md b/docs/source/en/main_classes/onnx.md index 81d31c97e88..5f8869948d2 100644 --- a/docs/source/en/main_classes/onnx.md +++ b/docs/source/en/main_classes/onnx.md @@ -51,4 +51,3 @@ to export models for different types of topologies or tasks. ### FeaturesManager [[autodoc]] onnx.features.FeaturesManager - diff --git a/docs/source/en/main_classes/optimizer_schedules.md b/docs/source/en/main_classes/optimizer_schedules.md index 84d9ca7b907..3bab249ab4e 100644 --- a/docs/source/en/main_classes/optimizer_schedules.md +++ b/docs/source/en/main_classes/optimizer_schedules.md @@ -22,7 +22,6 @@ The `.optimization` module provides: - several schedules in the form of schedule objects that inherit from `_LRSchedule`: - a gradient accumulation class to accumulate the gradients of multiple batches - ## AdaFactor [[autodoc]] Adafactor diff --git a/docs/source/en/main_classes/output.md b/docs/source/en/main_classes/output.md index 295f99e21d1..8a9ae879fb1 100644 --- a/docs/source/en/main_classes/output.md +++ b/docs/source/en/main_classes/output.md @@ -47,7 +47,6 @@ However, this is not always the case. Some models apply normalization or subsequ - You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get `None`. Here for instance `outputs.loss` is the loss computed by the model, and `outputs.attentions` is `None`. diff --git a/docs/source/en/main_classes/pipelines.md b/docs/source/en/main_classes/pipelines.md index 0e4cf55995b..31139ddf429 100644 --- a/docs/source/en/main_classes/pipelines.md +++ b/docs/source/en/main_classes/pipelines.md @@ -81,7 +81,6 @@ for out in tqdm(pipe(KeyDataset(dataset, "file"))): For ease of use, a generator is also possible: - ```python from transformers import pipeline @@ -196,7 +195,6 @@ This is a occasional very long sentence compared to the other. In that case, the tokens long, so the whole batch will be [64, 400] instead of [64, 4], leading to the high slowdown. Even worse, on bigger batches, the program simply crashes. - ``` ------------------------------ Streaming no batching @@ -245,7 +243,6 @@ multiple forward pass of a model. Under normal circumstances, this would yield i In order to circumvent this issue, both of these pipelines are a bit specific, they are `ChunkPipeline` instead of regular `Pipeline`. In short: - ```python preprocessed = pipe.preprocess(inputs) model_outputs = pipe.forward(preprocessed) @@ -254,7 +251,6 @@ outputs = pipe.postprocess(model_outputs) Now becomes: - ```python all_model_outputs = [] for preprocessed in pipe.preprocess(inputs): @@ -282,7 +278,6 @@ If you want to override a specific pipeline. Don't hesitate to create an issue for your task at hand, the goal of the pipeline is to be easy to use and support most cases, so `transformers` could maybe support your use case. - If you want to try simply you can: - Subclass your pipeline of choice @@ -302,7 +297,6 @@ my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline) That should enable you to do all the custom code you want. - ## Implementing a pipeline [Implementing a new pipeline](../add_new_pipeline) @@ -329,7 +323,6 @@ Pipelines available for audio tasks include the following. 
- __call__ - all - ### ZeroShotAudioClassificationPipeline [[autodoc]] ZeroShotAudioClassificationPipeline diff --git a/docs/source/en/main_classes/processors.md b/docs/source/en/main_classes/processors.md index 2c2e0cd31b7..8863a632628 100644 --- a/docs/source/en/main_classes/processors.md +++ b/docs/source/en/main_classes/processors.md @@ -71,7 +71,6 @@ Additionally, the following method can be used to load values from a data file a [[autodoc]] data.processors.glue.glue_convert_examples_to_features - ## XNLI [The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the @@ -88,7 +87,6 @@ Please note that since the gold labels are available on the test set, evaluation An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script. - ## SQuAD [The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that @@ -115,11 +113,9 @@ Additionally, the following method can be used to convert SQuAD examples into [[autodoc]] data.processors.squad.squad_convert_examples_to_features - These processors as well as the aforementioned method can be used with files containing the data as well as with the *tensorflow_datasets* package. Examples are given below. - ### Example usage Here is an example using the processors as well as the conversion method using data files: diff --git a/docs/source/en/main_classes/tokenizer.md b/docs/source/en/main_classes/tokenizer.md index 83d2ae5df6a..52c9751226d 100644 --- a/docs/source/en/main_classes/tokenizer.md +++ b/docs/source/en/main_classes/tokenizer.md @@ -22,7 +22,7 @@ Rust library [🤗 Tokenizers](https://github.com/huggingface/tokenizers). The " 1. a significant speed-up in particular when doing batched tokenization and 2. additional methods to map between the original string (character and words) and the token space (e.g. getting the - index of the token comprising a given character or the span of characters corresponding to a given token). + index of the token comprising a given character or the span of characters corresponding to a given token). The base classes [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`] implement the common methods for encoding string inputs in model inputs (see below) and instantiating/saving python and @@ -50,12 +50,11 @@ several advanced alignment methods which can be used to map between the original token space (e.g., getting the index of the token comprising a given character or the span of characters corresponding to a given token). - # Multimodal Tokenizer Apart from that each tokenizer can be a "multimodal" tokenizer which means that the tokenizer will hold all relevant special tokens as part of tokenizer attributes for easier access. For example, if the tokenizer is loaded from a vision-language model like LLaVA, you will -be able to access `tokenizer.image_token_id` to obtain the special image token used as a placeholder. +be able to access `tokenizer.image_token_id` to obtain the special image token used as a placeholder. To enable extra special tokens for any type of tokenizer, you have to add the following lines and save the tokenizer. Extra special tokens do not have to be modality related and can ne anything that the model often needs access to. 
In the below code, tokenizer at `output_dir` will have direct access diff --git a/docs/source/en/main_classes/video_processor.md b/docs/source/en/main_classes/video_processor.md index ee69030ab1a..29d29d0cb60 100644 --- a/docs/source/en/main_classes/video_processor.md +++ b/docs/source/en/main_classes/video_processor.md @@ -22,7 +22,6 @@ The video processor extends the functionality of image processors by allowing Vi When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM, the processor will try to load video related configurations from a file named `preprocessing_config.json`. - ### Usage Example Here's an example of how to load a video processor with [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model: @@ -59,7 +58,6 @@ The video processor can also sample video frames using the technique best suited - ```python from transformers import AutoVideoProcessor @@ -92,4 +90,3 @@ print(processed_video_inputs.pixel_values_videos.shape) ## BaseVideoProcessor [[autodoc]] video_processing_utils.BaseVideoProcessor - diff --git a/docs/source/en/model_doc/aimv2.md b/docs/source/en/model_doc/aimv2.md index 9d0abbaaf36..acf9c4de12f 100644 --- a/docs/source/en/model_doc/aimv2.md +++ b/docs/source/en/model_doc/aimv2.md @@ -25,7 +25,6 @@ The abstract from the paper is the following: *We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.* - This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali). The original code can be found [here](https://github.com/apple/ml-aim). diff --git a/docs/source/en/model_doc/aria.md b/docs/source/en/model_doc/aria.md index e5f4afa7b7a..ddd0815aaa5 100644 --- a/docs/source/en/model_doc/aria.md +++ b/docs/source/en/model_doc/aria.md @@ -98,7 +98,7 @@ print(response) Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. - + The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4 and the [rhymes-ai/Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) checkpoint. This checkpoint replaces grouped GEMM with `torch.nn.Linear` layers for easier quantization. 
```py @@ -142,7 +142,6 @@ response = processor.decode(output_ids, skip_special_tokens=True) print(response) ``` - ## AriaImageProcessor [[autodoc]] AriaImageProcessor diff --git a/docs/source/en/model_doc/audio-spectrogram-transformer.md b/docs/source/en/model_doc/audio-spectrogram-transformer.md index 40115810467..092bf3b26f3 100644 --- a/docs/source/en/model_doc/audio-spectrogram-transformer.md +++ b/docs/source/en/model_doc/audio-spectrogram-transformer.md @@ -52,13 +52,13 @@ the authors compute the stats for a downstream dataset. ### Using Scaled Dot Product Attention (SDPA) -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) +PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function +encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the +[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information. -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set +SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. ``` diff --git a/docs/source/en/model_doc/auto.md b/docs/source/en/model_doc/auto.md index 2f8cbc2009b..c1db5e2541a 100644 --- a/docs/source/en/model_doc/auto.md +++ b/docs/source/en/model_doc/auto.md @@ -23,7 +23,6 @@ automatically retrieve the relevant model given the name/path to the pretrained Instantiating one of [`AutoConfig`], [`AutoModel`], and [`AutoTokenizer`] will directly create a class of the relevant architecture. For instance - ```python model = AutoModel.from_pretrained("google-bert/bert-base-cased") ``` diff --git a/docs/source/en/model_doc/aya_vision.md b/docs/source/en/model_doc/aya_vision.md index 1f02b30344a..d0822173e89 100644 --- a/docs/source/en/model_doc/aya_vision.md +++ b/docs/source/en/model_doc/aya_vision.md @@ -29,7 +29,7 @@ You can find all the original Aya Vision checkpoints under the [Aya Vision](http > [!TIP] > This model was contributed by [saurabhdash](https://huggingface.co/saurabhdash) and [yonigozlan](https://huggingface.co/yonigozlan). -> +> > Click on the Aya Vision models in the right sidebar for more examples of how to apply Aya Vision to different image-to-text tasks. The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class. diff --git a/docs/source/en/model_doc/bark.md b/docs/source/en/model_doc/bark.md index a5787ab234e..6024b0e83ed 100644 --- a/docs/source/en/model_doc/bark.md +++ b/docs/source/en/model_doc/bark.md @@ -76,7 +76,7 @@ Note that 🤗 Optimum must be installed before using this feature. [Here's how Flash Attention 2 is an even faster, optimized version of the previous optimization. -##### Installation +##### Installation First, check whether your hardware is compatible with Flash Attention 2. 
The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [above](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer). @@ -86,7 +86,6 @@ Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-fe pip install -U flash-attn --no-build-isolation ``` - ##### Usage To load a model using Flash Attention 2, we can pass the `attn_implementation="flash_attention_2"` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference: @@ -97,7 +96,6 @@ model = BarkModel.from_pretrained("suno/bark-small", dtype=torch.float16, attn_i ##### Performance comparison - The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase:
@@ -108,7 +106,6 @@ To put this into perspective, on an NVIDIA A100 and when generating 400 semantic At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%. - #### Combining optimization techniques You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once. @@ -147,7 +144,7 @@ These presets are also uploaded in the hub [here](https://huggingface.co/suno/ba >>> audio_array = audio_array.cpu().numpy().squeeze() ``` -Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects. +Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects. ```python >>> # Multilingual speech - simplified Chinese @@ -165,7 +162,6 @@ Bark can generate highly realistic, **multilingual** speech as well as other aud The model can also produce **nonverbal communications** like laughing, sighing and crying. - ```python >>> # Adding non-speech cues to the input text >>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]") @@ -235,4 +231,3 @@ To save the audio, simply take the sample rate from the model config and some sc [[autodoc]] BarkSemanticConfig - all - diff --git a/docs/source/en/model_doc/bart.md b/docs/source/en/model_doc/bart.md index d1eeafb82b2..f81eaae98fb 100644 --- a/docs/source/en/model_doc/bart.md +++ b/docs/source/en/model_doc/bart.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2019-10-29 and added to Hugging Face Transformers on 2020-11-16.* -
PyTorch @@ -46,6 +45,7 @@ pipeline = pipeline( pipeline("Plants create through a process known as photosynthesis.") ``` + diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 43b6521f101..f7a100a4208 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -31,7 +31,6 @@ You can find all of the original BARThez checkpoints under the [BARThez](https:/ > This model was contributed by [moussakam](https://huggingface.co/moussakam). > Refer to the [BART](./bart) docs for more usage examples. - The example below demonstrates how to predict the `` token with [`Pipeline`], [`AutoModel`], and from the command line. diff --git a/docs/source/en/model_doc/bartpho.md b/docs/source/en/model_doc/bartpho.md index 9e86a1b615d..15e96c57669 100644 --- a/docs/source/en/model_doc/bartpho.md +++ b/docs/source/en/model_doc/bartpho.md @@ -33,12 +33,9 @@ You can find all the original checkpoints under the [VinAI](https://huggingface. The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class. - - - ```python import torch from transformers import pipeline @@ -98,8 +95,6 @@ transformers run --task summarization --model vinai/bartpho-word --device 0 - - ## Notes - BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes. diff --git a/docs/source/en/model_doc/bert-japanese.md b/docs/source/en/model_doc/bert-japanese.md index 812e5a455ad..6599efa73e0 100644 --- a/docs/source/en/model_doc/bert-japanese.md +++ b/docs/source/en/model_doc/bert-japanese.md @@ -81,7 +81,6 @@ API reference information. - ## BertJapaneseTokenizer [[autodoc]] BertJapaneseTokenizer diff --git a/docs/source/en/model_doc/bertweet.md b/docs/source/en/model_doc/bertweet.md index 6488e197d21..223932877c0 100644 --- a/docs/source/en/model_doc/bertweet.md +++ b/docs/source/en/model_doc/bertweet.md @@ -26,7 +26,6 @@ rendered properly in your Markdown viewer. [BERTweet](https://huggingface.co/papers/2005.10200) shares the same architecture as [BERT-base](./bert), but it’s pretrained like [RoBERTa](./roberta) on English Tweets. It performs really well on Tweet-related tasks like part-of-speech tagging, named entity recognition, and text classification. - You can find all the original BERTweet checkpoints under the [VinAI Research](https://huggingface.co/vinai?search_models=BERTweet) organization. > [!TIP] @@ -49,6 +48,7 @@ pipeline = pipeline( ) pipeline("Plants create through a process known as photosynthesis.") ``` + diff --git a/docs/source/en/model_doc/big_bird.md b/docs/source/en/model_doc/big_bird.md index 5e431c6883d..877445a4ba5 100644 --- a/docs/source/en/model_doc/big_bird.md +++ b/docs/source/en/model_doc/big_bird.md @@ -47,6 +47,7 @@ pipeline = pipeline( ) pipeline("Plants create [MASK] through a process known as photosynthesis.") ``` + @@ -81,6 +82,7 @@ print(f"The predicted token is: {predicted_token}") ```bash !echo -e "Plants create [MASK] through a process known as photosynthesis." 
| transformers run --task fill-mask --model google/bigbird-roberta-base --device 0 ``` + diff --git a/docs/source/en/model_doc/bigbird_pegasus.md b/docs/source/en/model_doc/bigbird_pegasus.md index fe3241ed7ab..cfc55e361e7 100644 --- a/docs/source/en/model_doc/bigbird_pegasus.md +++ b/docs/source/en/model_doc/bigbird_pegasus.md @@ -52,6 +52,7 @@ Through photosynthesis, plants capture energy from sunlight using a green pigmen These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure. This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle.""") ``` + @@ -77,6 +78,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = model.generate(**input_ids, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + diff --git a/docs/source/en/model_doc/biogpt.md b/docs/source/en/model_doc/biogpt.md index 4676a440c75..9a664fa288f 100644 --- a/docs/source/en/model_doc/biogpt.md +++ b/docs/source/en/model_doc/biogpt.md @@ -135,31 +135,26 @@ print(output) [[autodoc]] BioGptConfig - ## BioGptTokenizer [[autodoc]] BioGptTokenizer - save_vocabulary - ## BioGptModel [[autodoc]] BioGptModel - forward - ## BioGptForCausalLM [[autodoc]] BioGptForCausalLM - forward - ## BioGptForTokenClassification [[autodoc]] BioGptForTokenClassification - forward - ## BioGptForSequenceClassification [[autodoc]] BioGptForSequenceClassification diff --git a/docs/source/en/model_doc/bitnet.md b/docs/source/en/model_doc/bitnet.md index 6946ec65d43..69f9cb75131 100644 --- a/docs/source/en/model_doc/bitnet.md +++ b/docs/source/en/model_doc/bitnet.md @@ -35,10 +35,8 @@ Several versions of the model weights are available on Hugging Face: * [**`microsoft/bitnet-b1.58-2B-4T-gguf`**](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): Contains the model weights in GGUF format, compatible with the `bitnet.cpp` library for CPU inference. - ### Model Details - * **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework). * Uses Rotary Position Embeddings (RoPE). * Uses squared ReLU (ReLU²) activation in FFN layers. @@ -58,10 +56,8 @@ Several versions of the model weights are available on Hugging Face: 3. **Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs. * **Tokenizer:** LLaMA 3 Tokenizer (vocab size: 128,256). - ## Usage tips - **VERY IMPORTANT NOTE ON EFFICIENCY** > Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library. @@ -106,7 +102,6 @@ response = tokenizer.decode(chat_outputs[0][chat_input.shape[-1]:], skip_special print("\nAssistant Response:", response) ``` - ## BitNetConfig [[autodoc]] BitNetConfig diff --git a/docs/source/en/model_doc/blenderbot-small.md b/docs/source/en/model_doc/blenderbot-small.md index 1967013208b..830db710e03 100644 --- a/docs/source/en/model_doc/blenderbot-small.md +++ b/docs/source/en/model_doc/blenderbot-small.md @@ -55,7 +55,6 @@ found [here](https://github.com/facebookresearch/ParlAI). 
Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. - ## Resources - [Causal language modeling task guide](../tasks/language_modeling) diff --git a/docs/source/en/model_doc/blenderbot.md b/docs/source/en/model_doc/blenderbot.md index 99149c5d948..168c744235d 100644 --- a/docs/source/en/model_doc/blenderbot.md +++ b/docs/source/en/model_doc/blenderbot.md @@ -71,7 +71,6 @@ An example: `facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with [BlenderbotSmall](blenderbot-small). - ## Resources - [Causal language modeling task guide](../tasks/language_modeling) diff --git a/docs/source/en/model_doc/blip-2.md b/docs/source/en/model_doc/blip-2.md index fe4e939c2dc..faaaee7b084 100644 --- a/docs/source/en/model_doc/blip-2.md +++ b/docs/source/en/model_doc/blip-2.md @@ -26,14 +26,14 @@ rendered properly in your Markdown viewer. The BLIP-2 model was proposed in [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://huggingface.co/papers/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon [Flamingo](https://huggingface.co/papers/2204.14198), an 80 billion parameter model, by 8.7% -on zero-shot VQAv2 with 54x fewer trainable parameters. +on zero-shot VQAv2 with 54x fewer trainable parameters. The abstract from the paper is the following: *The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.* +alt="drawing" width="600"/> BLIP-2 architecture. Taken from the original paper. diff --git a/docs/source/en/model_doc/blip.md b/docs/source/en/model_doc/blip.md index 13a2a5731a5..5ef78728996 100644 --- a/docs/source/en/model_doc/blip.md +++ b/docs/source/en/model_doc/blip.md @@ -25,7 +25,6 @@ rendered properly in your Markdown viewer. [BLIP](https://huggingface.co/papers/2201.12086) (Bootstrapped Language-Image Pretraining) is a vision-language pretraining (VLP) framework designed for *both* understanding and generation tasks. Most existing pretrained models are only good at one or the other. It uses a captioner to generate captions and a filter to remove the noisy captions. This increases training data quality and more effectively uses the messy web data. 
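For a quick feel of the generation side described above, here is a minimal captioning sketch using the `image-to-text` pipeline; the `Salesforce/blip-image-captioning-base` checkpoint and the sample COCO image URL are illustrative assumptions, not something this page prescribes:

```python
from transformers import pipeline

# Minimal sketch: caption an image with a BLIP checkpoint
# (assumed: Salesforce/blip-image-captioning-base; any BLIP captioning model should work).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Any local path or URL to an image works here; this URL is only an example.
result = captioner("http://images.cocodataset.org/val2017/000000039769.jpg")
print(result[0]["generated_text"])
```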
- You can find all the original BLIP checkpoints under the [BLIP](https://huggingface.co/collections/Salesforce/blip-models-65242f40f1491fbf6a9e9472) collection. > [!TIP] diff --git a/docs/source/en/model_doc/bloom.md b/docs/source/en/model_doc/bloom.md index 805379338e3..c78cb4447eb 100644 --- a/docs/source/en/model_doc/bloom.md +++ b/docs/source/en/model_doc/bloom.md @@ -48,7 +48,6 @@ See also: - [Token classification task guide](../tasks/token_classification) - [Question answering task guide](../tasks/question_answering) - ⚡️ Inference - A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization). - A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts). diff --git a/docs/source/en/model_doc/blt.md b/docs/source/en/model_doc/blt.md index 0289f77ac90..7e9052bcdd2 100644 --- a/docs/source/en/model_doc/blt.md +++ b/docs/source/en/model_doc/blt.md @@ -83,7 +83,6 @@ print(tokenizer.decode(generated_ids[0])) This model was contributed by [itazap](https://huggingface.co/). The original code can be found [here](). - ## BltConfig [[autodoc]] BltConfig diff --git a/docs/source/en/model_doc/bridgetower.md b/docs/source/en/model_doc/bridgetower.md index 6a2b09e263a..861dd32c16f 100644 --- a/docs/source/en/model_doc/bridgetower.md +++ b/docs/source/en/model_doc/bridgetower.md @@ -26,7 +26,7 @@ rendered properly in your Markdown viewer. The BridgeTower model was proposed in [BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning](https://huggingface.co/papers/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder thus achieving remarkable performance on various downstream tasks with almost negligible additional performance and computational costs. -This paper has been accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference. +This paper has been accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference. The abstract from the paper is the following: @@ -54,6 +54,7 @@ The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImagePr encode the text and prepare the images respectively. The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`]. + ```python >>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning >>> import requests @@ -76,6 +77,7 @@ The following example shows how to run contrastive learning using [`BridgeTowerP ``` The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`]. + ```python >>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval >>> import requests @@ -130,7 +132,6 @@ Tips: - Please refer to [Table 5](https://huggingface.co/papers/2206.08657) for BridgeTower's performance on Image Retrieval and other down stream tasks. - The PyTorch version of this model is only available in torch 1.10 and higher. 
- ## BridgeTowerConfig [[autodoc]] BridgeTowerConfig @@ -177,4 +178,3 @@ Tips: [[autodoc]] BridgeTowerForImageAndTextRetrieval - forward - diff --git a/docs/source/en/model_doc/bros.md b/docs/source/en/model_doc/bros.md index aeb3dd76e52..4ef3d3737ae 100644 --- a/docs/source/en/model_doc/bros.md +++ b/docs/source/en/model_doc/bros.md @@ -57,7 +57,6 @@ def expand_and_normalize_bbox(bboxes, doc_width, doc_height): - [`~transformers.BrosForTokenClassification.forward`, `~transformers.BrosSpadeEEForTokenClassification.forward`, `~transformers.BrosSpadeEEForTokenClassification.forward`] require not only `input_ids` and `bbox` but also `box_first_token_mask` for loss calculation. It is a mask to filter out non-first tokens of each box. You can obtain this mask by saving start token indices of bounding boxes when creating `input_ids` from words. You can make `box_first_token_mask` with following code, - ```python def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512): @@ -102,7 +101,6 @@ def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512): [[autodoc]] BrosModel - forward - ## BrosForTokenClassification [[autodoc]] BrosForTokenClassification diff --git a/docs/source/en/model_doc/camembert.md b/docs/source/en/model_doc/camembert.md index ddce66f2ded..971954ed52a 100644 --- a/docs/source/en/model_doc/camembert.md +++ b/docs/source/en/model_doc/camembert.md @@ -50,6 +50,7 @@ from transformers import pipeline pipeline = pipeline("fill-mask", model="camembert-base", dtype=torch.float16, device=0) pipeline("Le camembert est un délicieux fromage .") ``` + @@ -72,6 +73,7 @@ predicted_token = tokenizer.decode(predicted_token_id) print(f"The predicted token is: {predicted_token}") ``` + @@ -84,7 +86,6 @@ echo -e "Le camembert est un délicieux fromage ." | transformers run --ta - Quantization reduces the memory burden of large models by representing weights in lower precision. Refer to the [Quantization](../quantization/overview) overview for available options. The example below uses [bitsandbytes](../quantization/bitsandbytes) quantization to quantize the weights to 8-bits. diff --git a/docs/source/en/model_doc/canine.md b/docs/source/en/model_doc/canine.md index 4e46e943c8e..53691dcbc22 100644 --- a/docs/source/en/model_doc/canine.md +++ b/docs/source/en/model_doc/canine.md @@ -86,6 +86,7 @@ echo -e "Plant create energy through a process known as photosynthesis." | trans inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."] encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt") ``` + - CANINE is primarily designed to be fine-tuned on a downstream task. The pretrained model can be used for either masked language modeling or next sentence prediction. ## CanineConfig diff --git a/docs/source/en/model_doc/chameleon.md b/docs/source/en/model_doc/chameleon.md index eb71349115e..dc573faa111 100644 --- a/docs/source/en/model_doc/chameleon.md +++ b/docs/source/en/model_doc/chameleon.md @@ -28,7 +28,6 @@ rendered properly in your Markdown viewer. The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models ](https://huggingface.co/papers/2405.09818) by META AI Chameleon Team. Chameleon is a Vision-Language Model that use vector quantization to tokenize images which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates textual response. Image generation module is not released yet. 
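As a rough sketch of that interleaved image-text usage (the `facebook/chameleon-7b` checkpoint, the `<image>` placeholder, and the sample image URL are assumptions for illustration), generation follows the usual processor-plus-`generate` pattern:

```python
import requests
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

# Assumed checkpoint; other released Chameleon checkpoints should follow the same pattern.
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b", dtype=torch.bfloat16, device_map="auto"
)

# The usage tips below advise left padding for batched generation.
processor.tokenizer.padding_side = "left"

# Example image; the <image> token marks where the image is interleaved into the text.
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
prompt = "What do you see in this image?<image>"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(output[0], skip_special_tokens=True))
```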
- The abstract from the paper is the following: *We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training @@ -43,7 +42,6 @@ including Gemini Pro and GPT-4V, according to human judgments on a new long-form generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents* - drawing @@ -52,7 +50,6 @@ alt="drawing" width="600"/> This model was contributed by [joaogante](https://huggingface.co/joaogante) and [RaushanTurganbay](https://huggingface.co/RaushanTurganbay). The original code can be found [here](https://github.com/facebookresearch/chameleon). - ## Usage tips - We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating. diff --git a/docs/source/en/model_doc/clipseg.md b/docs/source/en/model_doc/clipseg.md index e27d49ffe48..7ca9b3926ac 100644 --- a/docs/source/en/model_doc/clipseg.md +++ b/docs/source/en/model_doc/clipseg.md @@ -47,7 +47,7 @@ can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties* +alt="drawing" width="600"/> CLIPSeg overview. Taken from the original paper. diff --git a/docs/source/en/model_doc/clvp.md b/docs/source/en/model_doc/clvp.md index 926438a3c1f..eead4a54643 100644 --- a/docs/source/en/model_doc/clvp.md +++ b/docs/source/en/model_doc/clvp.md @@ -29,29 +29,25 @@ The abstract from the paper is the following: *In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.* - This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/neonbjb/tortoise-tts). - ## Usage tips 1. CLVP is an integral part of the Tortoise TTS model. 2. CLVP can be used to compare different generated speech candidates with the provided text, and the best speech tokens are forwarded to the diffusion model. 3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for tortoise usage. -4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz contrary to other audio models which expects 16 kHz. - +4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz contrary to other audio models which expects 16 kHz. ## Brief Explanation: - The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio. - [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio. - The [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates. 
-- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]) which converts them into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space. -- At the end, we compare each speech vector with the text vector to see which speech vector is most similar to the text vector. +- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]) which converts them into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space. +- At the end, we compare each speech vector with the text vector to see which speech vector is most similar to the text vector. - [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method. - Example : ```python @@ -74,7 +70,6 @@ Example : >>> generated_output = model.generate(**processor_output) ``` - ## ClvpConfig [[autodoc]] ClvpConfig @@ -128,4 +123,3 @@ Example : ## ClvpDecoder [[autodoc]] ClvpDecoder - diff --git a/docs/source/en/model_doc/code_llama.md b/docs/source/en/model_doc/code_llama.md index 60e9cb4c3cf..a46e1f05b32 100644 --- a/docs/source/en/model_doc/code_llama.md +++ b/docs/source/en/model_doc/code_llama.md @@ -143,6 +143,7 @@ visualizer("""def func(a, b): - Infilling is only available in the 7B and 13B base models, and not in the Python, Instruct, 34B, or 70B models. - Use the `` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself. + ```py from transformers import LlamaForCausalLM, CodeLlamaTokenizer @@ -158,6 +159,7 @@ visualizer("""def func(a, b): filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0] print(PROMPT.replace("", filling)) ``` + - Use `bfloat16` for further training or fine-tuning and `float16` for inference. - The `BOS` character is not used for infilling when encoding the prefix or suffix, but only at the beginning of each prompt. - The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of the word (for example, “Banana”), the tokenizer doesn’t prepend the prefix space to the string. diff --git a/docs/source/en/model_doc/codegen.md b/docs/source/en/model_doc/codegen.md index e5ad3863b67..c341154921e 100644 --- a/docs/source/en/model_doc/codegen.md +++ b/docs/source/en/model_doc/codegen.md @@ -29,7 +29,7 @@ CodeGen is an autoregressive language model for program synthesis trained sequen The abstract from the paper is the following: -*Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. 
We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI's Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: [this https URL](https://github.com/salesforce/codegen).* +*Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI's Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: [this https URL](https://github.com/salesforce/codegen).* This model was contributed by [Hiroaki Hayashi](https://huggingface.co/rooa). The original code can be found [here](https://github.com/salesforce/codegen). @@ -39,7 +39,7 @@ The original code can be found [here](https://github.com/salesforce/codegen). * CodeGen model [checkpoints](https://huggingface.co/models?other=codegen) are available on different pre-training data with variable sizes. * The format is: `Salesforce/codegen-{size}-{data}`, where * `size`: `350M`, `2B`, `6B`, `16B` - * `data`: + * `data`: * `nl`: Pre-trained on the Pile * `multi`: Initialized with `nl`, then further pre-trained on multiple programming languages data * `mono`: Initialized with `multi`, then further pre-trained on Python data diff --git a/docs/source/en/model_doc/cohere.md b/docs/source/en/model_doc/cohere.md index 9fc6d266d69..b8ccf20706a 100644 --- a/docs/source/en/model_doc/cohere.md +++ b/docs/source/en/model_doc/cohere.md @@ -22,14 +22,12 @@ rendered properly in your Markdown viewer.
- # Cohere Cohere [Command-R](https://cohere.com/blog/command-r) is a 35B parameter multilingual large language model designed for long context tasks like retrieval-augmented generation (RAG) and calling external APIs and tools. The model is specifically trained for grounded generation and supports both single-step and multi-step tool use. It supports a context length of 128K tokens. You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. - > [!TIP] > Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks. @@ -123,7 +121,6 @@ visualizer("Plants create energy through a process known as")
- ## Notes - Don’t use the dtype parameter in [`~AutoModel.from_pretrained`] if you’re using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to True if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast). @@ -145,7 +142,6 @@ visualizer("Plants create energy through a process known as") [[autodoc]] CohereModel - forward - ## CohereForCausalLM [[autodoc]] CohereForCausalLM diff --git a/docs/source/en/model_doc/cohere2.md b/docs/source/en/model_doc/cohere2.md index b1edcf8c851..ed94fef1da1 100644 --- a/docs/source/en/model_doc/cohere2.md +++ b/docs/source/en/model_doc/cohere2.md @@ -22,7 +22,6 @@ rendered properly in your Markdown viewer. - # Cohere 2 [Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B billion parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k. The model features three layers with sliding window attention and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence. @@ -31,7 +30,6 @@ This model is optimized for speed, cost-performance, and compute resources. You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. - > [!TIP] > Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks. @@ -136,7 +134,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) [[autodoc]] Cohere2Model - forward - ## Cohere2ForCausalLM [[autodoc]] Cohere2ForCausalLM diff --git a/docs/source/en/model_doc/cohere2_vision.md b/docs/source/en/model_doc/cohere2_vision.md index 2e12ff3e476..e466ce6a5f0 100644 --- a/docs/source/en/model_doc/cohere2_vision.md +++ b/docs/source/en/model_doc/cohere2_vision.md @@ -113,6 +113,7 @@ outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False) print(outputs) ``` + diff --git a/docs/source/en/model_doc/cpm.md b/docs/source/en/model_doc/cpm.md index ccfa1596bad..275f5629db1 100644 --- a/docs/source/en/model_doc/cpm.md +++ b/docs/source/en/model_doc/cpm.md @@ -42,7 +42,6 @@ NLP tasks in the settings of few-shot (even zero-shot) learning.* This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found here: https://github.com/TsinghuaAI/CPM-Generate - CPM's architecture is the same as GPT-2, except for tokenization method. Refer to [GPT-2 documentation](gpt2) for @@ -50,7 +49,6 @@ API reference information. - ## CpmTokenizer [[autodoc]] CpmTokenizer diff --git a/docs/source/en/model_doc/cpmant.md b/docs/source/en/model_doc/cpmant.md index 6f13f785ac1..47eec6e79d6 100644 --- a/docs/source/en/model_doc/cpmant.md +++ b/docs/source/en/model_doc/cpmant.md @@ -45,7 +45,7 @@ This model was contributed by [OpenBMB](https://huggingface.co/openbmb). 
The ori [[autodoc]] CpmAntModel - all - + ## CpmAntForCausalLM [[autodoc]] CpmAntForCausalLM diff --git a/docs/source/en/model_doc/csm.md b/docs/source/en/model_doc/csm.md index 1ee2b63dd71..16283247048 100644 --- a/docs/source/en/model_doc/csm.md +++ b/docs/source/en/model_doc/csm.md @@ -346,7 +346,6 @@ out.loss.backward() This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb). The original code can be found [here](https://github.com/SesameAILabs/csm). - ## CsmConfig [[autodoc]] CsmConfig diff --git a/docs/source/en/model_doc/ctrl.md b/docs/source/en/model_doc/ctrl.md index e5b48d638b6..6244ee0a59e 100644 --- a/docs/source/en/model_doc/ctrl.md +++ b/docs/source/en/model_doc/ctrl.md @@ -55,7 +55,6 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis pre-computed values in the context of text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward) method for more information on the usage of this argument. - ## Resources - [Text classification task guide](../tasks/sequence_classification) diff --git a/docs/source/en/model_doc/d_fine.md b/docs/source/en/model_doc/d_fine.md index 9dffde75ebc..05e855d333b 100644 --- a/docs/source/en/model_doc/d_fine.md +++ b/docs/source/en/model_doc/d_fine.md @@ -24,13 +24,13 @@ Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu The abstract from the paper is the following: -*We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). +*We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.* -This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber). +This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber). The original code can be found [here](https://github.com/Peterande/D-FINE). 
-## Usage tips +## Usage tips ```python >>> import torch diff --git a/docs/source/en/model_doc/dab-detr.md b/docs/source/en/model_doc/dab-detr.md index 32b27d4b247..d85988ec1f5 100644 --- a/docs/source/en/model_doc/dab-detr.md +++ b/docs/source/en/model_doc/dab-detr.md @@ -77,7 +77,9 @@ for result in results: box = [round(i, 2) for i in box.tolist()] print(f"{model.config.id2label[label]}: {score:.2f} {box}") ``` + This should output + ``` cat: 0.87 [14.7, 49.39, 320.52, 469.28] remote: 0.86 [41.08, 72.37, 173.39, 117.2] @@ -89,6 +91,7 @@ couch: 0.59 [-0.04, 1.34, 639.9, 477.09] There are three other ways to instantiate a DAB-DETR model (depending on what you prefer): Option 1: Instantiate DAB-DETR with pre-trained weights for entire model + ```py >>> from transformers import DabDetrForObjectDetection @@ -96,19 +99,21 @@ Option 1: Instantiate DAB-DETR with pre-trained weights for entire model ``` Option 2: Instantiate DAB-DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone + ```py >>> from transformers import DabDetrConfig, DabDetrForObjectDetection >>> config = DabDetrConfig() >>> model = DabDetrForObjectDetection(config) ``` + Option 3: Instantiate DAB-DETR with randomly initialized weights for backbone + Transformer + ```py >>> config = DabDetrConfig(use_pretrained_backbone=False) >>> model = DabDetrForObjectDetection(config) ``` - ## DabDetrConfig [[autodoc]] DabDetrConfig diff --git a/docs/source/en/model_doc/dac.md b/docs/source/en/model_doc/dac.md index e17cc69fc37..94f70fdff32 100644 --- a/docs/source/en/model_doc/dac.md +++ b/docs/source/en/model_doc/dac.md @@ -23,7 +23,6 @@ rendered properly in your Markdown viewer. ## Overview - The DAC model was proposed in [Descript Audio Codec: High-Fidelity Audio Compression with Improved RVQGAN](https://huggingface.co/papers/2306.06546) by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar. The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 KHz audio into tokens at just 8kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets. @@ -35,7 +34,6 @@ The abstract from the paper is the following: This model was contributed by [Kamil Akesbi](https://huggingface.co/kamilakesbi). The original code can be found [here](https://github.com/descriptinc/descript-audio-codec/tree/main?tab=readme-ov-file). - ## Model structure The Descript Audio Codec (DAC) model is structured into three distinct stages: @@ -44,11 +42,11 @@ The Descript Audio Codec (DAC) model is structured into three distinct stages: 2. Residual Vector Quantizer (RVQ) Model: Working in tandem with the encoder, this model quantizes the latent codes of the audio, refining the compression and ensuring high-quality reconstruction. 3. Decoder Model: This final stage reconstructs the audio from its compressed form, restoring it to a state that closely resembles the original input. 
-## Usage example +## Usage example -Here is a quick example of how to encode and decode an audio using this model: +Here is a quick example of how to encode and decode an audio using this model: -```python +```python >>> from datasets import load_dataset, Audio >>> from transformers import DacModel, AutoProcessor >>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") diff --git a/docs/source/en/model_doc/dbrx.md b/docs/source/en/model_doc/dbrx.md index 8b2e5ae75e3..a97e594e415 100644 --- a/docs/source/en/model_doc/dbrx.md +++ b/docs/source/en/model_doc/dbrx.md @@ -35,7 +35,6 @@ We estimate that this data is at least 2x better token-for-token than the data w This new dataset was developed using the full suite of Databricks tools, including Apache Spark™ and Databricks notebooks for data processing, and Unity Catalog for data management and governance. We used curriculum learning for pretraining, changing the data mix during training in ways we found to substantially improve model quality. - More detailed information about DBRX Instruct and DBRX Base can be found in our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm). This model was contributed by [eitan-turok](https://huggingface.co/eitanturok) and [abhi-db](https://huggingface.co/abhi-db). The original code can be found [here](https://github.com/databricks/dbrx-instruct), though this may not be up to date. @@ -65,6 +64,7 @@ print(tokenizer.decode(outputs[0])) ``` If you have flash-attention installed (`pip install flash-attn`), it is possible to generate faster. (The HuggingFace documentation for flash-attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2).) + ```python from transformers import DbrxForCausalLM, AutoTokenizer import torch @@ -87,6 +87,7 @@ print(tokenizer.decode(outputs[0])) ``` You can also generate faster using the PyTorch scaled dot product attention. (The HuggingFace documentation for scaled dot product attention can be found [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).) + ```python from transformers import DbrxForCausalLM, AutoTokenizer import torch @@ -112,15 +113,12 @@ print(tokenizer.decode(outputs[0])) [[autodoc]] DbrxConfig - ## DbrxModel [[autodoc]] DbrxModel - forward - ## DbrxForCausalLM [[autodoc]] DbrxForCausalLM - forward - diff --git a/docs/source/en/model_doc/deberta-v2.md b/docs/source/en/model_doc/deberta-v2.md index 7c92cd6cb9d..6ec0c0e5117 100644 --- a/docs/source/en/model_doc/deberta-v2.md +++ b/docs/source/en/model_doc/deberta-v2.md @@ -21,14 +21,12 @@ rendered properly in your Markdown viewer. - # DeBERTa-v2 [DeBERTa-v2](https://huggingface.co/papers/2006.03654) improves on the original [DeBERTa](./deberta) architecture by using a SentencePiece-based tokenizer and a new vocabulary size of 128K. It also adds an additional convolutional layer within the first transformer layer to better learn local dependencies of input tokens. Finally, the position projection and content projection matrices are shared in the attention layer to reduce the number of parameters. You can find all the original [DeBERTa-v2] checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta-v2) organization. - > [!TIP] > This model was contributed by [Pengcheng He](https://huggingface.co/DeBERTa). 
> @@ -86,6 +84,7 @@ print(f"Predicted label: {predicted_label}") ```bash echo -e "DeBERTa-v2 is great at understanding context!" | transformers run --task fill-mask --model microsoft/deberta-v2-xlarge-mnli --device 0 ``` + @@ -119,7 +118,6 @@ print(f"Predicted label: {predicted_label}") ``` - ## DebertaV2Config [[autodoc]] DebertaV2Config diff --git a/docs/source/en/model_doc/deberta.md b/docs/source/en/model_doc/deberta.md index 2d99bdbfd21..76fe8e1a3b6 100644 --- a/docs/source/en/model_doc/deberta.md +++ b/docs/source/en/model_doc/deberta.md @@ -31,7 +31,6 @@ Even with less training data than RoBERTa, DeBERTa manages to outperform it on s You can find all the original DeBERTa checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=deberta) organization. - > [!TIP] > Click on the DeBERTa models in the right sidebar for more examples of how to apply DeBERTa to different language tasks. diff --git a/docs/source/en/model_doc/decision_transformer.md b/docs/source/en/model_doc/decision_transformer.md index cdfcd42f9a3..349b8eaae2e 100644 --- a/docs/source/en/model_doc/decision_transformer.md +++ b/docs/source/en/model_doc/decision_transformer.md @@ -28,14 +28,14 @@ by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael La The abstract from the paper is the following: -*We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. +*We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances - in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that - casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or - compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked - Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our - Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, - Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on + in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that + casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or + compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked + Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our + Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, + Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.* This version of the model is for tasks where the state is a vector. @@ -46,7 +46,6 @@ This model was contributed by [edbeeching](https://huggingface.co/edbeeching). 
T [[autodoc]] DecisionTransformerConfig - ## DecisionTransformerGPT2Model [[autodoc]] DecisionTransformerGPT2Model diff --git a/docs/source/en/model_doc/deepseek_v3.md b/docs/source/en/model_doc/deepseek_v3.md index d8eb2e94203..81724e39943 100644 --- a/docs/source/en/model_doc/deepseek_v3.md +++ b/docs/source/en/model_doc/deepseek_v3.md @@ -26,17 +26,17 @@ We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 67 ## Limitations and call for contribution! -We are super happy to make this code community-powered, and would love to see how you can best optimize the following: +We are super happy to make this code community-powered, and would love to see how you can best optimize the following: - current implementation uses the "naive" attention compution (so not really MLA) -- current implementation loops through the experts. This should be replaced. Pointers to use `get_packed_weights` from `integrations/tensor_parallel`. +- current implementation loops through the experts. This should be replaced. Pointers to use `get_packed_weights` from `integrations/tensor_parallel`. - current implementation uses the eleuther formula for ROPE, using the original one would be more efficient! (should still follow our API) - static cache is not supported (this should be just a generation config issue / config shape issues) ### Usage tips The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages. -You can run the model in `FP8` automatically, using 2 nodes of 8 H100 should be more than enough! +You can run the model in `FP8` automatically, using 2 nodes of 8 H100 should be more than enough! ```python # `run_deepseek_v1.py` @@ -61,7 +61,8 @@ outputs = model.generate(inputs, max_new_tokens=50) print(tokenizer.batch_decode(outputs)) print(time.time()-start) ``` -This generated: + +This generated: `````` <|Assistant|> @@ -157,18 +158,20 @@ Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI `````` Use the following to run it + ```bash torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py ``` -If you have: +If you have: + ```bash [rank0]: ncclInternalError: Internal check failed. [rank0]: Last error: [rank0]: Bootstrap : no socket interface found ``` -error, it means NCCL was probably not loaded. +error, it means NCCL was probably not loaded. ## DeepseekV3Config diff --git a/docs/source/en/model_doc/deepseek_vl.md b/docs/source/en/model_doc/deepseek_vl.md index 58695db8348..710e6144bb0 100644 --- a/docs/source/en/model_doc/deepseek_vl.md +++ b/docs/source/en/model_doc/deepseek_vl.md @@ -63,6 +63,7 @@ messages = [ pipe(text=messages, max_new_tokens=20, return_full_text=False) ``` + @@ -115,6 +116,7 @@ output_text = processor.batch_decode( print(output_text) ``` + @@ -138,9 +140,11 @@ model = DeepseekVLForConditionalGeneration.from_pretrained( quantization_config=quantization_config ) ``` + ### Notes - Do inference with multiple images in a single conversation. 
+ ```py import torch from transformers import DeepseekVLForConditionalGeneration, AutoProcessor diff --git a/docs/source/en/model_doc/deepseek_vl_hybrid.md b/docs/source/en/model_doc/deepseek_vl_hybrid.md index d18ab7576ad..0613b50f1ad 100644 --- a/docs/source/en/model_doc/deepseek_vl_hybrid.md +++ b/docs/source/en/model_doc/deepseek_vl_hybrid.md @@ -62,6 +62,7 @@ messages = [ pipe(text=messages, max_new_tokens=20, return_full_text=False) ``` + @@ -114,6 +115,7 @@ output_text = processor.batch_decode( print(output_text) ``` + @@ -137,9 +139,11 @@ model = DeepseekVLHybridForConditionalGeneration.from_pretrained( quantization_config=quantization_config ) ``` + ### Notes - Do inference with multiple images in a single conversation. + ```py import torch from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor diff --git a/docs/source/en/model_doc/deplot.md b/docs/source/en/model_doc/deplot.md index 651ddcef7fe..0eb3975530a 100644 --- a/docs/source/en/model_doc/deplot.md +++ b/docs/source/en/model_doc/deplot.md @@ -21,7 +21,7 @@ rendered properly in your Markdown viewer. PyTorch -## Overview +## Overview DePlot was proposed in the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://huggingface.co/papers/2212.10505) from Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. @@ -36,8 +36,7 @@ DePlot is a Visual Question Answering subset of `Pix2Struct` architecture. It re Currently one checkpoint is available for DePlot: -- `google/deplot`: DePlot fine-tuned on ChartQA dataset - +- `google/deplot`: DePlot fine-tuned on ChartQA dataset ```python from transformers import AutoProcessor, Pix2StructForConditionalGeneration @@ -57,6 +56,7 @@ print(processor.decode(predictions[0], skip_special_tokens=True)) ## Fine-tuning To fine-tune DePlot, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faster convergence: + ```python from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup diff --git a/docs/source/en/model_doc/depth_pro.md b/docs/source/en/model_doc/depth_pro.md index 85423359ceb..6872fca5138 100644 --- a/docs/source/en/model_doc/depth_pro.md +++ b/docs/source/en/model_doc/depth_pro.md @@ -102,12 +102,14 @@ The network is supplemented with a focal length estimation head. A small convolu The `use_fov_model` parameter in `DepthProConfig` controls whether **FOV prediction** is enabled. By default, it is set to `False` to conserve memory and computation. When enabled, the **FOV encoder** is instantiated based on the `fov_model_config` parameter, which defaults to a `Dinov2Model`. The `use_fov_model` parameter can also be passed when initializing the `DepthProForDepthEstimation` model. The pretrained model at checkpoint `apple/DepthPro-hf` uses the FOV encoder. To use the pretrained-model without FOV encoder, set `use_fov_model=False` when loading the model, which saves computation. + ```py >>> from transformers import DepthProForDepthEstimation >>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False) ``` To instantiate a new model with FOV encoder, set `use_fov_model=True` in the config. 
+ ```py >>> from transformers import DepthProConfig, DepthProForDepthEstimation >>> config = DepthProConfig(use_fov_model=True) @@ -115,6 +117,7 @@ To instantiate a new model with FOV encoder, set `use_fov_model=True` in the con ``` Or set `use_fov_model=True` when initializing the model, which overrides the value in config. + ```py >>> from transformers import DepthProConfig, DepthProForDepthEstimation >>> config = DepthProConfig() @@ -123,13 +126,13 @@ Or set `use_fov_model=True` when initializing the model, which overrides the val ### Using Scaled Dot Product Attention (SDPA) -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) +PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function +encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the +[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information. -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set +SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. ```py diff --git a/docs/source/en/model_doc/detr.md b/docs/source/en/model_doc/detr.md index 425ab0f04c5..6d7792803c5 100644 --- a/docs/source/en/model_doc/detr.md +++ b/docs/source/en/model_doc/detr.md @@ -113,6 +113,7 @@ DETR can be naturally extended to perform panoptic segmentation (which unifies s There are three other ways to instantiate a DETR model (depending on what you prefer): - Option 1: Instantiate DETR with pre-trained weights for entire model + ```python from transformers import DetrForObjectDetection @@ -120,6 +121,7 @@ model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50") ``` - Option 2: Instantiate DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone + ```python from transformers import DetrConfig, DetrForObjectDetection @@ -128,6 +130,7 @@ model = DetrForObjectDetection(config) ``` - Option 3: Instantiate DETR with randomly initialized weights for backbone + Transformer + ```python config = DetrConfig(use_pretrained_backbone=False) model = DetrForObjectDetection(config) @@ -144,7 +147,7 @@ As a summary, consider the following table: | **Postprocessing** (i.e. 
converting the output of the model to Pascal VOC format) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] | | **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"`, `PanopticEvaluator` | -- In short, one should prepare the data either in COCO detection or COCO panoptic format, then use [`~transformers.DetrImageProcessor`] to create `pixel_values`, `pixel_mask` and optional `labels`, which can then be used to train (or fine-tune) a model. +- In short, one should prepare the data either in COCO detection or COCO panoptic format, then use [`~transformers.DetrImageProcessor`] to create `pixel_values`, `pixel_mask` and optional `labels`, which can then be used to train (or fine-tune) a model. - For evaluation, one should first convert the outputs of the model using one of the postprocessing methods of [`~transformers.DetrImageProcessor`]. These can be provided to either `CocoEvaluator` or `PanopticEvaluator`, which allow you to calculate metrics like mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the [original repository](https://github.com/facebookresearch/detr). See the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for more info regarding evaluation. ## Resources diff --git a/docs/source/en/model_doc/dia.md b/docs/source/en/model_doc/dia.md index 1a07e8831ee..bab0cb4a72d 100644 --- a/docs/source/en/model_doc/dia.md +++ b/docs/source/en/model_doc/dia.md @@ -117,11 +117,9 @@ out = model(**inputs) out.loss.backward() ``` - This model was contributed by [Jaeyong Sung](https://huggingface.co/buttercrab), [Arthur Zucker](https://huggingface.co/ArthurZ), and [Anton Vlasjuk](https://huggingface.co/AntonV). The original code can be found [here](https://github.com/nari-labs/dia/). - ## DiaConfig [[autodoc]] DiaConfig diff --git a/docs/source/en/model_doc/diffllama.md b/docs/source/en/model_doc/diffllama.md index 406bae43c5f..79b8314d0ae 100644 --- a/docs/source/en/model_doc/diffllama.md +++ b/docs/source/en/model_doc/diffllama.md @@ -35,7 +35,6 @@ The abstract from the paper is the following: ### Usage tips The hyperparameters of this model are the same as those of the Llama model. - ## DiffLlamaConfig [[autodoc]] DiffLlamaConfig diff --git a/docs/source/en/model_doc/dinov2.md b/docs/source/en/model_doc/dinov2.md index 59256756acf..0968641326a 100644 --- a/docs/source/en/model_doc/dinov2.md +++ b/docs/source/en/model_doc/dinov2.md @@ -19,7 +19,6 @@ specific language governing permissions and limitations under the License. - # DINOv2 [DINOv2](https://huggingface.co/papers/2304.07193) is a vision foundation model that uses [ViT](./vit) as a feature extractor for multiple downstream tasks like image classification and depth estimation. It focuses on stabilizing and accelerating training through techniques like a faster memory-efficient attention, sequence packing, improved stochastic depth, Fully Sharded Data Parallel (FSDP), and model distillation.
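To make the feature-extractor role concrete, here is a minimal sketch of pulling DINOv2 features through the generic `AutoModel` API. It assumes the `facebook/dinov2-base` checkpoint and a sample COCO image URL; any DINOv2 checkpoint can be substituted.

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Load a DINOv2 backbone together with its matching image processor
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

# Any RGB image works; this just fetches a sample picture
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features plus a pooled embedding that downstream heads
# (classification, depth estimation, ...) typically consume
patch_features = outputs.last_hidden_state  # (batch, 1 + num_patches, hidden_size)
pooled_features = outputs.pooler_output     # (batch, hidden_size)
print(patch_features.shape, pooled_features.shape)
```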
diff --git a/docs/source/en/model_doc/dinov2_with_registers.md b/docs/source/en/model_doc/dinov2_with_registers.md index f89de76d216..fcafc6df306 100644 --- a/docs/source/en/model_doc/dinov2_with_registers.md +++ b/docs/source/en/model_doc/dinov2_with_registers.md @@ -45,7 +45,6 @@ Tips: This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/dinov2). - ## Dinov2WithRegistersConfig [[autodoc]] Dinov2WithRegistersConfig diff --git a/docs/source/en/model_doc/dinov3.md b/docs/source/en/model_doc/dinov3.md index a11a8fd10cc..94e53165156 100644 --- a/docs/source/en/model_doc/dinov3.md +++ b/docs/source/en/model_doc/dinov3.md @@ -19,7 +19,6 @@ specific language governing permissions and limitations under the License. - # DINOv3 [DINOv3](https://huggingface.co/papers/2508.10104) is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. diff --git a/docs/source/en/model_doc/dit.md b/docs/source/en/model_doc/dit.md index 3027905fe38..574ffe3ef11 100644 --- a/docs/source/en/model_doc/dit.md +++ b/docs/source/en/model_doc/dit.md @@ -85,6 +85,7 @@ print(f"The predicted class label is: {predicted_class_label}") ## Notes - The pretrained DiT weights can be loaded in a [BEiT] model with a modeling head to predict visual tokens. + ```py from transformers import BeitForMaskedImageModeling diff --git a/docs/source/en/model_doc/doge.md b/docs/source/en/model_doc/doge.md index 6221940d5d5..ffa9ced7913 100644 --- a/docs/source/en/model_doc/doge.md +++ b/docs/source/en/model_doc/doge.md @@ -17,7 +17,6 @@ rendered properly in your Markdown viewer. # Doge - ## Overview Doge is a series of small language models based on the [Doge](https://github.com/SmallDoges/small-doge) architecture, aiming to combine the advantages of state-space and self-attention algorithms, calculate dynamic masks from cached value states using the zero-order hold method, and solve the problem of existing mainstream language models getting lost in context. It uses the `wsd_scheduler` scheduler to pre-train on the `smollm-corpus`, and can continue training on new datasets or add sparse activation feedforward networks from stable stage checkpoints. @@ -28,7 +27,6 @@ As shown in the figure below, the sequence transformation part of the Doge archi Checkout all Doge model checkpoints [here](https://huggingface.co/collections/SmallDoge/doge-slm-679cc991f027c4a3abbded4a). - ## Usage
@@ -44,6 +42,7 @@ inputs = tokenizer("Hey how are you doing?", return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.batch_decode(outputs)) ``` +
@@ -82,6 +81,7 @@ outputs = model.generate( streamer=steamer ) ``` +
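The streaming call above relies on a `steamer` object defined earlier in the full example. A minimal, self-contained setup could look like the sketch below; the `SmallDoge/Doge-160M-Instruct` checkpoint name is an assumption, so substitute the Doge checkpoint you are actually using.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Checkpoint name is an assumption; use the Doge checkpoint from your own setup
checkpoint = "SmallDoge/Doge-160M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# TextStreamer prints tokens to stdout as soon as they are generated;
# the variable name mirrors the `steamer` used in the example above
steamer = TextStreamer(tokenizer, skip_prompt=True)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, streamer=steamer)
```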
## DogeConfig diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md index f06b6804d6e..e582dab748a 100644 --- a/docs/source/en/model_doc/donut.md +++ b/docs/source/en/model_doc/donut.md @@ -22,7 +22,7 @@ specific language governing permissions and limitations under the License. --> # Donut -[Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies making it more accurate and adaptable to diverse languages and formats. +[Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies making it more accurate and adaptable to diverse languages and formats. Donut features vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart)). Swin converts document images into embeddings and BART processes them into meaningful text sequences. diff --git a/docs/source/en/model_doc/dots1.md b/docs/source/en/model_doc/dots1.md index 337cad8cb4c..316ab3b1f5b 100644 --- a/docs/source/en/model_doc/dots1.md +++ b/docs/source/en/model_doc/dots1.md @@ -25,7 +25,6 @@ The abstract from the report is the following: *Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.* - ## Dots1Config [[autodoc]] Dots1Config diff --git a/docs/source/en/model_doc/efficientloftr.md b/docs/source/en/model_doc/efficientloftr.md index 2cdec895efc..faf71f4bac0 100644 --- a/docs/source/en/model_doc/efficientloftr.md +++ b/docs/source/en/model_doc/efficientloftr.md @@ -45,6 +45,7 @@ results = keypoint_matcher([url_0, url_1], threshold=0.9) print(results[0]) # {'keypoint_image_0': {'x': ..., 'y': ...}, 'keypoint_image_1': {'x': ..., 'y': ...}, 'score': ...} ``` + @@ -167,4 +168,3 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size [[autodoc]] EfficientLoFTRForKeypointMatching - forward - diff --git a/docs/source/en/model_doc/efficientnet.md b/docs/source/en/model_doc/efficientnet.md index 859923126a9..b4fbe822562 100644 --- a/docs/source/en/model_doc/efficientnet.md +++ b/docs/source/en/model_doc/efficientnet.md @@ -23,7 +23,7 @@ rendered properly in your Markdown viewer. 
## Overview -The EfficientNet model was proposed in [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://huggingface.co/papers/1905.11946) +The EfficientNet model was proposed in [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://huggingface.co/papers/1905.11946) by Mingxing Tan and Quoc V. Le. EfficientNets are a family of image classification models, which achieve state-of-the-art accuracy, yet being an order-of-magnitude smaller and faster than previous models. The abstract from the paper is the following: @@ -34,7 +34,6 @@ To go even further, we use neural architecture search to design a new baseline n This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). - ## EfficientNetConfig [[autodoc]] EfficientNetConfig @@ -58,4 +57,3 @@ The original code can be found [here](https://github.com/tensorflow/tpu/tree/mas [[autodoc]] EfficientNetForImageClassification - forward - diff --git a/docs/source/en/model_doc/emu3.md b/docs/source/en/model_doc/emu3.md index 799de2f0c5c..0c95bc6d987 100644 --- a/docs/source/en/model_doc/emu3.md +++ b/docs/source/en/model_doc/emu3.md @@ -27,8 +27,7 @@ rendered properly in your Markdown viewer. The Emu3 model was proposed in [Emu3: Next-Token Prediction is All You Need](https://huggingface.co/papers/2409.18869) by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang. -Emu3 is a multimodal LLM that uses vector quantization to tokenize images into discrete tokens. Discretized image tokens are later fused with text token ids for image and text generation. The model can additionally generate images by predicting image token ids. - +Emu3 is a multimodal LLM that uses vector quantization to tokenize images into discrete tokens. Discretized image tokens are later fused with text token ids for image and text generation. The model can additionally generate images by predicting image token ids. The abstract from the paper is the following: @@ -45,11 +44,9 @@ Tips: > [!TIP] > Emu3 implementation in Transformers uses a special image token to indicate where to merge image embeddings. The special image token isn't new and uses one of the reserved tokens: `<|extra_0|>`. You have to add `` to your prompt in the place where the image should be embedded for correct generation. - This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay). The original code can be found [here](https://github.com/baaivision/Emu3). - ## Usage example ### Text generation inference @@ -143,7 +140,6 @@ for i, image in enumerate(images['pixel_values']): ``` - ## Emu3Config [[autodoc]] Emu3Config diff --git a/docs/source/en/model_doc/encodec.md b/docs/source/en/model_doc/encodec.md index 89099173039..9fc6c2c97e9 100644 --- a/docs/source/en/model_doc/encodec.md +++ b/docs/source/en/model_doc/encodec.md @@ -29,14 +29,14 @@ The abstract from the paper is the following: *We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. 
We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio.* -This model was contributed by [Matthijs](https://huggingface.co/Matthijs), [Patrick Von Platen](https://huggingface.co/patrickvonplaten) and [Arthur Zucker](https://huggingface.co/ArthurZ). +This model was contributed by [Matthijs](https://huggingface.co/Matthijs), [Patrick Von Platen](https://huggingface.co/patrickvonplaten) and [Arthur Zucker](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/facebookresearch/encodec). -## Usage example +## Usage example Here is a quick example of how to encode and decode an audio using this model: -```python +```python >>> from datasets import load_dataset, Audio >>> from transformers import EncodecModel, AutoProcessor >>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") diff --git a/docs/source/en/model_doc/eomt.md b/docs/source/en/model_doc/eomt.md index 754b88e2c33..199d87dc794 100644 --- a/docs/source/en/model_doc/eomt.md +++ b/docs/source/en/model_doc/eomt.md @@ -39,7 +39,6 @@ Architecturally, EoMT introduces a small set of **learned queries** and a lightw alt="drawing" width="500"/> - The model supports semantic, instance, and panoptic segmentation using a unified architecture and task-specific post-processing. ## Usage Examples diff --git a/docs/source/en/model_doc/ernie4_5.md b/docs/source/en/model_doc/ernie4_5.md index e48073bbe6c..bf71049148d 100644 --- a/docs/source/en/model_doc/ernie4_5.md +++ b/docs/source/en/model_doc/ernie4_5.md @@ -38,7 +38,6 @@ Other models from the family can be found at [Ernie 4.5 Moe](./ernie4_5_moe). - ## Usage Tips ### Generate text @@ -84,7 +83,6 @@ generate_text = tokenizer.decode(output_ids, skip_special_tokens=True) This model was contributed by [Anton Vlasjuk](https://huggingface.co/AntonV). The original code can be found [here](https://github.com/PaddlePaddle/ERNIE). - ## Ernie4_5Config [[autodoc]] Ernie4_5Config diff --git a/docs/source/en/model_doc/ernie4_5_moe.md b/docs/source/en/model_doc/ernie4_5_moe.md index 20c4dcfd543..fb6b8d791be 100644 --- a/docs/source/en/model_doc/ernie4_5_moe.md +++ b/docs/source/en/model_doc/ernie4_5_moe.md @@ -40,7 +40,6 @@ Other models from the family can be found at [Ernie 4.5](./ernie4_5). - ## Usage Tips ### Generate text @@ -167,7 +166,6 @@ generate_text = tokenizer.decode(output_ids, skip_special_tokens=True) This model was contributed by [Anton Vlasjuk](https://huggingface.co/AntonV). 
The original code can be found [here](https://github.com/PaddlePaddle/ERNIE). - ## Ernie4_5_MoeConfig [[autodoc]] Ernie4_5_MoeConfig diff --git a/docs/source/en/model_doc/ernie_m.md b/docs/source/en/model_doc/ernie_m.md index 508fe2f596b..e044614e764 100644 --- a/docs/source/en/model_doc/ernie_m.md +++ b/docs/source/en/model_doc/ernie_m.md @@ -40,7 +40,6 @@ The abstract from the paper is the following: *Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for lowresource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.* This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie_m). - ## Usage tips - Ernie-M is a BERT-like model so it is a stacked Transformer Encoder. @@ -59,7 +58,6 @@ This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). Th [[autodoc]] ErnieMConfig - ## ErnieMTokenizer [[autodoc]] ErnieMTokenizer @@ -68,7 +66,6 @@ This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). Th - create_token_type_ids_from_sequences - save_vocabulary - ## ErnieMModel [[autodoc]] ErnieMModel @@ -79,19 +76,16 @@ This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). Th [[autodoc]] ErnieMForSequenceClassification - forward - ## ErnieMForMultipleChoice [[autodoc]] ErnieMForMultipleChoice - forward - ## ErnieMForTokenClassification [[autodoc]] ErnieMForTokenClassification - forward - ## ErnieMForQuestionAnswering [[autodoc]] ErnieMForQuestionAnswering diff --git a/docs/source/en/model_doc/esm.md b/docs/source/en/model_doc/esm.md index e83e2d5aa1d..a6190a71f02 100644 --- a/docs/source/en/model_doc/esm.md +++ b/docs/source/en/model_doc/esm.md @@ -44,12 +44,10 @@ sequence alignment (MSA) step at inference time, which means that ESMFold checkp they do not require a database of known protein sequences and structures with associated external query tools to make predictions, and are much faster as a result. - The abstract from "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" is - *In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. 
Protein language modeling @@ -63,7 +61,6 @@ can be identified by linear projections. Representation learning produces featur applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.* - The abstract from "Language models of protein sequences at the scale of evolution enable accurate structure prediction" is diff --git a/docs/source/en/model_doc/evolla.md b/docs/source/en/model_doc/evolla.md index a39103a06d1..56f1d2755e1 100644 --- a/docs/source/en/model_doc/evolla.md +++ b/docs/source/en/model_doc/evolla.md @@ -75,7 +75,6 @@ Tips: - This model was contributed by [Xibin Bayes Zhou](https://huggingface.co/XibinBayesZhou). - The original code can be found [here](https://github.com/westlake-repl/Evolla). - ## EvollaConfig [[autodoc]] EvollaConfig diff --git a/docs/source/en/model_doc/exaone4.md b/docs/source/en/model_doc/exaone4.md index 69d7ee0b2a8..93ca33babd3 100644 --- a/docs/source/en/model_doc/exaone4.md +++ b/docs/source/en/model_doc/exaone4.md @@ -20,7 +20,7 @@ rendered properly in your Markdown viewer. ## Overview **[EXAONE 4.0](https://github.com/LG-AI-EXAONE/EXAONE-4.0)** model is the language model, which integrates a **Non-reasoning mode** and **Reasoning mode** to achieve both the excellent usability of [EXAONE 3.5](https://github.com/LG-AI-EXAONE/EXAONE-3.5) and the advanced reasoning abilities of [EXAONE Deep](https://github.com/LG-AI-EXAONE/EXAONE-Deep). To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended -to support Spanish in addition to English and Korean. +to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size **32B** model optimized for high performance, and a small-size **1.2B** model designed for on-device applications. @@ -33,7 +33,6 @@ For more details, please refer to our [technical report](https://huggingface.co/ All model weights including quantized versions are available at [Huggingface Collections](https://huggingface.co/collections/LGAI-EXAONE/exaone-40-686b2e0069800c835ed48375). - ## Model Details ### Model Specifications @@ -57,7 +56,6 @@ All model weights including quantized versions are available at [Huggingface Col | Tied word embedding | False | True | | Knowledge cut-off | Nov. 2024 | Nov. 2024 | - ## Usage tips ### Non-reasoning mode diff --git a/docs/source/en/model_doc/falcon_h1.md b/docs/source/en/model_doc/falcon_h1.md index 981c00bd626..c17ecea1cc0 100644 --- a/docs/source/en/model_doc/falcon_h1.md +++ b/docs/source/en/model_doc/falcon_h1.md @@ -21,7 +21,6 @@ The [FalconH1](https://huggingface.co/blog/tiiuae/falcon-h1) model was developed This model was contributed by [DhiyaEddine](https://huggingface.co/DhiyaEddine), [ybelkada](https://huggingface.co/ybelkada), [JingweiZuo](https://huggingface.co/JingweiZuo), [IlyasChahed](https://huggingface.co/IChahed), and [MaksimVelikanov](https://huggingface.co/yellowvm). The original code can be found [here](https://github.com/tiiuae/Falcon-H1). - ## FalconH1Config | Model | Depth | Dim | Attn Heads | KV | Mamba Heads | d_head | d_state | Ctx Len | @@ -33,8 +32,6 @@ The original code can be found [here](https://github.com/tiiuae/Falcon-H1). 
| H1 7B | 44 | 3072 | 12 | 2 | 24 | 128 / 128 | 256 | 256K | | H1 34B | 72 | 5120 | 20 | 4 | 32 | 128 / 128 | 256 | 256K | - - [[autodoc]] FalconH1Config @@ -90,6 +89,7 @@ echo -e "Plants create energy through a process known as" | transformers run --t Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. The example below uses [torchao](../quantization/torchao) to only quantize the weights to 4-bits. + ```py #pip install torchao @@ -119,7 +119,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` - ## FlexOlmoConfig [[autodoc]] FlexOlmoConfig diff --git a/docs/source/en/model_doc/fnet.md b/docs/source/en/model_doc/fnet.md index 79a4e9e4434..e89a410b105 100644 --- a/docs/source/en/model_doc/fnet.md +++ b/docs/source/en/model_doc/fnet.md @@ -46,8 +46,8 @@ This model was contributed by [gchhablani](https://huggingface.co/gchhablani). T ## Usage tips -The model was trained without an attention mask as it is based on Fourier Transform. The model was trained with -maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum +The model was trained without an attention mask as it is based on Fourier Transform. The model was trained with +maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum sequence length for fine-tuning and inference. ## Resources diff --git a/docs/source/en/model_doc/fsmt.md b/docs/source/en/model_doc/fsmt.md index 27c7d3a899c..13a99ae40da 100644 --- a/docs/source/en/model_doc/fsmt.md +++ b/docs/source/en/model_doc/fsmt.md @@ -41,7 +41,6 @@ This model was contributed by [stas](https://huggingface.co/stas). The original either. Its tokenizer is very similar to [`XLMTokenizer`] and the main model is derived from [`BartModel`]. - ## FSMTConfig [[autodoc]] FSMTConfig diff --git a/docs/source/en/model_doc/funnel.md b/docs/source/en/model_doc/funnel.md index 611e17fba8c..57b011b9400 100644 --- a/docs/source/en/model_doc/funnel.md +++ b/docs/source/en/model_doc/funnel.md @@ -67,7 +67,6 @@ This model was contributed by [sgugger](https://huggingface.co/sgugger). The ori - [Masked language modeling task guide](../tasks/masked_language_modeling) - [Multiple choice task guide](../tasks/multiple_choice) - ## FunnelConfig [[autodoc]] FunnelConfig diff --git a/docs/source/en/model_doc/fuyu.md b/docs/source/en/model_doc/fuyu.md index 140216e2abc..34202b022f7 100644 --- a/docs/source/en/model_doc/fuyu.md +++ b/docs/source/en/model_doc/fuyu.md @@ -40,7 +40,6 @@ Finetuning the model in `float16` is not recommended and known to produce `nan`, - Tips: - To convert the model, you need to clone the original repository using `git clone https://github.com/persimmon-ai-labs/adept-inference`, then get the checkpoints: @@ -55,10 +54,12 @@ python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir / ``` For the chat model: + ```bash wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar tar -xvf 8b_base_model_release.tar ``` + Then, model can be loaded via: ```py @@ -99,7 +100,6 @@ The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece. 
- The authors suggest to use the following prompt for image captioning: `f"Generate a coco-style caption.\\n"` - ## FuyuConfig [[autodoc]] FuyuConfig diff --git a/docs/source/en/model_doc/gemma.md b/docs/source/en/model_doc/gemma.md index d22d28d41c4..f1c088caf30 100644 --- a/docs/source/en/model_doc/gemma.md +++ b/docs/source/en/model_doc/gemma.md @@ -33,7 +33,6 @@ The instruction-tuned variant was fine-tuned with supervised learning on instruc You can find all the original Gemma checkpoints under the [Gemma](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b) release. - > [!TIP] > Click on the Gemma models in the right sidebar for more examples of how to apply Gemma to different language tasks. @@ -163,7 +162,6 @@ visualizer("LLMs generate text through a process known as") [[autodoc]] GemmaTokenizer - ## GemmaTokenizerFast [[autodoc]] GemmaTokenizerFast diff --git a/docs/source/en/model_doc/gemma2.md b/docs/source/en/model_doc/gemma2.md index 680de41d038..5b4430296dc 100644 --- a/docs/source/en/model_doc/gemma2.md +++ b/docs/source/en/model_doc/gemma2.md @@ -40,7 +40,6 @@ The example below demonstrates how to chat with the model with [`Pipeline`] or t - ```python import torch from transformers import pipeline @@ -84,6 +83,7 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` echo -e "Explain quantum computing simply." | transformers run --task text-generation --model google/gemma-2-2b --device 0 ``` + @@ -113,7 +113,6 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True)) Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to. - ```python from transformers.utils.attention_visualizer import AttentionMaskVisualizer visualizer = AttentionMaskVisualizer("google/gemma-2b") diff --git a/docs/source/en/model_doc/gemma3.md b/docs/source/en/model_doc/gemma3.md index c14b79080fc..3c69cc1604f 100644 --- a/docs/source/en/model_doc/gemma3.md +++ b/docs/source/en/model_doc/gemma3.md @@ -195,6 +195,7 @@ visualizer("What is shown in this image?") }, ] ``` + - Text passed to the processor should have a `` token wherever an image should be inserted. - The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs. - By default, images aren't cropped and only the base image is forwarded to the model. In high resolution images or images with non-square aspect ratios, artifacts can result because the vision encoder uses a fixed resolution of 896x896. To prevent these artifacts and improve performance during inference, set `do_pan_and_scan=True` to crop the image into multiple smaller patches and concatenate them with the base image embedding. You can disable pan and scan for faster inference. @@ -209,6 +210,7 @@ visualizer("What is shown in this image?") + do_pan_and_scan=True, ).to(model.device) ``` + - For Gemma-3 1B checkpoint trained in text-only mode, use [`AutoModelForCausalLM`] instead. ```py diff --git a/docs/source/en/model_doc/gemma3n.md b/docs/source/en/model_doc/gemma3n.md index b43379cf3fd..7c2e3ecc926 100644 --- a/docs/source/en/model_doc/gemma3n.md +++ b/docs/source/en/model_doc/gemma3n.md @@ -147,6 +147,7 @@ echo -e "Plants create energy through a process known as" | transformers run --t }, ] ``` + - Text passed to the processor should have a `` token wherever an image should be inserted. 
- Gemma 3n accept at most one target audio clip per input, though multiple audio clips can be provided in few-shot prompts, for example. diff --git a/docs/source/en/model_doc/glm.md b/docs/source/en/model_doc/glm.md index ca50c32da21..87daea7289a 100644 --- a/docs/source/en/model_doc/glm.md +++ b/docs/source/en/model_doc/glm.md @@ -53,7 +53,6 @@ Tips: - This model was contributed by [THUDM](https://huggingface.co/THUDM). The most recent code can be found [here](https://github.com/thudm/GLM-4). - ## Usage tips `GLM-4` can be found on the [Huggingface Hub](https://huggingface.co/collections/THUDM/glm-4-665fcf188c414b03c2f7e3b7) diff --git a/docs/source/en/model_doc/glm4v.md b/docs/source/en/model_doc/glm4v.md index be78c73b3fb..1f80d4b2584 100644 --- a/docs/source/en/model_doc/glm4v.md +++ b/docs/source/en/model_doc/glm4v.md @@ -75,6 +75,7 @@ messages = [ ] pipe(text=messages,max_new_tokens=20, return_full_text=False) ``` + @@ -123,6 +124,7 @@ output_text = processor.batch_decode( ) print(output_text) ``` + diff --git a/docs/source/en/model_doc/got_ocr2.md b/docs/source/en/model_doc/got_ocr2.md index 026273aa158..f8d6d69b0f6 100644 --- a/docs/source/en/model_doc/got_ocr2.md +++ b/docs/source/en/model_doc/got_ocr2.md @@ -34,7 +34,6 @@ alt="drawing" width="600"/> GOT-OCR2 training stages. Taken from the original paper. - Tips: GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like `pdftex`, `mathpix`, `matplotlib`, `tikz`, `verovio` or `pyecharts`. @@ -129,7 +128,6 @@ GOT-OCR2 can also generate formatted text, such as markdown or LaTeX. Here is an Although it might be reasonable in most cases to use a “for loop” for multi-page processing, some text data with formatting across several pages make it necessary to process all pages at once. GOT introduces a multi-page OCR (without “for loop”) feature, where multiple pages can be processed by the model at once, with the output being one continuous text. Here is an example of how to process multiple pages at once: - ```python >>> import torch >>> from transformers import AutoProcessor, AutoModelForImageTextToText, infer_device @@ -254,6 +252,7 @@ Here is an example of how to process sheet music: >>> with open("output.svg", "w") as f: >>> f.write(svg) ``` + drawing @@ -285,4 +284,3 @@ alt="drawing" width="600"/> [[autodoc]] GotOcr2ForConditionalGeneration - forward - diff --git a/docs/source/en/model_doc/gpt2.md b/docs/source/en/model_doc/gpt2.md index 1645a92f634..aaf2a50a173 100644 --- a/docs/source/en/model_doc/gpt2.md +++ b/docs/source/en/model_doc/gpt2.md @@ -23,7 +23,6 @@ rendered properly in your Markdown viewer. - # GPT-2 [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) is a scaled up version of GPT, a causal transformer language model, with 10x more parameters and training data. The model was pretrained on a 40GB dataset to predict the next word in a sequence based on all the previous words. This approach enabled the model to perform many downstream tasks in a zero-shot setting. The blog post released by OpenAI can be found [here](https://openai.com/index/better-language-models/). 
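To make the next-word-prediction setup concrete, here is a minimal sketch that loads the `openai-community/gpt2` checkpoint directly and samples a continuation with `generate()`; the sampling arguments are illustrative defaults rather than tuned values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")

# Sample a continuation token by token from the language modeling head
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```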
@@ -47,6 +46,7 @@ from transformers import pipeline pipeline = pipeline(task="text-generation", model="openai-community/gpt2", dtype=torch.float16, device=0) pipeline("Hello, I'm a language model") ``` + diff --git a/docs/source/en/model_doc/gpt_bigcode.md b/docs/source/en/model_doc/gpt_bigcode.md index a16536cbbe5..e837f2a08f5 100644 --- a/docs/source/en/model_doc/gpt_bigcode.md +++ b/docs/source/en/model_doc/gpt_bigcode.md @@ -47,7 +47,6 @@ The main differences compared to GPT2. - Merge the key and value caches into one (this changes the format of layer_past/ present, does it risk creating problems?) - Use the memory layout (self.num_heads, 3, self.head_dim) instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (prevents an overhead with the merged key and values, but makes the checkpoints incompatible with the original openai-community/gpt2 model). - You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575) > [!NOTE] @@ -91,7 +90,6 @@ Below is a expected speedup diagram that compares pure inference time between th - ## GPTBigCodeConfig [[autodoc]] GPTBigCodeConfig diff --git a/docs/source/en/model_doc/gpt_neo.md b/docs/source/en/model_doc/gpt_neo.md index de48bce6508..4df9cf69842 100644 --- a/docs/source/en/model_doc/gpt_neo.md +++ b/docs/source/en/model_doc/gpt_neo.md @@ -22,12 +22,10 @@ rendered properly in your Markdown viewer. - ## GPT-Neo [GPT-Neo](https://zenodo.org/records/5297715) is an open-source alternative to GPT-2 and GPT-3 models, built with Mesh TensorFlow for TPUs. GPT-Neo uses local attention in every other layer for more efficiency. It is trained on the [Pile](https://huggingface.co/datasets/EleutherAI/pile), a diverse dataset consisting of 22 smaller high-quality datasets. The original github repository can be found [here](https://github.com/EleutherAI/gpt-neo/tree/v1.1) - You can find all the original GPT-Neo checkpoints under the [EleutherAI](https://huggingface.co/EleutherAI?search_models=gpt-neo) organization. > [!TIP] @@ -45,6 +43,7 @@ from transformers import pipeline pipeline = pipeline(task="text-generation", model="EleutherAI/gpt-neo-1.3B", dtype=torch.float16, device=0) pipeline("Hello, I'm a language model") ``` + diff --git a/docs/source/en/model_doc/gpt_neox.md b/docs/source/en/model_doc/gpt_neox.md index a24fc6aa1d7..fb2ff709304 100644 --- a/docs/source/en/model_doc/gpt_neox.md +++ b/docs/source/en/model_doc/gpt_neox.md @@ -71,7 +71,7 @@ The `generate()` method can be used to generate text using GPT Neo model. Flash Attention 2 is an faster, optimized version of the model. -### Installation +### Installation First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [above](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer). @@ -92,7 +92,6 @@ model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", dtype=torc ... 
``` - ### Expected speedups Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using `stockmark/gpt-neox-japanese-1.4b` checkpoint and the Flash Attention 2 version of the model using a sequence length of 2048. @@ -101,7 +100,6 @@ Below is an expected speedup diagram that compares pure inference time between t - ## Using Scaled Dot Product Attention (SDPA) PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the @@ -162,7 +160,6 @@ following speedups during training and inference. | 4 | 1024 | 11.765 | 11.303 | 4.09 | 2558.96 | 2546.04 | 0.508 | | 4 | 2048 | 19.568 | 17.735 | 10.33 | 4175.5 | 4165.26 | 0.246 | - ## Resources - [Causal language modeling task guide](../tasks/language_modeling) diff --git a/docs/source/en/model_doc/gpt_neox_japanese.md b/docs/source/en/model_doc/gpt_neox_japanese.md index 7b22484b9a7..bf786f7561d 100644 --- a/docs/source/en/model_doc/gpt_neox_japanese.md +++ b/docs/source/en/model_doc/gpt_neox_japanese.md @@ -27,8 +27,6 @@ rendered properly in your Markdown viewer. GPT-NeoX-Japanese, a Japanese language model based on [GPT-NeoX](./gpt_neox). Japanese uses three types of characters (hiragana, katakana, kanji) and has a huge vocabulary. This model uses [BPEEncoder V2](https://github.com/tanreinama/Japanese-BPEEncoder_V2), a sub-word tokenizer to handle the different characters. - - The model also removes some bias parameters for better performance. You can find all the original GPT-NeoX-Japanese checkpoints under the [ABEJA](https://huggingface.co/abeja/models?search=gpt-neo-x) organization. diff --git a/docs/source/en/model_doc/gpt_oss.md b/docs/source/en/model_doc/gpt_oss.md index 136ebeb2957..47c970eb17e 100644 --- a/docs/source/en/model_doc/gpt_oss.md +++ b/docs/source/en/model_doc/gpt_oss.md @@ -41,7 +41,6 @@ Tips: This model was contributed by [INSERT YOUR HF USERNAME HERE](https://huggingface.co/). The original code can be found [here](). - ## GptOssConfig [[autodoc]] GptOssConfig diff --git a/docs/source/en/model_doc/granite.md b/docs/source/en/model_doc/granite.md index fce23a3c349..ef8bb0867b6 100644 --- a/docs/source/en/model_doc/granite.md +++ b/docs/source/en/model_doc/granite.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2024-08-23 and added to Hugging Face Transformers on 2024-08-27.* -
PyTorch FlashAttention @@ -69,12 +68,14 @@ inputs = tokenizer("Explain quantum computing in simple terms", return_tensors=" outputs = model.generate(**inputs, max_length=50, cache_implementation="static") print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` + ```python echo -e "Explain quantum computing simply." | transformers run --task text-generation --model ibm-granite/granite-3.3-8b-instruct --device 0 ``` + @@ -110,7 +111,6 @@ outputs = model.generate(**inputs, max_length=50, cache_implementation="static") print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` - ## GraniteConfig [[autodoc]] GraniteConfig diff --git a/docs/source/en/model_doc/granite_speech.md b/docs/source/en/model_doc/granite_speech.md index 5de42ff993f..680dba3a473 100644 --- a/docs/source/en/model_doc/granite_speech.md +++ b/docs/source/en/model_doc/granite_speech.md @@ -32,10 +32,8 @@ The [Granite Speech](https://huggingface.co/papers/2505.08699) model ([blog post 4. LoRA adapter(s): The Granite Speech model contains a modality specific LoRA, which will be enabled when audio features are provided, and disabled otherwise. - Note that most of the aforementioned components are implemented generically to enable compatibility and potential integration with other model architectures in transformers. - This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9944), [Avihu Dekel](https://huggingface.co/Avihu), and [George Saon](https://huggingface.co/gsaon). ## Usage tips @@ -47,22 +45,18 @@ This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9 [[autodoc]] GraniteSpeechConfig - ## GraniteSpeechEncoderConfig [[autodoc]] GraniteSpeechEncoderConfig - ## GraniteSpeechProcessor [[autodoc]] GraniteSpeechProcessor - ## GraniteSpeechFeatureExtractor [[autodoc]] GraniteSpeechFeatureExtractor - ## GraniteSpeechForConditionalGeneration [[autodoc]] GraniteSpeechForConditionalGeneration diff --git a/docs/source/en/model_doc/granitemoe.md b/docs/source/en/model_doc/granitemoe.md index 71c266a76b5..32616c07a28 100644 --- a/docs/source/en/model_doc/granitemoe.md +++ b/docs/source/en/model_doc/granitemoe.md @@ -65,7 +65,6 @@ for i in output: This model was contributed by [mayank-mishra](https://huggingface.co/mayank-mishra). - ## GraniteMoeConfig [[autodoc]] GraniteMoeConfig diff --git a/docs/source/en/model_doc/granitemoehybrid.md b/docs/source/en/model_doc/granitemoehybrid.md index 27b6e85d9e9..cb3db122e65 100644 --- a/docs/source/en/model_doc/granitemoehybrid.md +++ b/docs/source/en/model_doc/granitemoehybrid.md @@ -19,10 +19,8 @@ rendered properly in your Markdown viewer. ## Overview - The [GraniteMoeHybrid](https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek) model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding. - ```python from transformers import AutoModelForCausalLM, AutoTokenizer diff --git a/docs/source/en/model_doc/granitemoeshared.md b/docs/source/en/model_doc/granitemoeshared.md index d09ab5766fa..8b256de647f 100644 --- a/docs/source/en/model_doc/granitemoeshared.md +++ b/docs/source/en/model_doc/granitemoeshared.md @@ -19,7 +19,6 @@ rendered properly in your Markdown viewer. 
## Overview - The GraniteMoe model was proposed in [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://huggingface.co/papers/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda. Additionally this class GraniteMoeSharedModel adds shared experts for Moe. @@ -51,7 +50,6 @@ for i in output: This HF implementation is contributed by [Mayank Mishra](https://huggingface.co/mayank-mishra), [Shawn Tan](https://huggingface.co/shawntan) and [Sukriti Sharma](https://huggingface.co/SukritiSharma). - ## GraniteMoeSharedConfig [[autodoc]] GraniteMoeSharedConfig diff --git a/docs/source/en/model_doc/granitevision.md b/docs/source/en/model_doc/granitevision.md index b138c66f79d..f5a6316a22c 100644 --- a/docs/source/en/model_doc/granitevision.md +++ b/docs/source/en/model_doc/granitevision.md @@ -25,11 +25,13 @@ Tips: - This model is loaded into Transformers as an instance of LlaVA-Next. The usage and tips from [LLaVA-NeXT](llava_next) apply to this model as well. - You can apply the chat template on the tokenizer / processor in the same way as well. Example chat format: + ```bash "<|user|>\nWhat’s shown in this image?\n<|assistant|>\nThis image shows a red stop sign.<|end_of_text|><|user|>\nDescribe the image in more details.\n<|assistant|>\n" ``` Sample inference: + ```python from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration, infer_device diff --git a/docs/source/en/model_doc/helium.md b/docs/source/en/model_doc/helium.md index ba06feb18fb..10748f27be4 100644 --- a/docs/source/en/model_doc/helium.md +++ b/docs/source/en/model_doc/helium.md @@ -27,7 +27,6 @@ rendered properly in your Markdown viewer. Helium was proposed in [Announcing Helium-1 Preview](https://kyutai.org/2025/01/13/helium.html) by the Kyutai Team. - Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish. @@ -36,9 +35,6 @@ It supports the following languages: English, French, German, Italian, Portugues - **Language(s) (NLP):** English, French, German, Italian, Portuguese, Spanish - **License:** CC-BY 4.0 - - - ## Evaluation @@ -47,7 +43,7 @@ It supports the following languages: English, French, German, Italian, Portugues -The model was evaluated on MMLU, TriviaQA, NaturalQuestions, ARC Easy & Challenge, Open Book QA, Common Sense QA, +The model was evaluated on MMLU, TriviaQA, NaturalQuestions, ARC Easy & Challenge, Open Book QA, Common Sense QA, Physical Interaction QA, Social Interaction QA, HellaSwag, WinoGrande, Multilingual Knowledge QA, FLORES 200. #### Metrics @@ -92,7 +88,6 @@ We report BLEU on FLORES. || HS | 58.6 | 40.8 | 60.5 | 61.1 | 51.4 | || MKQA | 16.0 | 7.9 | 18.5 | 20.6 | 10.6 | - ## Technical Specifications ### Model Architecture and Objective @@ -110,12 +105,11 @@ Tips: - This model was contributed by [Laurent Mazare](https://huggingface.co/lmz) - ## Usage tips `Helium` can be found on the [Huggingface Hub](https://huggingface.co/models?other=helium) -In the following, we demonstrate how to use `helium-1-preview` for the inference. +In the following, we demonstrate how to use `helium-1-preview` for the inference. 
```python >>> from transformers import AutoModelForCausalLM, AutoTokenizer diff --git a/docs/source/en/model_doc/herbert.md b/docs/source/en/model_doc/herbert.md index 718a1a3df0b..aa6a4bf96ad 100644 --- a/docs/source/en/model_doc/herbert.md +++ b/docs/source/en/model_doc/herbert.md @@ -45,7 +45,6 @@ models.* This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found [here](https://github.com/allegro/HerBERT). - ## Usage example ```python diff --git a/docs/source/en/model_doc/hgnet_v2.md b/docs/source/en/model_doc/hgnet_v2.md index 7461a19a032..e5da5a0582d 100644 --- a/docs/source/en/model_doc/hgnet_v2.md +++ b/docs/source/en/model_doc/hgnet_v2.md @@ -81,13 +81,11 @@ print(f"The predicted class label is: {predicted_class_label}") [[autodoc]] HGNetV2Config - ## HGNetV2Backbone [[autodoc]] HGNetV2Backbone - forward - ## HGNetV2ForImageClassification [[autodoc]] HGNetV2ForImageClassification diff --git a/docs/source/en/model_doc/hiera.md b/docs/source/en/model_doc/hiera.md index 9f4627dd53f..b8fd9c14183 100644 --- a/docs/source/en/model_doc/hiera.md +++ b/docs/source/en/model_doc/hiera.md @@ -25,7 +25,7 @@ rendered properly in your Markdown viewer. Hiera was proposed in [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://huggingface.co/papers/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer -The paper introduces "Hiera," a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed "bells-and-whistles," are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity. +The paper introduces "Hiera," a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed "bells-and-whistles," are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity. 
The abstract from the paper is the following: diff --git a/docs/source/en/model_doc/hubert.md b/docs/source/en/model_doc/hubert.md index 18c8062da36..5a072214406 100644 --- a/docs/source/en/model_doc/hubert.md +++ b/docs/source/en/model_doc/hubert.md @@ -115,6 +115,7 @@ print(transcription[0]) - HuBERT models expect raw audio input as a 1D float array sampled at 16kHz. - If you want to use a `head_mask`, use the model with `attn_implementation="eager"`. + ```python model = HubertModel.from_pretrained("facebook/hubert-base-ls960", attn_implementation="eager") ``` diff --git a/docs/source/en/model_doc/hunyuan_v1_dense.md b/docs/source/en/model_doc/hunyuan_v1_dense.md index 520c68b7fd9..84f9e44e522 100644 --- a/docs/source/en/model_doc/hunyuan_v1_dense.md +++ b/docs/source/en/model_doc/hunyuan_v1_dense.md @@ -25,7 +25,6 @@ To be released with the official model launch. To be released with the official model launch. - ## Usage tips To be released with the official model launch. @@ -48,4 +47,3 @@ To be released with the official model launch. [[autodoc]] HunYuanDenseV1ForSequenceClassification - forward - diff --git a/docs/source/en/model_doc/hunyuan_v1_moe.md b/docs/source/en/model_doc/hunyuan_v1_moe.md index 36a53742715..e9bff74fe1b 100644 --- a/docs/source/en/model_doc/hunyuan_v1_moe.md +++ b/docs/source/en/model_doc/hunyuan_v1_moe.md @@ -25,7 +25,6 @@ To be released with the official model launch. To be released with the official model launch. - ## Usage tips To be released with the official model launch. @@ -48,4 +47,3 @@ To be released with the official model launch. [[autodoc]] HunYuanMoEV1ForSequenceClassification - forward - diff --git a/docs/source/en/model_doc/idefics.md b/docs/source/en/model_doc/idefics.md index 6296e722660..fdb6e5de465 100644 --- a/docs/source/en/model_doc/idefics.md +++ b/docs/source/en/model_doc/idefics.md @@ -34,7 +34,6 @@ The abstract from the paper is the following: This model was contributed by [HuggingFaceM4](https://huggingface.co/HuggingFaceM4). The original code can be found [here](). (TODO: don't have a public link yet). - IDEFICS modeling code in Transformers is for finetuning and inferencing the pre-trained IDEFICS models. @@ -43,7 +42,6 @@ To train a new IDEFICS model from scratch use the m4 codebase (a link will be pr - ## IdeficsConfig [[autodoc]] IdeficsConfig diff --git a/docs/source/en/model_doc/idefics2.md b/docs/source/en/model_doc/idefics2.md index 63dd1ec8277..696ad7c5d2b 100644 --- a/docs/source/en/model_doc/idefics2.md +++ b/docs/source/en/model_doc/idefics2.md @@ -202,19 +202,16 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] Idefics2Config - ## Idefics2Model [[autodoc]] Idefics2Model - forward - ## Idefics2ForConditionalGeneration [[autodoc]] Idefics2ForConditionalGeneration - forward - ## Idefics2ImageProcessor [[autodoc]] Idefics2ImageProcessor - preprocess diff --git a/docs/source/en/model_doc/idefics3.md b/docs/source/en/model_doc/idefics3.md index b3e199e2b88..0c8f46a9aee 100644 --- a/docs/source/en/model_doc/idefics3.md +++ b/docs/source/en/model_doc/idefics3.md @@ -45,6 +45,7 @@ If `do_resize` is set to `True`, the model resizes images so that the longest ed The default resizing behavior can be customized by passing a dictionary to the `size` parameter. For example, `{"longest_edge": 4 * 364}` is the default, but you can change it to a different value if needed. 
Here’s how to control resizing and set a custom size: + ```python image_processor = Idefics3ImageProcessor(do_resize=True, size={"longest_edge": 2 * 364}, max_image_size=364) ``` @@ -53,7 +54,6 @@ Additionally, the `max_image_size` parameter, which controls the size of each sq This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [andimarafioti](https://huggingface.co/andito). - ## Idefics3Config [[autodoc]] Idefics3Config @@ -76,7 +76,6 @@ This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) [[autodoc]] Idefics3ForConditionalGeneration - forward - ## Idefics3ImageProcessor [[autodoc]] Idefics3ImageProcessor - preprocess diff --git a/docs/source/en/model_doc/ijepa.md b/docs/source/en/model_doc/ijepa.md index 9d7c7874f1a..a81e7c3ab28 100644 --- a/docs/source/en/model_doc/ijepa.md +++ b/docs/source/en/model_doc/ijepa.md @@ -31,10 +31,8 @@ You can find the original I-JEPA checkpoints under the [AI at Meta](https://hugg > [!TIP] > This model was contributed by [jmtzt](https://huggingface.co/jmtzt). - - > Click on the I-JEPA models in the right sidebar for more examples of how to apply I-JEPA to different image representation and classification tasks. The example below demonstrates how to extract image features with [`Pipeline`] or the [`AutoModel`] class. @@ -88,10 +86,10 @@ embed_2 = infer(image_2) similarity = cosine_similarity(embed_1, embed_2) print(similarity) ``` + - Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits. @@ -142,4 +140,3 @@ print(similarity) [[autodoc]] IJepaForImageClassification - forward - diff --git a/docs/source/en/model_doc/instructblip.md b/docs/source/en/model_doc/instructblip.md index b0669f1c065..d22d8df0d39 100644 --- a/docs/source/en/model_doc/instructblip.md +++ b/docs/source/en/model_doc/instructblip.md @@ -59,7 +59,6 @@ The attributes can be obtained from model config, as `model.config.num_query_tok [[autodoc]] InstructBlipProcessor - ## InstructBlipVisionModel [[autodoc]] InstructBlipVisionModel diff --git a/docs/source/en/model_doc/instructblipvideo.md b/docs/source/en/model_doc/instructblipvideo.md index e34b454a123..d4d868b7f90 100644 --- a/docs/source/en/model_doc/instructblipvideo.md +++ b/docs/source/en/model_doc/instructblipvideo.md @@ -59,7 +59,6 @@ The attributes can be obtained from model config, as `model.config.num_query_tok [[autodoc]] InstructBlipVideoProcessor - ## InstructBlipVideoVideoProcessor [[autodoc]] InstructBlipVideoVideoProcessor diff --git a/docs/source/en/model_doc/internvl.md b/docs/source/en/model_doc/internvl.md index bf760fdbdd7..7e9fea7f4f2 100644 --- a/docs/source/en/model_doc/internvl.md +++ b/docs/source/en/model_doc/internvl.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2025-04-14 and added to Hugging Face Transformers on 2025-04-18.* -
PyTorch @@ -32,19 +31,14 @@ The abstract from the paper is the following: *We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.* - drawing Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the original checkpoint. - - drawing Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the original checkpoint. - - This model was contributed by [yonigozlan](https://huggingface.co/yonigozlan). The original code can be found [here](https://github.com/OpenGVLab/InternVL). @@ -75,6 +69,7 @@ Here is how you can use the `image-text-to-text` pipeline to perform inference w >>> outputs[0]["generated_text"] 'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r' ``` + ### Inference on a single image This example demonstrates how to perform inference on a single image with the InternVL models using chat templates. @@ -112,7 +107,6 @@ This example demonstrates how to perform inference on a single image with the In ### Text-only generation This example shows how to generate text using the InternVL model without providing any image input. 
- ```python >>> from transformers import AutoProcessor, AutoModelForImageTextToText >>> import torch diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 0aa06b16e90..f85d08c5f64 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -75,6 +75,7 @@ input_ids = tokenizer("Plants create energy through a process known as", return_ output = model.generate(**input_ids, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + @@ -140,19 +141,16 @@ print(assistant_response) [[autodoc]] JambaConfig - ## JambaModel [[autodoc]] JambaModel - forward - ## JambaForCausalLM [[autodoc]] JambaForCausalLM - forward - ## JambaForSequenceClassification [[autodoc]] transformers.JambaForSequenceClassification diff --git a/docs/source/en/model_doc/jetmoe.md b/docs/source/en/model_doc/jetmoe.md index 059fb956ce2..3fca2c2d676 100644 --- a/docs/source/en/model_doc/jetmoe.md +++ b/docs/source/en/model_doc/jetmoe.md @@ -27,15 +27,14 @@ rendered properly in your Markdown viewer. **JetMoe-8B** is an 8B Mixture-of-Experts (MoE) language model developed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ) and [MyShell](https://myshell.ai/). JetMoe project aims to provide a LLaMA2-level performance and efficient language model with a limited budget. -To achieve this goal, JetMoe uses a sparsely activated architecture inspired by the [ModuleFormer](https://huggingface.co/papers/2306.04640). +To achieve this goal, JetMoe uses a sparsely activated architecture inspired by the [ModuleFormer](https://huggingface.co/papers/2306.04640). Each JetMoe block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts. Given the input tokens, it activates a subset of its experts to process them. -This sparse activation schema enables JetMoe to achieve much better training throughput than similar size dense models. +This sparse activation schema enables JetMoe to achieve much better training throughput than similar size dense models. The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy. This model was contributed by [Yikang Shen](https://huggingface.co/YikangS). - ## JetMoeConfig [[autodoc]] JetMoeConfig diff --git a/docs/source/en/model_doc/kosmos2_5.md b/docs/source/en/model_doc/kosmos2_5.md index 530f1d459ae..706ce04cef4 100644 --- a/docs/source/en/model_doc/kosmos2_5.md +++ b/docs/source/en/model_doc/kosmos2_5.md @@ -19,7 +19,6 @@ specific language governing permissions and limitations under the License.
- # KOSMOS-2.5 The Kosmos-2.5 model was proposed in [KOSMOS-2.5: A Multimodal Literate Model](https://huggingface.co/papers/2309.11419/) by Microsoft. @@ -159,7 +158,6 @@ image.save("output.png") - ## Chat version The authors also released Kosmos-2.5 Chat, which is a chat version optimized for document understanding. You can use it like so: diff --git a/docs/source/en/model_doc/kyutai_speech_to_text.md b/docs/source/en/model_doc/kyutai_speech_to_text.md index 30497e69594..f3428f6b86f 100644 --- a/docs/source/en/model_doc/kyutai_speech_to_text.md +++ b/docs/source/en/model_doc/kyutai_speech_to_text.md @@ -15,7 +15,7 @@ rendered properly in your Markdown viewer. --> *This model was released on 2025-06-17 and added to Hugging Face Transformers on 2025-06-25.* -# Kyutai Speech-To-Text +# Kyutai Speech-To-Text ## Overview [Kyutai STT](https://kyutai.org/next/stt) is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai’s lab has released two model checkpoints: @@ -98,7 +98,6 @@ for output in decoded_outputs: This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb). The original code can be found [here](https://github.com/kyutai-labs/moshi). - ## KyutaiSpeechToTextConfig [[autodoc]] KyutaiSpeechToTextConfig diff --git a/docs/source/en/model_doc/layoutlm.md b/docs/source/en/model_doc/layoutlm.md index 708a5bc1ab4..88dde323e29 100644 --- a/docs/source/en/model_doc/layoutlm.md +++ b/docs/source/en/model_doc/layoutlm.md @@ -116,7 +116,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - Refer to this [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb) for an example of how to fine-tune LayoutLM for token classification. - Read [Deploy LayoutLM with Hugging Face Inference Endpoints](https://www.philschmid.de/inference-endpoints-layoutlm) to learn how to deploy LayoutLM. - ## LayoutLMConfig [[autodoc]] LayoutLMConfig diff --git a/docs/source/en/model_doc/layoutlmv2.md b/docs/source/en/model_doc/layoutlmv2.md index c376c04ad76..f74d3b4294e 100644 --- a/docs/source/en/model_doc/layoutlmv2.md +++ b/docs/source/en/model_doc/layoutlmv2.md @@ -55,10 +55,12 @@ this https URL.* LayoutLMv2 depends on `detectron2`, `torchvision` and `tesseract`. Run the following to install them: + ```bash python -m pip install 'git+https://github.com/facebookresearch/detectron2.git' python -m pip install torchvision tesseract ``` + (If you are developing for LayoutLMv2, note that passing the doctests also requires the installation of these packages.) ## Usage tips @@ -145,7 +147,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - See also: [Question answering task guide](../tasks/question_answering) - See also: [Document question answering task guide](../tasks/document_question_answering) - - A notebook on how to [finetune LayoutLMv2 for token-classification on CORD dataset](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/CORD/Fine_tuning_LayoutLMv2ForTokenClassification_on_CORD.ipynb). 
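As a quick complement to the LayoutLMv2 resources above, here is a minimal, hedged sketch of how the processor and a token-classification head fit together (the document path and `num_labels` value are placeholders, and `detectron2`/`tesseract` must already be installed as described earlier):

```python
# A minimal sketch, not taken from the linked notebooks. The processor runs Tesseract OCR
# on the page image and builds the word, bounding-box, and image inputs the model expects.
from PIL import Image
from transformers import LayoutLMv2ForTokenClassification, LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=7  # num_labels is a placeholder
)

image = Image.open("document.png").convert("RGB")  # placeholder path to a scanned page
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```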
diff --git a/docs/source/en/model_doc/led.md b/docs/source/en/model_doc/led.md index 4acc6a63979..ce1baa619a8 100644 --- a/docs/source/en/model_doc/led.md +++ b/docs/source/en/model_doc/led.md @@ -89,6 +89,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ```bash !echo -e "Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet. Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts." | transformers run --task summarization --model allenai/led-base-16384 --device 0 ``` + diff --git a/docs/source/en/model_doc/lfm2.md b/docs/source/en/model_doc/lfm2.md index 3ea0936b96b..0e78f9935f9 100644 --- a/docs/source/en/model_doc/lfm2.md +++ b/docs/source/en/model_doc/lfm2.md @@ -23,7 +23,7 @@ rendered properly in your Markdown viewer. ## Overview -[LFM2](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models) represents a new generation of Liquid Foundation Models developed by [Liquid AI](https://liquid.ai/), specifically designed for edge AI and on-device deployment. +[LFM2](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models) represents a new generation of Liquid Foundation Models developed by [Liquid AI](https://liquid.ai/), specifically designed for edge AI and on-device deployment. The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy. diff --git a/docs/source/en/model_doc/lfm2_vl.md b/docs/source/en/model_doc/lfm2_vl.md index 3a93a8189a7..2e25d94e883 100644 --- a/docs/source/en/model_doc/lfm2_vl.md +++ b/docs/source/en/model_doc/lfm2_vl.md @@ -19,7 +19,7 @@ rendered properly in your Markdown viewer. PyTorch
-# LFM2-VL +# LFM2-VL ## Overview @@ -31,7 +31,7 @@ LFM2-VL consists of three main components: a language model backbone, a vision e * Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B * Base (86M) for fast image processing for LFM2-VL-450M -The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP connector with pixel unshuffle to reduce image token count. +The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP connector with pixel unshuffle to reduce image token count. ## Example diff --git a/docs/source/en/model_doc/lightglue.md b/docs/source/en/model_doc/lightglue.md index 847fabdaac2..16827345ef0 100644 --- a/docs/source/en/model_doc/lightglue.md +++ b/docs/source/en/model_doc/lightglue.md @@ -153,4 +153,3 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size [[autodoc]] LightGlueForKeypointMatching - forward - diff --git a/docs/source/en/model_doc/llama2.md b/docs/source/en/model_doc/llama2.md index 96c733d88fa..c66667f235f 100644 --- a/docs/source/en/model_doc/llama2.md +++ b/docs/source/en/model_doc/llama2.md @@ -130,11 +130,13 @@ visualizer("Plants create energy through a process known as") # update model config with padding token model.config.pad_token_id ``` + - It is recommended to initialize the `embed_tokens` layer with the following code to ensure encoding the padding token outputs zeros. ```py self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx) ``` + - The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of the word (for example, "Banana"), the tokenizer doesn't prepend the prefix space to the string. - Don't use the `dtype` parameter in [`~AutoModel.from_pretrained`] if you're using FlashAttention-2 because it only supports fp16 or bf16. You should use [Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html), set fp16 or bf16 to `True` if using [`Trainer`], or use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast). 
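To make the FlashAttention-2 note above concrete, here is a minimal sketch (the checkpoint and prompt are only examples): the model is loaded without `dtype`, and mixed precision comes from `torch.autocast` instead.

```python
# A hedged sketch of the tip above: no `dtype` in from_pretrained when using FlashAttention-2;
# torch.autocast supplies the fp16/bf16 computation instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt").to(model.device)
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```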
@@ -142,7 +144,6 @@ visualizer("Plants create energy through a process known as") [[autodoc]] LlamaConfig - ## LlamaTokenizer [[autodoc]] LlamaTokenizer @@ -165,7 +166,6 @@ visualizer("Plants create energy through a process known as") [[autodoc]] LlamaModel - forward - ## LlamaForCausalLM [[autodoc]] LlamaForCausalLM diff --git a/docs/source/en/model_doc/llama4.md b/docs/source/en/model_doc/llama4.md index 28e168b9043..84812a41997 100644 --- a/docs/source/en/model_doc/llama4.md +++ b/docs/source/en/model_doc/llama4.md @@ -17,7 +17,6 @@ rendered properly in your Markdown viewer. # Llama4 -
PyTorch @@ -53,7 +52,6 @@ The examples below demonstrates how to generate with [`Pipeline`] or the [`AutoM showcasing how to toggle the right attributes to enable very long-context generations, as some flavors of Llama 4 have context lengths going up to 10 million tokens. - @@ -255,7 +253,6 @@ Updating the default attention function can significantly improve compute perfor As of release, the Llama 4 model supports the following attention methods: `eager`, `flex_attention`, `sdpa`. We recommend using `flex_attention` for best results. Switching attention mechanism is done at the model initialization step: - @@ -278,6 +275,7 @@ model = Llama4ForConditionalGeneration.from_pretrained( dtype=torch.bfloat16, ) ``` + The `sdpa` attention method is generally more compute-efficient than the `eager` method. @@ -293,6 +291,7 @@ model = Llama4ForConditionalGeneration.from_pretrained( dtype=torch.bfloat16, ) ``` + The `eager` attention method is set by default, so no need for anything different when loading the model: @@ -307,10 +306,10 @@ model = Llama4ForConditionalGeneration.from_pretrained( dtype=torch.bfloat16, ) ``` + - ### Quantization Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for available quantization backends. @@ -318,8 +317,6 @@ At time of release, both FBGEMM and LLM-Compressor are supported; more quantizat See below for examples using both: - - Here is an example loading an BF16 model in FP8 using the FBGEMM approach: @@ -378,6 +375,7 @@ outputs = model.generate(**inputs.to(model.device), max_new_tokens=100) outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:]) print(outputs[0]) ``` + diff --git a/docs/source/en/model_doc/llava.md b/docs/source/en/model_doc/llava.md index 1d7427b9015..e4ef7d77069 100644 --- a/docs/source/en/model_doc/llava.md +++ b/docs/source/en/model_doc/llava.md @@ -47,13 +47,11 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/ - Note the model has not been explicitly trained to process multiple images in the same prompt, although this is technically possible, you may experience inaccurate results. - > [!NOTE] > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you. Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be failure when merging the embeddings. The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches. - ### Formatting Prompts with Chat Templates Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method. 
@@ -63,11 +61,9 @@ Each **checkpoint** is trained with a specific prompt format, depending on the u - Each message should be a dictionary with `"role"` and `"content"` keys. - The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`. - -Here’s an example of how to structure your input. +Here’s an example of how to structure your input. We will use [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows: - ```python from transformers import AutoProcessor @@ -104,6 +100,7 @@ print(text_prompt) - If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by each llava checkpoint: [llava-interleave models](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) requires the following format: + ```bash "<|im_start|>user \nWhat is shown in this image?<|im_end|><|im_start|>assistant" ``` @@ -115,6 +112,7 @@ For multiple turns conversation: ``` [llava-1.5 models](https://huggingface.co/collections/llava-hf/llava-15-65f762d5b6941db5c2ba07e0) requires the following format: + ```bash "USER: \n ASSISTANT:" ``` @@ -127,12 +125,10 @@ For multiple turns conversation: 🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it. - ## Usage examples ### Single input inference - ```python import torch from transformers import AutoProcessor, LlavaForConditionalGeneration @@ -164,7 +160,6 @@ generate_ids = model.generate(**inputs, max_new_tokens=30) processor.batch_decode(generate_ids, skip_special_tokens=True) ``` - ### Batched inference LLaVa also supports batched inference. Here is how you can do it: @@ -214,7 +209,6 @@ generate_ids = model.generate(**inputs, max_new_tokens=30) processor.batch_decode(generate_ids, skip_special_tokens=True) ``` - ## Note regarding reproducing original implementation In order to match the logits of the [original implementation](https://github.com/haotian-liu/LLaVA/tree/main), one needs to additionally specify `do_pad=True` when instantiating `LlavaImageProcessor`: @@ -238,7 +232,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run Llava on a free-tier Google colab instance leveraging 4-bit inference. - A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎 - ## LlavaConfig [[autodoc]] LlavaConfig diff --git a/docs/source/en/model_doc/llava_next.md b/docs/source/en/model_doc/llava_next.md index e7ff4c896e2..3857f154cf4 100644 --- a/docs/source/en/model_doc/llava_next.md +++ b/docs/source/en/model_doc/llava_next.md @@ -141,7 +141,6 @@ with torch.inference_mode(): print(processor.decode(output[0], skip_special_tokens=True)) ``` - ## Notes * Different checkpoints (Mistral, Vicuna, etc.) require a specific prompt format depending on the underlying LLM. Always use [`~ProcessorMixin.apply_chat_template`] to ensure correct formatting. Refer to the [Templates](../chat_templating) guide for more details. 
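For the note above about checkpoint-specific prompt formats, a short hedged sketch of `apply_chat_template` (the checkpoint is one example; other LLaVA-NeXT checkpoints render a different template):

```python
# A minimal sketch: apply_chat_template renders the prompt format expected by the
# Mistral-based LLaVA-NeXT checkpoint, so no format string has to be written by hand.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)  # roughly "[INST] <image>\nWhat is shown in this image? [/INST]"
```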
@@ -189,7 +188,6 @@ output = model.generate(**inputs, max_new_tokens=100) print(processor.decode(output[0], skip_special_tokens=True)) ``` - ## LlavaNextConfig [[autodoc]] LlavaNextConfig diff --git a/docs/source/en/model_doc/llava_next_video.md b/docs/source/en/model_doc/llava_next_video.md index 9379c1cc2ed..131dd1aba50 100644 --- a/docs/source/en/model_doc/llava_next_video.md +++ b/docs/source/en/model_doc/llava_next_video.md @@ -30,7 +30,6 @@ The LLaVa-NeXT-Video model was proposed in [LLaVA-NeXT: A Strong Zero-shot Video [LLaVA-NeXT](llava_next) surprisingly has strong performance in understanding video content in zero-shot fashion with the AnyRes technique that it uses. The AnyRes technique naturally represents a high-resolution image into multiple images. This technique is naturally generalizable to represent videos because videos can be considered as a set of frames (similar to a set of images in LLaVa-NeXT). The current version of LLaVA-NeXT makes use of AnyRes and trains with supervised fine-tuning (SFT) on top of LLaVA-Next on video data to achieves better video understanding capabilities.The model is a current SOTA among open-source models on [VideoMME bench](https://huggingface.co/papers/2405.21075). - The introduction from the blog is the following: On January 30, 2024, we released LLaVA-NeXT, an open-source Large Multimodal Model (LMM) that has been trained exclusively on text-image data. With the proposed AnyRes technique, it boosts capabilities in reasoning, OCR, and world knowledge, demonstrating remarkable performance across a spectrum of image-based multimodal understanding tasks, and even exceeding Gemini-Pro on several image benchmarks, e.g. MMMU and MathVista. @@ -42,7 +41,6 @@ On January 30, 2024, we released LLaVA-NeXT, an open-source Large Multimodal Mod - Strong video understanding ability. (1) LLaVA-Next-Image, which combines the above two techniques, yields superior zero-shot performance than open-source LMMs tuned on videos. (2) LLaVA-Next-Video, further supervised fine-tuning (SFT) LLaVA-Next-Image on video data, achieves better video understanding capabilities compared to LLaVA-Next-Image. (3) LLaVA-Next-Video-DPO, which aligns the model response with AI feedback using direct preference optimization (DPO), showing significant performance boost. - Efficient deployment and inference with SGLang. It allows 5x faster inference on video tasks, allowing more scalable serving such as million-level video re-captioning. See instructions in our repo.** - This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay). The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/inference). @@ -56,13 +54,11 @@ The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tre - > [!NOTE] > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you. Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be failure when merging the embeddings. 
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches. - ### Formatting Prompts with Chat Templates Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method. @@ -72,7 +68,6 @@ Each **checkpoint** is trained with a specific prompt format, depending on the u - Each message should be a dictionary with `"role"` and `"content"` keys. - The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`. - Here’s an example of how to structure your input. We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images. ```python @@ -116,8 +111,6 @@ print(text_prompt) 🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it. - - ## Usage example ### Single Media Mode @@ -153,10 +146,9 @@ out = model.generate(**inputs, max_new_tokens=60) processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True) ``` - ### Mixed Media Mode -The model can also generate from an interleaved image-video inputs. However note, that it was not trained in interleaved image-video setting which might affect the performance. Below is an example usage for mixed media input, add the following lines to the above code snippet: +The model can also generate from an interleaved image-video inputs. However note, that it was not trained in interleaved image-video setting which might affect the performance. Below is an example usage for mixed media input, add the following lines to the above code snippet: ```python @@ -196,7 +188,7 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza ### Quantization using Bitsandbytes for memory efficiency -The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. This allows for efficient deployment on resource-constrained cases. +The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. This allows for efficient deployment on resource-constrained cases. First, make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library. @@ -210,7 +202,6 @@ We value your feedback to help identify bugs before the full release! 
Check out Then simply load the quantized model by adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below: - ```python from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor @@ -224,7 +215,6 @@ quantization_config = BitsAndBytesConfig( model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", quantization_config=quantization_config, device_map="auto") ``` - ### Flash-Attention 2 to speed-up generation Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model. @@ -249,8 +239,6 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained( ).to(0) ``` - - ## LlavaNextVideoConfig [[autodoc]] LlavaNextVideoConfig diff --git a/docs/source/en/model_doc/llava_onevision.md b/docs/source/en/model_doc/llava_onevision.md index e546530922a..48fa769835f 100644 --- a/docs/source/en/model_doc/llava_onevision.md +++ b/docs/source/en/model_doc/llava_onevision.md @@ -54,7 +54,6 @@ Tips: - ### Formatting Prompts with Chat Templates Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method. @@ -64,8 +63,7 @@ Each **checkpoint** is trained with a specific prompt format, depending on the u - Each message should be a dictionary with `"role"` and `"content"` keys. - The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`. - -Here’s an example of how to structure your input. +Here’s an example of how to structure your input. We will use [llava-onevision-qwen2-7b-si-hf](https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-si-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows: ```python @@ -103,11 +101,9 @@ print(text_prompt) 🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it. - This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay). The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main). - ## Usage example ### Single image inference @@ -293,7 +289,6 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained( ).to(0) ``` - ## LlavaOnevisionConfig [[autodoc]] LlavaOnevisionConfig diff --git a/docs/source/en/model_doc/longcat_flash.md b/docs/source/en/model_doc/longcat_flash.md index d9a9a4a7f60..651f3386f16 100644 --- a/docs/source/en/model_doc/longcat_flash.md +++ b/docs/source/en/model_doc/longcat_flash.md @@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. - ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. --> @@ -43,6 +42,7 @@ The original code can be found [here](https://huggingface.co/meituan-longcat/Lon ## Usage examples The model is large: you will need 2x8 H100 to run inference. 
+ ```python # launch_longcat.py from transformers import LongcatFlashForCausalLM, AutoTokenizer @@ -76,6 +76,7 @@ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 | 1 --rdzv-id --r ``` And you'll get a nice generation: + ```json [Round 0] USER:Hello! What is the capital of France? What can you tell me about it? ASSISTANT:Hello! 😊 The capital of France is Paris, one of the most famous and beloved cities in the world. Here’s a quick overview of what makes Paris special: 1. Iconic Landmarks diff --git a/docs/source/en/model_doc/longformer.md b/docs/source/en/model_doc/longformer.md index c80294ab7a0..b8375998a06 100644 --- a/docs/source/en/model_doc/longformer.md +++ b/docs/source/en/model_doc/longformer.md @@ -85,7 +85,6 @@ echo -e "San Francisco 49ers cornerback Shawntae Spencer will miss the rest of t - ## Notes - Longformer is based on [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) and doesn't have `token_type_ids`. You don't need to indicate which token belongs to which segment. You only need to separate the segments with the separation token `` or `tokenizer.sep_token`. diff --git a/docs/source/en/model_doc/longt5.md b/docs/source/en/model_doc/longt5.md index bd22d757a74..a197de15a57 100644 --- a/docs/source/en/model_doc/longt5.md +++ b/docs/source/en/model_doc/longt5.md @@ -29,7 +29,6 @@ encoder-decoder transformer pre-trained in a text-to-text denoising generative s T5 model, and it enables using one of the two different efficient attention mechanisms - (1) Local attention, or (2) Transient-Global attention. - The abstract from the paper is the following: *Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the @@ -95,7 +94,6 @@ The complexity of this mechanism is `O(l(r + l/k))`. >>> rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"]) ``` - ## Resources - [Translation task guide](../tasks/translation) diff --git a/docs/source/en/model_doc/m2m_100.md b/docs/source/en/model_doc/m2m_100.md index 29d43af97a2..f9ac7e5ebe9 100644 --- a/docs/source/en/model_doc/m2m_100.md +++ b/docs/source/en/model_doc/m2m_100.md @@ -44,7 +44,6 @@ open-source our scripts so that others may reproduce the data, evaluation, and f This model was contributed by [valhalla](https://huggingface.co/valhalla). - ## Usage tips and examples M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is @@ -76,9 +75,9 @@ loss = model(**model_inputs).loss # forward pass **Generation** -M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id -being forced as the first generated token. To force the target language id as the first generated token, pass the -*forced_bos_token_id* parameter to the *generate* method. The following example shows how to translate between +M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id +being forced as the first generated token. To force the target language id as the first generated token, pass the +*forced_bos_token_id* parameter to the *generate* method. The following example shows how to translate between Hindi to French and Chinese to English using the *facebook/m2m100_418M* checkpoint. ```python @@ -136,7 +135,7 @@ Hindi to French and Chinese to English using the *facebook/m2m100_418M* checkpoi Flash Attention 2 is a faster, optimized version of the attention scores computation which relies on `cuda` kernels. 
-### Installation +### Installation First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). diff --git a/docs/source/en/model_doc/mamba.md b/docs/source/en/model_doc/mamba.md index d243bcf7e40..031e353c93d 100644 --- a/docs/source/en/model_doc/mamba.md +++ b/docs/source/en/model_doc/mamba.md @@ -27,7 +27,6 @@ rendered properly in your Markdown viewer. You can find all the original Mamba checkpoints under the [State Space Models](https://huggingface.co/state-spaces) organization. - > [!TIP] > This model was contributed by [Molbap](https://huggingface.co/Molbap) and [AntonV](https://huggingface.co/AntonV). > Click on the Mamba models in the right sidebar for more examples of how to apply Mamba to different language tasks. @@ -93,6 +92,7 @@ input_ids = tokenizer("Plants create energy through a process known as", return_ output = model.generate(**input_ids) print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + ## Notes - The current implementation uses the original CUDA kernels. The FlashAttention equivalent implementation is hosted in the [mamba-ssm](https://github.com/state-spaces/mamba) and [causal_conv1d](https://github.com/Dao-AILab/causal-conv1d) repositories. Make sure to install them if your hardware supports it! diff --git a/docs/source/en/model_doc/mamba2.md b/docs/source/en/model_doc/mamba2.md index 11666e1fa57..56a33dfbe0b 100644 --- a/docs/source/en/model_doc/mamba2.md +++ b/docs/source/en/model_doc/mamba2.md @@ -91,6 +91,7 @@ input_ids = tokenizer("Plants create energy through a process known as", return_ output = model.generate(**input_ids) print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + ## Notes - Codestral Mamba has `groups=8` which are similar to the number of kv heads in an attention-based model. @@ -124,7 +125,6 @@ trainer = SFTTrainer( trainer.train() ``` - ## Mamba2Config [[autodoc]] Mamba2Config diff --git a/docs/source/en/model_doc/marian.md b/docs/source/en/model_doc/marian.md index 4b08ac1901c..00b2f91677d 100644 --- a/docs/source/en/model_doc/marian.md +++ b/docs/source/en/model_doc/marian.md @@ -25,23 +25,17 @@ rendered properly in your Markdown viewer. # MarianMT - - [MarianMT](https://huggingface.co/papers/1804.00344) is a machine translation model trained with the Marian framework which is written in pure C++. The framework includes its own custom auto-differentiation engine and efficient meta-algorithms to train encoder-decoder models like BART. All MarianMT models are transformer encoder-decoders with 6 layers in each component, use static sinusoidal positional embeddings, don't have a layernorm embedding, and the model starts generating with the prefix `pad_token_id` instead of ``. - - You can find all the original MarianMT checkpoints under the [Language Technology Research Group at the University of Helsinki](https://huggingface.co/Helsinki-NLP/models?search=opus-mt) organization. - > [!TIP] > This model was contributed by [sshleifer](https://huggingface.co/sshleifer). > > Click on the MarianMT models in the right sidebar for more examples of how to apply MarianMT to translation tasks. - The example below demonstrates how to translate text using [`Pipeline`] or the [`AutoModel`] class. 
@@ -78,7 +72,6 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True)) - Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to. ```python @@ -87,6 +80,7 @@ from transformers.utils.attention_visualizer import AttentionMaskVisualizer visualizer = AttentionMaskVisualizer("Helsinki-NLP/opus-mt-en-de") visualizer("Hello, how are you?") ``` +
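As a companion to the MarianMT usage notes above, a minimal `Pipeline` sketch (the checkpoint and sentence are only examples, reusing the English-to-German model from the attention-visualizer snippet):

```python
# A hedged sketch, not the documentation's own example: translate English to German
# with the Helsinki-NLP/opus-mt-en-de checkpoint referenced above.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
print(translator("Hello, how are you?")[0]["translation_text"])
```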
diff --git a/docs/source/en/model_doc/markuplm.md b/docs/source/en/model_doc/markuplm.md index 897b97853bd..c7608f397f6 100644 --- a/docs/source/en/model_doc/markuplm.md +++ b/docs/source/en/model_doc/markuplm.md @@ -54,7 +54,7 @@ These are the XPATH tags and subscripts respectively for each token in the input - One can use [`MarkupLMProcessor`] to prepare all data for the model. Refer to the [usage guide](#usage-markuplmprocessor) for more info. +alt="drawing" width="600"/> MarkupLM architecture. Taken from the original paper. diff --git a/docs/source/en/model_doc/matcha.md b/docs/source/en/model_doc/matcha.md index e6a73c58fd0..9180d765c2b 100644 --- a/docs/source/en/model_doc/matcha.md +++ b/docs/source/en/model_doc/matcha.md @@ -42,7 +42,7 @@ Currently 6 checkpoints are available for MatCha: - `google/matcha-chartqa`: MatCha model fine-tuned on ChartQA dataset. It can be used to answer questions about charts. - `google/matcha-plotqa-v1`: MatCha model fine-tuned on PlotQA dataset. It can be used to answer questions about plots. - `google/matcha-plotqa-v2`: MatCha model fine-tuned on PlotQA dataset. It can be used to answer questions about plots. -- `google/matcha-chart2text-statista`: MatCha model fine-tuned on Statista dataset. +- `google/matcha-chart2text-statista`: MatCha model fine-tuned on Statista dataset. - `google/matcha-chart2text-pew`: MatCha model fine-tuned on Pew dataset. The models finetuned on `chart2text-pew` and `chart2text-statista` are more suited for summarization, whereas the models finetuned on `plotqa` and `chartqa` are more suited for question answering. @@ -67,6 +67,7 @@ print(processor.decode(predictions[0], skip_special_tokens=True)) ## Fine-tuning To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faster convergence: + ```python from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup diff --git a/docs/source/en/model_doc/mega.md b/docs/source/en/model_doc/mega.md index 614df243553..d6580427778 100644 --- a/docs/source/en/model_doc/mega.md +++ b/docs/source/en/model_doc/mega.md @@ -44,19 +44,16 @@ The abstract from the paper is the following: This model was contributed by [mnaylor](https://huggingface.co/mnaylor). The original code can be found [here](https://github.com/facebookresearch/mega). - ## Usage tips - MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set `bidirectional=False` to avoid errors with default bidirectional. - Mega-chunk is a variant of mega that reduces time and spaces complexity from quadratic to linear. Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size - ## Implementation Notes - The original implementation of MEGA had an inconsistent expectation of attention masks for padding and causal self-attention between the softmax attention and Laplace/squared ReLU method. This implementation addresses that inconsistency. 
- The original implementation did not include token type embeddings; this implementation adds support for these, with the option controlled by MegaConfig.add_token_type_embeddings - ## MegaConfig [[autodoc]] MegaConfig diff --git a/docs/source/en/model_doc/megatron-bert.md b/docs/source/en/model_doc/megatron-bert.md index f8845556f8f..5307fdcd491 100644 --- a/docs/source/en/model_doc/megatron-bert.md +++ b/docs/source/en/model_doc/megatron-bert.md @@ -45,8 +45,8 @@ achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15. accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).* -This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM). -That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. In particular, +This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM). +That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. In particular, it contains a hybrid model parallel approach using "tensor parallel" and "pipeline parallel" techniques. ## Usage tips diff --git a/docs/source/en/model_doc/mimi.md b/docs/source/en/model_doc/mimi.md index 2d655aa5966..440f89b2c56 100644 --- a/docs/source/en/model_doc/mimi.md +++ b/docs/source/en/model_doc/mimi.md @@ -39,7 +39,7 @@ The example below demonstrates how to encode and decode audio with the [`AutoMod -```python +```python >>> from datasets import load_dataset, Audio >>> from transformers import MimiModel, AutoFeatureExtractor >>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") diff --git a/docs/source/en/model_doc/minimax.md b/docs/source/en/model_doc/minimax.md index 02d016c019c..a27d45089ce 100644 --- a/docs/source/en/model_doc/minimax.md +++ b/docs/source/en/model_doc/minimax.md @@ -109,8 +109,8 @@ To load and run a model using Flash Attention-2, refer to the snippet below: ### Sliding window Attention -The current implementation supports the sliding window attention mechanism and memory efficient cache management. -To enable sliding window attention, just make sure to have a `flash-attn` version that is compatible with sliding window attention (`>=2.3.0`). +The current implementation supports the sliding window attention mechanism and memory efficient cache management. +To enable sliding window attention, just make sure to have a `flash-attn` version that is compatible with sliding window attention (`>=2.3.0`). The Flash Attention-2 model uses also a more memory efficient cache slicing mechanism - as recommended per the official implementation of Mistral model that use rolling cache mechanism we keep the cache size fixed (`self.config.sliding_window`), support batched generation only for `padding_side="left"` and use the absolute position of the current token to compute the positional embedding. diff --git a/docs/source/en/model_doc/ministral.md b/docs/source/en/model_doc/ministral.md index 13b6f3d6c04..c2128512586 100644 --- a/docs/source/en/model_doc/ministral.md +++ b/docs/source/en/model_doc/ministral.md @@ -30,7 +30,6 @@ rendered properly in your Markdown viewer. This architecture turns out to coincide with Qwen2, with the main difference being the presence of biases in attention projections in Ministral. 
- You can find the Ministral checkpoints under the [Mistral AI](https://huggingface.co/mistralai) organization. ## Usage diff --git a/docs/source/en/model_doc/mistral.md b/docs/source/en/model_doc/mistral.md index 3714f45e55a..865ee414532 100644 --- a/docs/source/en/model_doc/mistral.md +++ b/docs/source/en/model_doc/mistral.md @@ -86,7 +86,6 @@ echo -e "My favorite condiment is" | transformers chat mistralai/Mistral-7B-v0.3 - Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits. diff --git a/docs/source/en/model_doc/mistral3.md b/docs/source/en/model_doc/mistral3.md index 54af880ed46..4ac264ac985 100644 --- a/docs/source/en/model_doc/mistral3.md +++ b/docs/source/en/model_doc/mistral3.md @@ -27,7 +27,6 @@ rendered properly in your Markdown viewer. You can find the original Mistral 3 checkpoints under the [Mistral AI](https://huggingface.co/mistralai/models?search=mistral-small-3) organization. - > [!TIP] > This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez) and [yonigozlan](https://huggingface.co/yonigozlan). > Click on the Mistral3 models in the right sidebar for more examples of how to apply Mistral3 to different tasks. @@ -62,6 +61,7 @@ outputs = pipeline(text=messages, max_new_tokens=50, return_full_text=False) outputs[0]["generated_text"] 'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a' ``` + @@ -100,13 +100,15 @@ decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] : decoded_output 'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a' ``` + -## Notes +## Notes -- Mistral 3 supports text-only generation. -```py +- Mistral 3 supports text-only generation. + +```py import torch from transformers import AutoProcessor, AutoModelForImageTextToText, infer_device @@ -136,13 +138,16 @@ print(decoded_output) 5. Je me casse, à plus! ``` + /\_/\ ( o.o ) > ^ < + ```" ```` -- Mistral 3 accepts batched image and text inputs. +- Mistral 3 accepts batched image and text inputs. + ```py import torch from transformers import AutoProcessor, AutoModelForImageTextToText, infer_device @@ -184,7 +189,7 @@ messages = [ , "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"] ``` -- Mistral 3 also supported batched image and text inputs with a different number of images for each text. The example below quantizes the model with bitsandbytes. +- Mistral 3 also supported batched image and text inputs with a different number of images for each text. The example below quantizes the model with bitsandbytes. 
```py import torch diff --git a/docs/source/en/model_doc/mixtral.md b/docs/source/en/model_doc/mixtral.md index ff501cd1a84..7665b5901a6 100644 --- a/docs/source/en/model_doc/mixtral.md +++ b/docs/source/en/model_doc/mixtral.md @@ -39,7 +39,7 @@ Mixtral-8x7B is the second large language model (LLM) released by [mistral.ai](h Mixtral-8x7B is a decoder-only Transformer with the following architectural choices: - Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP, with a total of 45 billion parameters. To learn more about mixture-of-experts, refer to the [blog post](https://huggingface.co/blog/moe). -- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length. +- Despite the model having 45 billion parameters, the compute required for a single forward pass is the same as that of a 14 billion parameter model. This is because even though each of the experts have to be loaded in RAM (70B like ram requirement) each token from the hidden states are dispatched twice (top 2 routing) and thus the compute (the operation required at each forward computation) is just 2 X sequence_length. The following implementation details are shared with Mistral AI's first model [Mistral-7B](mistral): - Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens @@ -138,8 +138,8 @@ Below is a expected speedup diagram that compares pure inference time between th ### Sliding window Attention -The current implementation supports the sliding window attention mechanism and memory efficient cache management. -To enable sliding window attention, just make sure to have a `flash-attn` version that is compatible with sliding window attention (`>=2.3.0`). +The current implementation supports the sliding window attention mechanism and memory efficient cache management. +To enable sliding window attention, just make sure to have a `flash-attn` version that is compatible with sliding window attention (`>=2.3.0`). The Flash Attention-2 model uses also a more memory efficient cache slicing mechanism - as recommended per the official implementation of Mistral model that use rolling cache mechanism we keep the cache size fixed (`self.config.sliding_window`), support batched generation only for `padding_side="left"` and use the absolute position of the current token to compute the positional embedding. diff --git a/docs/source/en/model_doc/mlcd.md b/docs/source/en/model_doc/mlcd.md index 1ce785ee76b..7ff2fb434da 100644 --- a/docs/source/en/model_doc/mlcd.md +++ b/docs/source/en/model_doc/mlcd.md @@ -32,9 +32,9 @@ Tips: - We adopted the official [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) and the official training dataset [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) for evaluating the foundational visual models. -- The language model is [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). +- The language model is [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). 
-Result: +Result: | Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU | | :-------------------------------------------------------------------------------------------- | :----: | :-------- | :-------- | :-------- | :--------- | :-------- | @@ -45,7 +45,6 @@ Result: | **[MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)** | √ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 | | **[MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)** | √ | **73.80** | **83.34** | **46.59** | **582.00** | 46.00 | - ## Usage ```python diff --git a/docs/source/en/model_doc/mllama.md b/docs/source/en/model_doc/mllama.md index 1ea7f172bb3..a0fc5db41cf 100644 --- a/docs/source/en/model_doc/mllama.md +++ b/docs/source/en/model_doc/mllama.md @@ -35,15 +35,12 @@ The [Llama 3.2-Vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-ed - The text passed to the processor should have the `"<|image|>"` tokens where the images should be inserted. - The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor. If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it. - - Mllama has an extra token used as a placeholder for image positions in the text. It means that input ids and an input embedding layer will have an extra token. But since the weights for input and output embeddings are not tied, the `lm_head` layer has one less token and will fail if you want to calculate loss on image tokens or apply some logit processors. In case you are training, make sure to mask out special `"<|image|>"` tokens in the `labels` as the model should not be trained on predicting them. Otherwise if you see CUDA-side index errors when generating, use the below code to expand the `lm_head` by one more token. 
- ```python old_embeddings = model.get_output_embeddings() @@ -52,12 +49,13 @@ resized_embeddings = model._get_resized_lm_head(old_embeddings, new_num_tokens=n resized_embeddings.requires_grad_(old_embeddings.weight.requires_grad) model.set_output_embeddings(resized_embeddings) ``` - + ## Usage Example #### Instruct model + ```python import torch from transformers import MllamaForConditionalGeneration, AutoProcessor @@ -83,6 +81,7 @@ print(processor.decode(output[0])) ``` #### Base model + ```python import requests import torch @@ -102,7 +101,6 @@ output = model.generate(**inputs, do_sample=False, max_new_tokens=25) print(processor.decode(output[0], skip_special_tokens=True)) ``` - ## MllamaConfig [[autodoc]] MllamaConfig @@ -111,7 +109,6 @@ print(processor.decode(output[0], skip_special_tokens=True)) [[autodoc]] MllamaProcessor - ## MllamaImageProcessor [[autodoc]] MllamaImageProcessor diff --git a/docs/source/en/model_doc/mm-grounding-dino.md b/docs/source/en/model_doc/mm-grounding-dino.md index e411ef5defb..0d628c3b31d 100644 --- a/docs/source/en/model_doc/mm-grounding-dino.md +++ b/docs/source/en/model_doc/mm-grounding-dino.md @@ -100,7 +100,6 @@ for box, score, labels in zip(result["boxes"], result["scores"], result["labels" | [mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det) | O365,GoldG,V3Det | 33.0 | 36.0 | 45.9 | 40.5(+11.7) | 21.5 | 25.5 | 40.2 | 30.6(+10.5) | | [mm_grounding_dino_tiny_o365v1_goldg_grit_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit_v3det) | O365,GoldG,GRIT,V3Det | 34.2 | 37.4 | 46.2 | 41.4(+12.6) | 23.6 | 27.6 | 40.5 | 31.9(+11.8) | - - This implementation also supports inference for [LLMDet](https://github.com/iSEE-Laboratory/LLMDet). Here's a table of LLMDet models and their performance on LVIS (results from [official repo](https://github.com/iSEE-Laboratory/LLMDet)): | Model | Pre-Train Data | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | @@ -109,7 +108,6 @@ for box, score, labels in zip(result["boxes"], result["scores"], result["labels" | [llmdet_base](https://huggingface.co/iSEE-Laboratory/llmdet_base) | (O365,GoldG,V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 | | [llmdet_large](https://huggingface.co/iSEE-Laboratory/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 | - ## MMGroundingDinoConfig [[autodoc]] MMGroundingDinoConfig diff --git a/docs/source/en/model_doc/mms.md b/docs/source/en/model_doc/mms.md index 3ac351d0ddc..171beaf440d 100644 --- a/docs/source/en/model_doc/mms.md +++ b/docs/source/en/model_doc/mms.md @@ -376,6 +376,7 @@ detected_lang = model.config.id2label[lang_id] ``` To see all the supported languages of a checkpoint, you can print out the language ids as follows: + ```py processor.id2label.values() ``` diff --git a/docs/source/en/model_doc/mobilebert.md b/docs/source/en/model_doc/mobilebert.md index 4e3cc2e5d64..08486ace56e 100644 --- a/docs/source/en/model_doc/mobilebert.md +++ b/docs/source/en/model_doc/mobilebert.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2020-04-06 and added to Hugging Face Transformers on 2020-11-16.* -
PyTorch @@ -47,6 +46,7 @@ pipeline = pipeline( ) pipeline("The capital of France is [MASK].") ``` + @@ -85,7 +85,6 @@ echo -e "The capital of France is [MASK]." | transformers run --task fill-mask - - ## Notes - Inputs should be padded on the right because BERT uses absolute position embeddings. diff --git a/docs/source/en/model_doc/mobilenet_v1.md b/docs/source/en/model_doc/mobilenet_v1.md index c77bef73042..809be7f652a 100644 --- a/docs/source/en/model_doc/mobilenet_v1.md +++ b/docs/source/en/model_doc/mobilenet_v1.md @@ -32,7 +32,6 @@ You can all the original MobileNet checkpoints under the [Google](https://huggin The example below demonstrates how to classify an image with [`Pipeline`] or the [`AutoModel`] class. - @@ -84,18 +83,19 @@ print(f"The predicted class label is: {predicted_class_label}") - ## Notes - Checkpoint names follow the pattern `mobilenet_v1_{depth_multiplier}_{resolution}`, like `mobilenet_v1_1.0_224`. `1.0` is the depth multiplier and `224` is the image resolution. - While trained on images of a specific sizes, the model architecture works with images of different sizes (minimum 32x32). The [`MobileNetV1ImageProcessor`] handles the necessary preprocessing. - MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. The additional class is an extra "background" class (index 0). - The original TensorFlow checkpoints determines the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV1Config`]. + ```python from transformers import MobileNetV1Config config = MobileNetV1Config.from_pretrained("google/mobilenet_v1_1.0_224", tf_padding=True) ``` + - The Transformers implementation does not support the following features. - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel. - Does not support other `output_stride` values (fixed at 32). For smaller `output_strides`, the original implementation uses dilated convolution to prevent spatial resolution from being reduced further. (which would require dilated convolutions). diff --git a/docs/source/en/model_doc/mobilenet_v2.md b/docs/source/en/model_doc/mobilenet_v2.md index 3e1379e3f07..2039f9e4413 100644 --- a/docs/source/en/model_doc/mobilenet_v2.md +++ b/docs/source/en/model_doc/mobilenet_v2.md @@ -30,10 +30,8 @@ You can all the original MobileNet checkpoints under the [Google](https://huggin > [!TIP] > Click on the MobileNet V2 models in the right sidebar for more examples of how to apply MobileNet to different vision tasks. - The examples below demonstrate how to classify an image with [`Pipeline`] or the [`AutoModel`] class. - @@ -82,7 +80,6 @@ print(f"The predicted class label is: {predicted_class_label}") - ## Notes - Classification checkpoint names follow the pattern `mobilenet_v2_{depth_multiplier}_{resolution}`, like `mobilenet_v2_1.4_224`. `1.4` is the depth multiplier and `224` is the image resolution. Segmentation checkpoint names follow the pattern `deeplabv3_mobilenet_v2_{depth_multiplier}_{resolution}`. @@ -90,11 +87,13 @@ print(f"The predicted class label is: {predicted_class_label}") - MobileNet is pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset with 1000 classes. However, the model actually predicts 1001 classes. 
The additional class is an extra "background" class (index 0). - The segmentation models use a [DeepLabV3+](https://huggingface.co/papers/1802.02611) head which is often pretrained on datasets like [PASCAL VOC](https://huggingface.co/datasets/merve/pascal-voc). - The original TensorFlow checkpoints determines the padding amount at inference because it depends on the input image size. To use the native PyTorch padding behavior, set `tf_padding=False` in [`MobileNetV2Config`]. + ```python from transformers import MobileNetV2Config config = MobileNetV2Config.from_pretrained("google/mobilenet_v2_1.4_224", tf_padding=True) ``` + - The Transformers implementation does not support the following features. - Uses global average pooling instead of the optional 7x7 average pooling with stride 2. For larger inputs, this gives a pooled output that is larger than a 1x1 pixel. - `output_hidden_states=True` returns *all* intermediate hidden states. It is not possible to extract the output from specific layers for other downstream purposes. diff --git a/docs/source/en/model_doc/mobilevit.md b/docs/source/en/model_doc/mobilevit.md index b4a51bd200f..9975cf68155 100644 --- a/docs/source/en/model_doc/mobilevit.md +++ b/docs/source/en/model_doc/mobilevit.md @@ -11,11 +11,8 @@ Unless required by applicable law or agreed to in writing, software distributed --> *This model was released on 2021-10-05 and added to Hugging Face Transformers on 2022-06-29.* - - # MobileViT -
PyTorch @@ -24,21 +21,17 @@ Unless required by applicable law or agreed to in writing, software distributed [MobileViT](https://huggingface.co/papers/2110.02178) is a lightweight vision transformer for mobile devices that merges CNNs's efficiency and inductive biases with transformers global context modeling. It treats transformers as convolutions, enabling global information processing without the heavy computational cost of standard ViTs. -
- You can find all the original MobileViT checkpoints under the [Apple](https://huggingface.co/apple/models?search=mobilevit) organization. - > [!TIP] > - This model was contributed by [matthijs](https://huggingface.co/Matthijs). > > Click on the MobileViT models in the right sidebar for more examples of how to apply MobileViT to different vision tasks. - The example below demonstrates how to do [Image Classification] with [`Pipeline`] and the [`AutoModel`] class. @@ -92,7 +85,6 @@ print(f"The predicted class label is:{predicted_class_label}") - ## Notes - Does **not** operate on sequential data, it's purely designed for image tasks. @@ -102,8 +94,6 @@ print(f"The predicted class label is:{predicted_class_label}") - The classification models are pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). - The segmentation models use a [DeepLabV3](https://huggingface.co/papers/1706.05587) head and are pretrained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/). - - ## MobileViTConfig [[autodoc]] MobileViTConfig diff --git a/docs/source/en/model_doc/modernbert-decoder.md b/docs/source/en/model_doc/modernbert-decoder.md index 050cae27646..ff61362a520 100644 --- a/docs/source/en/model_doc/modernbert-decoder.md +++ b/docs/source/en/model_doc/modernbert-decoder.md @@ -36,7 +36,7 @@ You can find all the original ModernBERT Decoder checkpoints under the [jhu-clsp > > Click on the ModernBERT Decoder models in the right sidebar for more examples of how to apply ModernBERT Decoder to different text generation tasks. -The example below demonstrates how to use ModernBERT Decoder for text generation with [`Pipeline`], [`AutoModel`] (with and without quantization), and from the command line. +The example below demonstrates how to use ModernBERT Decoder for text generation with [`Pipeline`], [`AutoModel`] (with and without quantization), and from the command line. @@ -151,6 +151,7 @@ with torch.no_grad(): generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) print(f"Generated text: {generated_text}") ``` + @@ -162,12 +163,10 @@ echo "The future of artificial intelligence is" | transformers run --task text-g - ## ModernBertDecoderConfig [[autodoc]] ModernBertDecoderConfig - ## ModernBertDecoderModel [[autodoc]] ModernBertDecoderModel @@ -182,4 +181,3 @@ echo "The future of artificial intelligence is" | transformers run --task text-g [[autodoc]] ModernBertDecoderForSequenceClassification - forward - diff --git a/docs/source/en/model_doc/modernbert.md b/docs/source/en/model_doc/modernbert.md index 872da561fbf..4be8d97f5e9 100644 --- a/docs/source/en/model_doc/modernbert.md +++ b/docs/source/en/model_doc/modernbert.md @@ -93,7 +93,6 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran [[autodoc]] ModernBertConfig - ## ModernBertModel [[autodoc]] ModernBertModel @@ -127,5 +126,3 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran ### Usage tips The ModernBert model can be fine-tuned using the HuggingFace Transformers library with its [official script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py) for question-answering tasks. 
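As a complement to the note above about fine-tuning ModernBERT for question answering, here is a minimal, hedged sketch that loads the base checkpoint with a freshly initialized question-answering head, assuming your installed version ships `ModernBertForQuestionAnswering`; the head still needs fine-tuning (for example with the official `run_qa.py` script) before it produces useful answers.

```python
# Minimal sketch: the QA head below is randomly initialized and must be fine-tuned
# (e.g. with examples/pytorch/question-answering/run_qa.py) before it is useful.
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

inputs = tokenizer(
    "What does ModernBERT use for positional information?",   # question
    "ModernBERT uses rotary positional embeddings and supports long contexts.",  # context
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.start_logits.shape, outputs.end_logits.shape)  # one logit per token for start/end
```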
- - diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 7abe123b88e..b85a174a86f 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -83,6 +83,7 @@ predicted_ids = model.generate(**input_features, cache_implementation="static") transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) transcription[0] ``` + @@ -101,4 +102,3 @@ transcription[0] [[autodoc]] MoonshineForConditionalGeneration - forward - generate - diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index e17a1b7b8b1..49fae1c539d 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -35,7 +35,7 @@ Moshi is a speech-text foundation model that casts spoken dialogue as speech-to- The abstract from the paper is the following: -*We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.* +*We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. 
Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.* Moshi deals with 3 streams of information: 1. The user's audio @@ -63,11 +63,9 @@ Note that each timestamp - i.e each codebook - gets its own set of Linear Layers It's the audio encoder from Kyutai, that has recently been integrated to transformers, which is used to "tokenize" audio. It has the same use that [`~EncodecModel`] has in [`~MusicgenModel`]. - ## Tips: -The original checkpoints can be converted using the conversion script `src/transformers/models/moshi/convert_moshi_transformers.py` - +The original checkpoints can be converted using the conversion script `src/transformers/models/moshi/convert_moshi_transformers.py` ### How to use the model: @@ -108,12 +106,9 @@ To follow the example of the following image, `"Hello, I'm Moshi"` could be tran
- [`MoshiForConditionalGeneration.generate`] then auto-regressively feeds to itself its own audio stream, but since it doesn't have access to the user input stream while using `transformers`, it will thus **assume that the user is producing blank audio**. - - -```python +```python >>> from datasets import load_dataset, Audio >>> import torch, math >>> from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer, infer_device @@ -149,7 +144,7 @@ To follow the example of the following image, `"Hello, I'm Moshi"` could be tran Most of the work has to be done during data creation/pre-processing, because of the need to align/synchronize streams. Once it's done, you can simply forward `text_labels` and `audio_labels` to [`MoshiForConditionalGeneration.forward`], alongside the usual inputs, to get the model loss. - + A training guide will come soon, but user contributions are welcomed! ### How does the model forward the inputs / generate: @@ -162,13 +157,10 @@ A training guide will come soon, but user contributions are welcomed! 3. The depth decoder switches the dimension on which we forward / generate (codebooks instead of time). It uses the token generated from `text logits` and the `temporal context` to auto-regressively generate audio codebooks. - This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/kyutai-labs/moshi). - - ## MoshiConfig [[autodoc]] MoshiConfig diff --git a/docs/source/en/model_doc/mpt.md b/docs/source/en/model_doc/mpt.md index 9482e6a9195..60d14641177 100644 --- a/docs/source/en/model_doc/mpt.md +++ b/docs/source/en/model_doc/mpt.md @@ -23,11 +23,11 @@ rendered properly in your Markdown viewer. ## Overview -The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models are a series of open source and commercially usable LLMs pre-trained on 1T tokens. +The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models are a series of open source and commercially usable LLMs pre-trained on 1T tokens. -MPT models are GPT-style decoder-only transformers with several improvements: performance-optimized layer implementations, architecture changes that provide greater training stability, and the elimination of context length limits by replacing positional embeddings with ALiBi. +MPT models are GPT-style decoder-only transformers with several improvements: performance-optimized layer implementations, architecture changes that provide greater training stability, and the elimination of context length limits by replacing positional embeddings with ALiBi. -- MPT base: MPT base pre-trained models on next token prediction +- MPT base: MPT base pre-trained models on next token prediction - MPT instruct: MPT base models fine-tuned on instruction based tasks - MPT storywriter: MPT base models fine-tuned for 2500 steps on 65k-token excerpts of fiction books contained in the books3 corpus, this enables the model to handle very long sequences diff --git a/docs/source/en/model_doc/mt5.md b/docs/source/en/model_doc/mt5.md index fa02ee4c3c0..4e652458e1b 100644 --- a/docs/source/en/model_doc/mt5.md +++ b/docs/source/en/model_doc/mt5.md @@ -133,7 +133,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) See [`T5Tokenizer`] for all details. 
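Since the section above defers to [`T5Tokenizer`] for the details, here is a tiny, hedged illustration of the shared SentencePiece-based tokenizer interface, using the public `google/mt5-small` checkpoint as an example.

```python
# Small illustration only: mT5 reuses the T5-style SentencePiece tokenizer interface.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
print(tokenizer.tokenize("Hugging Face est une entreprise."))       # multilingual subword pieces
print(tokenizer("Hugging Face est une entreprise.").input_ids)      # corresponding token ids
```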
- ## MT5TokenizerFast [[autodoc]] MT5TokenizerFast diff --git a/docs/source/en/model_doc/musicgen.md b/docs/source/en/model_doc/musicgen.md index 7e91b2265fe..0ec3cb200d1 100644 --- a/docs/source/en/model_doc/musicgen.md +++ b/docs/source/en/model_doc/musicgen.md @@ -77,9 +77,9 @@ Generation is limited by the sinusoidal positional embeddings to 30 second input than 30 seconds of audio (1503 tokens), and input audio passed by Audio-Prompted Generation contributes to this limit so, given an input of 20 seconds of audio, MusicGen cannot generate more than 10 seconds of additional audio. -Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen. The mono channel versions -generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), -and each set of codebooks is decoded independently through the audio compression model. The audio streams for each +Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen. The mono channel versions +generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), +and each set of codebooks is decoded independently through the audio compression model. The audio streams for each channel are combined to give the final stereo output. ### Unconditional Generation @@ -208,7 +208,7 @@ For batched audio-prompted generation, the generated `audio_values` can be post- ### Generation Configuration -The default parameters that control the generation process, such as sampling, guidance scale and number of generated +The default parameters that control the generation process, such as sampling, guidance scale and number of generated tokens, can be found in the model's generation config, and updated as desired: ```python @@ -226,8 +226,8 @@ tokens, can be found in the model's generation config, and updated as desired: >>> model.generation_config.max_length = 256 ``` -Note that any arguments passed to the generate method will **supersede** those in the generation config, so setting -`do_sample=False` in the call to generate will supersede the setting of `model.generation_config.do_sample` in the +Note that any arguments passed to the generate method will **supersede** those in the generation config, so setting +`do_sample=False` in the call to generate will supersede the setting of `model.generation_config.do_sample` in the generation config. ## Model Structure @@ -239,7 +239,7 @@ The MusicGen model can be de-composed into three distinct stages: Thus, the MusicGen model can either be used as a standalone decoder model, corresponding to the class [`MusicgenForCausalLM`], or as a composite model that includes the text encoder and audio encoder/decoder, corresponding to the class -[`MusicgenForConditionalGeneration`]. If only the decoder needs to be loaded from the pre-trained checkpoint, it can be loaded by first +[`MusicgenForConditionalGeneration`]. If only the decoder needs to be loaded from the pre-trained checkpoint, it can be loaded by first specifying the correct config, or be accessed through the `.decoder` attribute of the composite model: ```python diff --git a/docs/source/en/model_doc/musicgen_melody.md b/docs/source/en/model_doc/musicgen_melody.md index d2cd51bbcf2..f43bfee4334 100644 --- a/docs/source/en/model_doc/musicgen_melody.md +++ b/docs/source/en/model_doc/musicgen_melody.md @@ -35,10 +35,8 @@ The abstract from the paper is the following: *We tackle the task of conditional music generation. 
We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen.* - This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/audiocraft). The pre-trained checkpoints can be found on the [Hugging Face Hub](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen). - ## Difference with [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen) There are two key differences with MusicGen: @@ -54,7 +52,6 @@ MusicGen Melody is compatible with two generation modes: greedy and sampling. In Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen Melody. The mono channel versions generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), and each set of codebooks is decoded independently through the audio compression model. The audio streams for each channel are combined to give the final stereo output. - #### Audio Conditional Generation The model can generate an audio sample conditioned on a text and an audio prompt through use of the [`MusicgenMelodyProcessor`] to pre-process the inputs. @@ -67,6 +64,7 @@ pip install datasets[audio] ``` The audio file we are about to use is loaded as follows: + ```python >>> from datasets import load_dataset @@ -147,10 +145,9 @@ Or save them as a `.wav` file using a third-party library, e.g. `soundfile`: >>> sf.write("musicgen_out.wav", audio_values[0].T.numpy(), sampling_rate) ``` - ### Text-only Conditional Generation -The same [`MusicgenMelodyProcessor`] can be used to pre-process a text-only prompt. +The same [`MusicgenMelodyProcessor`] can be used to pre-process a text-only prompt. ```python >>> from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration @@ -168,7 +165,6 @@ The same [`MusicgenMelodyProcessor`] can be used to pre-process a text-only prom The `guidance_scale` is used in classifier free guidance (CFG), setting the weighting between the conditional logits (which are predicted from the text prompts) and the unconditional logits (which are predicted from an unconditional or 'null' prompt). Higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled by setting `guidance_scale > 1`. For best results, use `guidance_scale=3` (default). - You can also generate in batch: ```python @@ -263,7 +259,6 @@ Tips: * MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model. 
* Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenMelodyForConditionalGeneration.generate`] - ## MusicgenMelodyDecoderConfig [[autodoc]] MusicgenMelodyDecoderConfig diff --git a/docs/source/en/model_doc/mvp.md b/docs/source/en/model_doc/mvp.md index 2cce9bd6cac..26aa2f29b76 100644 --- a/docs/source/en/model_doc/mvp.md +++ b/docs/source/en/model_doc/mvp.md @@ -25,7 +25,6 @@ rendered properly in your Markdown viewer. The MVP model was proposed in [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://huggingface.co/papers/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. - According to the abstract, - MVP follows a standard Transformer encoder-decoder architecture. @@ -67,6 +66,7 @@ For summarization, it is an example to use MVP and MVP with summarization-specif ``` For data-to-text generation, it is an example to use MVP and multi-task pre-trained variants. + ```python >>> from transformers import MvpTokenizerFast, MvpForConditionalGeneration diff --git a/docs/source/en/model_doc/myt5.md b/docs/source/en/model_doc/myt5.md index 40973575125..35ab716a8e7 100644 --- a/docs/source/en/model_doc/myt5.md +++ b/docs/source/en/model_doc/myt5.md @@ -44,4 +44,3 @@ The original code can be found [here](https://github.com/tomlimi/MYTE). ## MyT5Tokenizer [[autodoc]] MyT5Tokenizer - diff --git a/docs/source/en/model_doc/nemotron.md b/docs/source/en/model_doc/nemotron.md index 360a6ba2226..0a2104c5855 100644 --- a/docs/source/en/model_doc/nemotron.md +++ b/docs/source/en/model_doc/nemotron.md @@ -97,7 +97,6 @@ Minitron is released under the [NVIDIA Open Model License Agreement](https://dev | :------------- | :------------- | :------------- | :------------- | :------------- | | 75.0 | 74.0 | 24.1 | 50.9 | 29.5 - *Code generation performance*. 
Evaluated using [HumanEval](https://github.com/openai/human-eval): | p@1, 0-Shot | @@ -109,6 +108,7 @@ Please refer to our [paper](https://huggingface.co/papers/2407.14679) for the fu ### Citation If you find our work helpful, please consider citing our paper: + ``` @article{minitron2024, title={Compact Language Models via Pruning and Knowledge Distillation}, @@ -123,13 +123,11 @@ If you find our work helpful, please consider citing our paper: [[autodoc]] NemotronConfig - ## NemotronModel [[autodoc]] NemotronModel - forward - ## NemotronForCausalLM [[autodoc]] NemotronForCausalLM @@ -140,13 +138,11 @@ If you find our work helpful, please consider citing our paper: [[autodoc]] NemotronForSequenceClassification - forward - ## NemotronForQuestionAnswering [[autodoc]] NemotronForQuestionAnswering - forward - ## NemotronForTokenClassification [[autodoc]] NemotronForTokenClassification diff --git a/docs/source/en/model_doc/nllb-moe.md b/docs/source/en/model_doc/nllb-moe.md index f1456ee402d..d8c44a5fc0f 100644 --- a/docs/source/en/model_doc/nllb-moe.md +++ b/docs/source/en/model_doc/nllb-moe.md @@ -110,7 +110,6 @@ See example below for a translation from romanian to german: - [Translation task guide](../tasks/translation) - [Summarization task guide](../tasks/summarization) - ## NllbMoeConfig [[autodoc]] NllbMoeConfig @@ -135,4 +134,3 @@ See example below for a translation from romanian to german: [[autodoc]] NllbMoeForConditionalGeneration - forward - diff --git a/docs/source/en/model_doc/nllb.md b/docs/source/en/model_doc/nllb.md index 6f12a3aa746..77fffafde67 100644 --- a/docs/source/en/model_doc/nllb.md +++ b/docs/source/en/model_doc/nllb.md @@ -29,7 +29,6 @@ rendered properly in your Markdown viewer. [NLLB: No Language Left Behind](https://huggingface.co/papers/2207.04672) is a multilingual translation model. It's trained on data using data mining techniques tailored for low-resource languages and supports over 200 languages. NLLB features a conditional compute architecture using a Sparsely Gated Mixture of Experts. - You can find all the original NLLB checkpoints under the [AI at Meta](https://huggingface.co/facebook/models?search=nllb) organization. > [!TIP] @@ -132,6 +131,7 @@ visualizer("UN Chief says there is no military solution in Syria") - For non-English languages, specify the language's [BCP-47](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) code with the `src_lang` keyword as shown below. - See example below for a translation from Romanian to German. + ```python >>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer diff --git a/docs/source/en/model_doc/olmo2.md b/docs/source/en/model_doc/olmo2.md index bf582bc2ef5..7ecaa0e98fa 100644 --- a/docs/source/en/model_doc/olmo2.md +++ b/docs/source/en/model_doc/olmo2.md @@ -87,6 +87,7 @@ echo -e "Plants create energy through a process known as" | transformers run --t Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. The example below uses [torchao](../quantization/torchao) to only quantize the weights to 4-bits. + ```py #pip install torchao @@ -116,7 +117,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` - ## Notes - OLMo2 uses RMSNorm instead of standard layer norm. The RMSNorm is applied to attention queries and keys, and it is applied after the attention and feedforward layers rather than before. 
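To make the norm-placement note above concrete, the short inspection sketch below loads the 1B checkpoint (the same id used elsewhere in this document) and prints one decoder layer, which should expose the query/key RMSNorm modules and the norms applied after attention and feedforward. This is a quick inspection sketch, not an official recipe.

```python
# Inspection sketch: print one decoder layer to see where the RMSNorm modules sit
# (q/k norms inside attention, norms applied after the attention and feedforward blocks).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B")
print(model.model.layers[0])
```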
@@ -129,7 +129,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B", revision="stage1-step140000-tokens294B") ``` - ## Olmo2Config [[autodoc]] Olmo2Config diff --git a/docs/source/en/model_doc/olmo3.md b/docs/source/en/model_doc/olmo3.md index ecf384ee7cc..57f3309e748 100644 --- a/docs/source/en/model_doc/olmo3.md +++ b/docs/source/en/model_doc/olmo3.md @@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. - ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. --> @@ -88,6 +87,7 @@ echo -e "Plants create energy through a process known as" | transformers run --t Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. The example below uses [torchao](../quantization/torchao) to only quantize the weights to 4-bits. + ```py #pip install torchao @@ -117,7 +117,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` - ## Notes - Load specific intermediate checkpoints by adding the `revision` parameter to [`~PreTrainedModel.from_pretrained`]. @@ -128,7 +127,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) model = AutoModelForCausalLM.from_pretrained("allenai/TBA", revision="stage1-step140000-tokens294B") ``` - ## Olmo3Config [[autodoc]] Olmo3Config diff --git a/docs/source/en/model_doc/openai-gpt.md b/docs/source/en/model_doc/openai-gpt.md index b45b205e259..fba08ceca00 100644 --- a/docs/source/en/model_doc/openai-gpt.md +++ b/docs/source/en/model_doc/openai-gpt.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2018-06-11 and added to Hugging Face Transformers on 2023-06-20.* -
PyTorch @@ -24,8 +23,6 @@ rendered properly in your Markdown viewer.
- - # GPT [GPT (Generative Pre-trained Transformer)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) ([blog post](https://openai.com/index/language-unsupervised/)) focuses on effectively learning text representations and transferring them to downstream tasks. The model trains a Transformer decoder to predict the next word and is then fine-tuned on labeled data. @@ -39,12 +36,9 @@ You can find all the original GPT checkpoints under the [OpenAI community](https The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line. - - - ```python import torch from transformers import pipeline @@ -75,6 +69,7 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True)) echo -e "The future of AI is" | transformers run --task text-generation --model openai-community/openai-gpt --device 0 ``` + 
```py diff --git a/docs/source/en/model_doc/patchtsmixer.md b/docs/source/en/model_doc/patchtsmixer.md index 5541f4d8093..23ebb89b6ad 100644 --- a/docs/source/en/model_doc/patchtsmixer.md +++ b/docs/source/en/model_doc/patchtsmixer.md @@ -25,15 +25,13 @@ rendered properly in your Markdown viewer. The PatchTSMixer model was proposed in [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://huggingface.co/papers/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong and Jayant Kalagnanam. - PatchTSMixer is a lightweight time-series modeling approach based on the MLP-Mixer architecture. In this HuggingFace implementation, we provide PatchTSMixer's capabilities to effortlessly facilitate lightweight mixing across patches, channels, and hidden features for effective multivariate time-series modeling. It also supports various attention mechanisms starting from simple gated attention to more complex self-attention blocks that can be customized accordingly. The model can be pretrained and subsequently used for various downstream tasks such as forecasting, classification and regression. - The abstract from the paper is the following: *TSMixer is a lightweight neural architecture exclusively composed of multi-layer perceptron (MLP) modules designed for multivariate forecasting and representation learning on patched time series. Our model draws inspiration from the success of MLP-Mixer models in computer vision. We demonstrate the challenges involved in adapting Vision MLP-Mixer for time series and introduce empirically validated components to enhance accuracy. This includes a novel design paradigm of attaching online reconciliation heads to the MLP-Mixer backbone, for explicitly modeling the time-series properties such as hierarchy and channel-correlations. We also propose a Hybrid channel modeling approach to effectively handle noisy channel interactions and generalization across diverse datasets, a common challenge in existing patch channel-mixing methods. Additionally, a simple gated attention mechanism is introduced in the backbone to prioritize important features. By incorporating these lightweight components, we significantly enhance the learning capability of simple MLP structures, outperforming complex Transformer models with minimal computing usage. Moreover, TSMixer's modular design enables compatibility with both supervised and masked self-supervised learning methods, making it a promising building block for time-series Foundation Models. TSMixer outperforms state-of-the-art MLP and Transformer models in forecasting by a considerable margin of 8-60%. It also outperforms the latest strong benchmarks of Patch-Transformer models (by 1-2%) with a significant reduction in memory and runtime (2-3X).* -This model was contributed by [ajati](https://huggingface.co/ajati), [vijaye12](https://huggingface.co/vijaye12), +This model was contributed by [ajati](https://huggingface.co/ajati), [vijaye12](https://huggingface.co/vijaye12), [gsinthong](https://huggingface.co/gsinthong), [namctin](https://huggingface.co/namctin), [wmgifford](https://huggingface.co/wmgifford), [kashif](https://huggingface.co/kashif). 
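To make the forecasting use case above concrete, here is a minimal, untrained sketch that builds a small PatchTSMixer prediction model from a config and runs a random batch through it. The configuration values are arbitrary and only meant to show the expected input layout (`batch, context_length, num_input_channels`).

```python
# Untrained sketch with arbitrary sizes: shows the expected input layout for forecasting.
import torch
from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction

config = PatchTSMixerConfig(
    context_length=64,      # length of the input window
    prediction_length=16,   # number of future steps to forecast
    num_input_channels=3,   # multivariate series with 3 channels
    patch_length=8,
    patch_stride=8,
)
model = PatchTSMixerForPrediction(config)

past_values = torch.randn(2, config.context_length, config.num_input_channels)
outputs = model(past_values=past_values)
# Print the shapes of the returned tensors; the forecast should be (2, 16, 3).
print({name: tuple(value.shape) for name, value in outputs.items() if hasattr(value, "shape")})
```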
@@ -68,31 +66,26 @@ The model can also be used for time series classification and time series regres [[autodoc]] PatchTSMixerConfig - ## PatchTSMixerModel [[autodoc]] PatchTSMixerModel - forward - ## PatchTSMixerForPrediction [[autodoc]] PatchTSMixerForPrediction - forward - ## PatchTSMixerForTimeSeriesClassification [[autodoc]] PatchTSMixerForTimeSeriesClassification - forward - ## PatchTSMixerForPretraining [[autodoc]] PatchTSMixerForPretraining - forward - ## PatchTSMixerForRegression [[autodoc]] PatchTSMixerForRegression diff --git a/docs/source/en/model_doc/pegasus_x.md b/docs/source/en/model_doc/pegasus_x.md index 4f048e5496c..783581ad96d 100644 --- a/docs/source/en/model_doc/pegasus_x.md +++ b/docs/source/en/model_doc/pegasus_x.md @@ -53,6 +53,7 @@ Through photosynthesis, plants capture energy from sunlight using a green pigmen These ingredients are then transformed into glucose, a type of sugar that serves as a source of chemical energy, and oxygen, which is released as a byproduct into the atmosphere. The glucose produced during photosynthesis is not just used immediately; plants also store it as starch or convert it into other organic compounds like cellulose, which is essential for building their cellular structure. This energy reserve allows them to grow, develop leaves, produce flowers, bear fruit, and carry out various physiological processes throughout their lifecycle.""") ``` + @@ -78,12 +79,14 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = model.generate(**input_ids, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + ```bash echo -e "Plants are among the most remarkable and essential life forms on Earth, possessing a unique ability to produce their own food through a process known as photosynthesis. This complex biochemical process is fundamental not only to plant life but to virtually all life on the planet. Through photosynthesis, plants capture energy from sunlight using a green pigment called chlorophyll, which is located in specialized cell structures called chloroplasts." | transformers run --task summarization --model google/pegasus-x-large --device 0 ``` + diff --git a/docs/source/en/model_doc/perception_lm.md b/docs/source/en/model_doc/perception_lm.md index ee6b63fce6f..7d3d608253f 100644 --- a/docs/source/en/model_doc/perception_lm.md +++ b/docs/source/en/model_doc/perception_lm.md @@ -38,11 +38,9 @@ video captions. Additionally, we introduce PLM–VideoBench, a suite for evaluat understanding tasks focusing on the ability to reason about “what”, “where”, “when”, and “how” of a video. We make our work fully reproducible by providing data, training recipes, code & models.* - This model was contributed by [shumingh](https://huggingface.co/shumingh). The original code can be found [here](https://github.com/facebookresearch/perception_models). - ## PerceptionLMConfig [[autodoc]] PerceptionLMConfig diff --git a/docs/source/en/model_doc/persimmon.md b/docs/source/en/model_doc/persimmon.md index 764c959879a..854eaee835d 100644 --- a/docs/source/en/model_doc/persimmon.md +++ b/docs/source/en/model_doc/persimmon.md @@ -39,7 +39,7 @@ The original code can be found [here](https://github.com/persimmon-ai-labs/adept The `Persimmon` models were trained using `bfloat16`, but the original inference uses `float16` The checkpoints uploaded on the hub use `dtype = 'float16'` which will be -used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. 
+used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. The `dtype` of the online weights is mostly irrelevant, unless you are using `dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online) and then cast to the default `dtype` of `torch` (`torch.float32`). Users should specify the `dtype` they want; if they don't, it will be `torch.float32`. @@ -47,7 +47,6 @@ Finetuning the model in `float16` is not recommended and known to produce `nan`, - Tips: - To convert the model, you need to clone the original repository using `git clone https://github.com/persimmon-ai-labs/adept-inference`, then get the checkpoints: @@ -62,6 +61,7 @@ python src/transformers/models/persimmon/convert_persimmon_weights_to_hf.py --i ``` For the chat model: + ```bash wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar tar -xvf 8b_base_model_release.tar @@ -76,13 +76,11 @@ model = PersimmonForCausalLM.from_pretrained("/output/path") tokenizer = PersimmonTokenizer.from_pretrained("/output/path") ``` - - Persimmon uses a `sentencepiece`-based tokenizer, with a `Unigram` model. It supports bytefallback, which is only available in `tokenizers==0.14.0` for the fast tokenizer. The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece. The `chat` template will be updated with the templating functions in a follow-up PR! - The authors suggest using the following prompt format for the chat mode: `f"human: {prompt}\n\nadept:"` - ## PersimmonConfig [[autodoc]] PersimmonConfig diff --git a/docs/source/en/model_doc/phi3.md index 020b2643193..9a045e6f184 100644 --- a/docs/source/en/model_doc/phi3.md +++ b/docs/source/en/model_doc/phi3.md @@ -72,7 +72,6 @@ Phi-3 has been integrated in the development version (4.40.0.dev) of `transforme [[autodoc]] Phi3Config - ## Phi3Model [[autodoc]] Phi3Model @@ -93,4 +92,3 @@ Phi-3 has been integrated in the development version (4.40.0.dev) of `transforme [[autodoc]] Phi3ForTokenClassification - forward - diff --git a/docs/source/en/model_doc/phimoe.md index a564eb6145a..3d414d7c43b 100644 --- a/docs/source/en/model_doc/phimoe.md +++ b/docs/source/en/model_doc/phimoe.md @@ -50,6 +50,7 @@ Phi-3.5-MoE-instruct has been integrated in the development version (4.44.2.dev) The current `transformers` version can be verified with: `pip list | grep transformers`. Examples of required packages: + ``` flash_attn==2.5.8 torch==2.3.1 @@ -101,7 +102,6 @@ print(output[0]['generated_text']) [[autodoc]] PhimoeConfig - ## PhimoeModel [[autodoc]] PhimoeModel @@ -117,4 +117,3 @@ print(output[0]['generated_text']) [[autodoc]] PhimoeForSequenceClassification - forward - diff --git a/docs/source/en/model_doc/pixtral.md index 55ba0908429..bb175973bd2 100644 --- a/docs/source/en/model_doc/pixtral.md +++ b/docs/source/en/model_doc/pixtral.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2024-09-17 and added to Hugging Face Transformers on 2024-09-14.* -
PyTorch diff --git a/docs/source/en/model_doc/pop2piano.md b/docs/source/en/model_doc/pop2piano.md index 5f68b180500..90e0cd3f063 100644 --- a/docs/source/en/model_doc/pop2piano.md +++ b/docs/source/en/model_doc/pop2piano.md @@ -21,14 +21,14 @@ specific language governing permissions and limitations under the License. The Pop2Piano model was proposed in [Pop2Piano : Pop Audio-based Piano Cover Generation](https://huggingface.co/papers/2211.00895) by Jongho Choi and Kyogu Lee. -Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great -expertise with playing piano as well as knowing different characteristics and melodies of a song. With Pop2Piano you -can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover -from pop audio without melody and chord extraction modules. +Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great +expertise with playing piano as well as knowing different characteristics and melodies of a song. With Pop2Piano you +can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover +from pop audio without melody and chord extraction modules. -Pop2Piano is an encoder-decoder Transformer model based on [T5](https://huggingface.co/papers/1910.10683). The input audio -is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder -uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four +Pop2Piano is an encoder-decoder Transformer model based on [T5](https://huggingface.co/papers/1910.10683). The input audio +is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder +uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file. The abstract from the paper is the following: @@ -53,9 +53,11 @@ The original code can be found [here](https://github.com/sweetcocoa/pop2piano). ## Usage tips * To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules: + ```bash pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy ``` + Please note that you may need to restart your runtime after installation. * Pop2Piano is an Encoder-Decoder based model like T5. * Pop2Piano can be used to generate midi-audio files for a given audio sequence. @@ -131,7 +133,6 @@ Please note that you may need to restart your runtime after installation. >>> tokenizer_output[1].write("./Outputs/midi_output2.mid") ``` - - Example of processing multiple audio files in batch (Using `Pop2PianoFeatureExtractor` and `Pop2PianoTokenizer`): ```python @@ -166,7 +167,6 @@ Please note that you may need to restart your runtime after installation. 
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid") ``` - ## Pop2PianoConfig [[autodoc]] Pop2PianoConfig diff --git a/docs/source/en/model_doc/prompt_depth_anything.md b/docs/source/en/model_doc/prompt_depth_anything.md index 5af13c5d630..0ac26609b4d 100644 --- a/docs/source/en/model_doc/prompt_depth_anything.md +++ b/docs/source/en/model_doc/prompt_depth_anything.md @@ -19,8 +19,7 @@ rendered properly in your Markdown viewer. ## Overview -The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation](https://huggingface.co/papers/2412.14015) by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang. - +The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation](https://huggingface.co/papers/2412.14015) by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang. The abstract from the paper is as follows: diff --git a/docs/source/en/model_doc/pvt.md b/docs/source/en/model_doc/pvt.md index e7902affe5f..38858db5552 100644 --- a/docs/source/en/model_doc/pvt.md +++ b/docs/source/en/model_doc/pvt.md @@ -29,23 +29,22 @@ is used to further reduce the resource consumption when learning high-resolution The abstract from the paper is the following: -*Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a -simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision -Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer -(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several -merits compared to current state of the arts. Different from ViT that typically yields low resolution outputs and -incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high -output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the -computations of large feature maps. PVT inherits the advantages of both CNN and Transformer, making it a unified -backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. +*Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a +simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision +Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer +(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several +merits compared to current state of the arts. Different from ViT that typically yields low resolution outputs and +incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high +output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the +computations of large feature maps. 
PVT inherits the advantages of both CNN and Transformer, making it a unified +backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including -object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet -achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope +object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet +achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.* This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT). - - PVTv1 on ImageNet-1K | **Model variant** |**Size** |**Acc@1**|**Params (M)**| @@ -55,7 +54,6 @@ This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The origi | PVT-Medium | 224 | 81.2 | 44.2 | | PVT-Large | 224 | 81.7 | 61.4 | - ## PvtConfig [[autodoc]] PvtConfig diff --git a/docs/source/en/model_doc/pvt_v2.md b/docs/source/en/model_doc/pvt_v2.md index 0d0ee3cca75..5be8998f4cc 100644 --- a/docs/source/en/model_doc/pvt_v2.md +++ b/docs/source/en/model_doc/pvt_v2.md @@ -26,7 +26,7 @@ The PVTv2 encoder structure has been successfully deployed to achieve state-of-t PVTv2 belongs to a family of models called [hierarchical transformers](https://natecibik.medium.com/the-rise-of-vision-transformers-f623c980419f) , which make adaptations to transformer layers in order to generate multi-scale feature maps. Unlike the columnal structure of Vision Transformer ([ViT](https://huggingface.co/papers/2010.11929)) which loses fine-grained detail, multi-scale feature maps are known preserve this detail and aid performance in dense prediction tasks. In the case of PVTv2, this is achieved by generating image patch tokens using 2D convolution with overlapping kernels in each encoder layer. -The multi-scale features of hierarchical transformers allow them to be easily swapped in for traditional workhorse computer vision backbone models like ResNet in larger architectures. Both Segformer and Panoptic Segformer demonstrated that configurations using PVTv2 for a backbone consistently outperformed those with similarly sized ResNet backbones. +The multi-scale features of hierarchical transformers allow them to be easily swapped in for traditional workhorse computer vision backbone models like ResNet in larger architectures. Both Segformer and Panoptic Segformer demonstrated that configurations using PVTv2 for a backbone consistently outperformed those with similarly sized ResNet backbones. Another powerful feature of the PVTv2 is the complexity reduction in the self-attention layers called Spatial Reduction Attention (SRA), which uses 2D convolution layers to project hidden states to a smaller resolution before attending to them with the queries, improving the $O(n^2)$ complexity of self-attention to $O(n^2/R)$, with $R$ being the spatial reduction ratio (`sr_ratio`, aka kernel size and stride in the 2D convolution). 
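As a small illustration of how that spatial reduction ratio is exposed, the sketch below builds a randomly initialized PVTv2 backbone from a config whose per-stage `sr_ratios` mirror the published defaults. The values are shown for illustration only; check `PvtV2Config` for the full set of options.

```python
# Illustration only: a randomly initialized PVTv2 with per-stage spatial reduction ratios.
import torch
from transformers import PvtV2Config, PvtV2Model

# Stronger key/value reduction in the early, high-resolution stages.
config = PvtV2Config(sr_ratios=[8, 4, 2, 1])
model = PvtV2Model(config)

pixel_values = torch.randn(1, 3, 224, 224)
outputs = model(pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # feature map from the final stage
```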
@@ -48,6 +48,7 @@ This model was contributed by [FoamoftheSea](https://huggingface.co/FoamoftheSea - ImageNet pretrained weights for all model sizes can be found on the [hub](https://huggingface.co/models?other=pvt_v2). The best way to get started with the PVTv2 is to load the pretrained checkpoint with the size of your choosing using `AutoModelForImageClassification`: + ```python import requests import torch @@ -99,7 +100,6 @@ outputs = model(torch.tensor(processed["pixel_values"])) | PVT-V2-B4 | 224 | 83.6 | 62.6 | | PVT-V2-B5 | 224 | 83.8 | 82.0 | - ## PvtV2Config [[autodoc]] PvtV2Config diff --git a/docs/source/en/model_doc/qwen2.md b/docs/source/en/model_doc/qwen2.md index 3f872302cc2..feeb69959b2 100644 --- a/docs/source/en/model_doc/qwen2.md +++ b/docs/source/en/model_doc/qwen2.md @@ -142,7 +142,6 @@ outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` - ## Notes - Ensure your Transformers library version is up-to-date. Qwen2 requires Transformers>=4.37.0 for full support. diff --git a/docs/source/en/model_doc/qwen2_5_omni.md b/docs/source/en/model_doc/qwen2_5_omni.md index e124f7cdb42..7a0836592d4 100644 --- a/docs/source/en/model_doc/qwen2_5_omni.md +++ b/docs/source/en/model_doc/qwen2_5_omni.md @@ -31,8 +31,6 @@ The abstract from the technical report is the following: *We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. 
As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.* - - ## Notes - Use [`Qwen2_5OmniForConditionalGeneration`] to generate audio and text output. To generate only one output type, use [`Qwen2_5OmniThinkerForConditionalGeneration`] for text-only and [`Qwen2_5OmniTalkersForConditionalGeneration`] for audio-only outputs. @@ -40,7 +38,6 @@ The abstract from the technical report is the following: - In case out out-of-memory errors hwen working with video input, decrease `processor.max_pixels`. By default the maximum is set to a very arge value and high resolution visuals will not be resized, unless resolution exceeds `processor.max_pixels`. - The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs. - ## Usage example `Qwen2.5-Omni` can be found on the [Huggingface Hub](https://huggingface.co/Qwen). @@ -275,6 +272,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min #### Prompt for audio output If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected. + ``` { "role": "system", @@ -285,6 +283,7 @@ If users need audio output, the system prompt must be set as "You are Qwen, a vi #### Use audio output or not The model supports both text and audio outputs, if users do not need audio outputs, they can set `enable_audio_output` in the `from_pretrained` function. This option will save about `~2GB` of GPU memory but the `return_audio` option for `generate` function will only allow to be set at `False`. + ```python model = Qwen2_5OmniForConditionalGeneration.from_pretrained( "Qwen/Qwen2.5-Omni-7B", @@ -341,8 +340,6 @@ model = Qwen2_5OmniForConditionalGeneration.from_pretrained( ) ``` - - ## Qwen2_5OmniConfig [[autodoc]] Qwen2_5OmniConfig diff --git a/docs/source/en/model_doc/qwen2_5_vl.md b/docs/source/en/model_doc/qwen2_5_vl.md index 62527ea4963..7f682bf8020 100644 --- a/docs/source/en/model_doc/qwen2_5_vl.md +++ b/docs/source/en/model_doc/qwen2_5_vl.md @@ -26,7 +26,6 @@ rendered properly in your Markdown viewer. [Qwen2.5-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model, available in 3B, 7B, and 72B parameters, pretrained on 4.1T tokens. The model introduces window attention in the ViT encoder to accelerate training and inference, dynamic FPS sampling on the spatial and temporal dimensions for better video understanding across different sampling rates, and an upgraded MRoPE (multi-resolutional rotary positional encoding) mechanism to better capture and learn temporal dynamics. - You can find all the original Qwen2.5-VL checkpoints under the [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) collection. > [!TIP] @@ -61,6 +60,7 @@ messages = [ pipe(text=messages,max_new_tokens=20, return_full_text=False) ``` + @@ -110,6 +110,7 @@ output_text = processor.batch_decode( ) print(output_text) ``` + @@ -130,9 +131,11 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained( ) ``` + ### Notes - Use Qwen2.5-VL for video inputs by setting `"type": "video"` as shown below. 
+ ```python conversation = [ { @@ -159,8 +162,10 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained( output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True) print(output_text) ``` + - Use Qwen2.5-VL for a mixed batch of inputs (images, videos, text). Add labels when handling multiple images or videos for better reference as show below. + ```python import torch from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor @@ -221,14 +226,15 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained( max_pixels = 2048*2048 processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels) ``` - + Higher resolution can require more compute whereas reducing the resolution can save memory as follows: - + ```python min_pixels = 256*28*28 max_pixels = 1024*28*28 processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels) ``` + ## Qwen2_5_VLConfig [[autodoc]] Qwen2_5_VLConfig diff --git a/docs/source/en/model_doc/qwen2_audio.md b/docs/source/en/model_doc/qwen2_audio.md index 7cdcd52119c..9b9dd43a919 100644 --- a/docs/source/en/model_doc/qwen2_audio.md +++ b/docs/source/en/model_doc/qwen2_audio.md @@ -36,7 +36,6 @@ The abstract from the paper is the following: *We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users could provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community. 
* - ## Usage tips `Qwen2-Audio-7B` and `Qwen2-Audio-7B-Instruct` can be found on the [Huggingface Hub](https://huggingface.co/Qwen) @@ -79,6 +78,7 @@ In the following, we demonstrate how to use `Qwen2-Audio-7B-Instruct` for the in ### Voice Chat Inference In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input: + ```python from io import BytesIO from urllib.request import urlopen @@ -119,6 +119,7 @@ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_ ### Audio Analysis Inference In the audio analysis, users could provide both audio and text instructions for analysis: + ```python from io import BytesIO from urllib.request import urlopen @@ -167,6 +168,7 @@ response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_ ### Batch Inference We also support batch inference: + ```python from io import BytesIO from urllib.request import urlopen diff --git a/docs/source/en/model_doc/qwen2_moe.md b/docs/source/en/model_doc/qwen2_moe.md index b8a3fe65d31..9d55de63e16 100644 --- a/docs/source/en/model_doc/qwen2_moe.md +++ b/docs/source/en/model_doc/qwen2_moe.md @@ -24,7 +24,6 @@ rendered properly in your Markdown viewer. # Qwen2MoE - [Qwen2MoE](https://huggingface.co/papers/2407.10671) is a Mixture-of-Experts (MoE) variant of [Qwen2](./qwen2), available as a base model and an aligned chat model. It uses SwiGLU activation, group query attention and a mixture of sliding window attention and full attention. The tokenizer can also be adapted to multiple languages and codes. The MoE architecture uses upcyled models from the dense language models. For example, Qwen1.5-MoE-A2.7B is upcycled from Qwen-1.8B. It has 14.3B parameters but only 2.7B parameters are activated during runtime. @@ -57,6 +56,7 @@ messages = [ outputs = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95) print(outputs[0]["generated_text"][-1]['content']) ``` + @@ -100,14 +100,14 @@ generated_ids = [ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] print(response) ``` - + + ```bash transformers chat Qwen/Qwen1.5-MoE-A2.7B-Chat --dtype auto --attn_implementation flash_attention_2 ``` - - + Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. diff --git a/docs/source/en/model_doc/qwen2_vl.md b/docs/source/en/model_doc/qwen2_vl.md index 8ff09ca5723..59dc25b5e08 100644 --- a/docs/source/en/model_doc/qwen2_vl.md +++ b/docs/source/en/model_doc/qwen2_vl.md @@ -25,7 +25,7 @@ rendered properly in your Markdown viewer. ## Overview -The [Qwen2-VL](https://huggingface.co/papers/2409.12191) ([blog post](https://qwenlm.github.io/blog/qwen2-vl/)) model is a major update to [Qwen-VL](https://huggingface.co/papers/2308.12966) from the Qwen team at Alibaba Research. +The [Qwen2-VL](https://huggingface.co/papers/2409.12191) ([blog post](https://qwenlm.github.io/blog/qwen2-vl/)) model is a major update to [Qwen-VL](https://huggingface.co/papers/2308.12966) from the Qwen team at Alibaba Research. The abstract from the blog is the following: @@ -203,8 +203,8 @@ min_pixels = 256*28*28 max_pixels = 1024*28*28 processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels) ``` -This ensures each image gets encoded using a number between 256-1024 tokens. 
The 28 comes from the fact that the model uses a patch size of 14 and a temporal patch size of 2 (14 x 2 = 28). +This ensures each image gets encoded using a number between 256-1024 tokens. The 28 comes from the fact that the model uses a patch size of 14 and a temporal patch size of 2 (14 x 2 = 28). #### Multiple Image Inputs @@ -307,7 +307,7 @@ model = Qwen2VLForConditionalGeneration.from_pretrained( [[autodoc]] Qwen2VLTextModel - forward - + ## Qwen2VLModel [[autodoc]] Qwen2VLModel diff --git a/docs/source/en/model_doc/qwen3.md b/docs/source/en/model_doc/qwen3.md index 87e6ba500f9..0141388fb97 100644 --- a/docs/source/en/model_doc/qwen3.md +++ b/docs/source/en/model_doc/qwen3.md @@ -25,7 +25,6 @@ rendered properly in your Markdown viewer. To be released with the official model launch. - ## Usage tips To be released with the official model launch. diff --git a/docs/source/en/model_doc/qwen3_omni_moe.md b/docs/source/en/model_doc/qwen3_omni_moe.md index 04d77534f64..cd5506802d5 100644 --- a/docs/source/en/model_doc/qwen3_omni_moe.md +++ b/docs/source/en/model_doc/qwen3_omni_moe.md @@ -31,8 +31,6 @@ The abstract from the technical report is the following: *We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model. Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.* - - ## Notes - Use [`Qwen2_5OmniForConditionalGeneration`] to generate audio and text output. 
To generate only one output type, use [`Qwen2_5OmniThinkerForConditionalGeneration`] for text-only and [`Qwen2_5OmniTalkersForConditionalGeneration`] for audio-only outputs. @@ -40,7 +38,6 @@ The abstract from the technical report is the following: - In case out out-of-memory errors hwen working with video input, decrease `processor.max_pixels`. By default the maximum is set to a very arge value and high resolution visuals will not be resized, unless resolution exceeds `processor.max_pixels`. - The processor has its own [`~ProcessorMixin.apply_chat_template`] method to convert chat messages to model inputs. - ## Usage example `Qwen2.5-Omni` can be found on the [Huggingface Hub](https://huggingface.co/Qwen). @@ -275,6 +272,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min #### Prompt for audio output If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected. + ``` { "role": "system", @@ -285,6 +283,7 @@ If users need audio output, the system prompt must be set as "You are Qwen, a vi #### Use audio output or not The model supports both text and audio outputs, if users do not need audio outputs, they can set `enable_audio_output` in the `from_pretrained` function. This option will save about `~2GB` of GPU memory but the `return_audio` option for `generate` function will only allow to be set at `False`. + ```python model = Qwen2_5OmniForConditionalGeneration.from_pretrained( "Qwen/Qwen2.5-Omni-7B", @@ -341,8 +340,6 @@ model = Qwen2_5OmniForConditionalGeneration.from_pretrained( ) ``` - - ## Qwen3OmniMoeConfig [[autodoc]] Qwen3OmniMoeConfig @@ -410,5 +407,3 @@ model = Qwen2_5OmniForConditionalGeneration.from_pretrained( ## Qwen3OmniMoeTalkerCodePredictorModelForConditionalGeneration [[autodoc]] Qwen3OmniMoeTalkerCodePredictorModelForConditionalGeneration - - diff --git a/docs/source/en/model_doc/qwen3_vl.md b/docs/source/en/model_doc/qwen3_vl.md index c939d5da3cd..dc9ecafeb44 100644 --- a/docs/source/en/model_doc/qwen3_vl.md +++ b/docs/source/en/model_doc/qwen3_vl.md @@ -77,6 +77,7 @@ output_text = processor.batch_decode( ) print(output_text) ``` + diff --git a/docs/source/en/model_doc/qwen3_vl_moe.md b/docs/source/en/model_doc/qwen3_vl_moe.md index 6e27adf915d..e36336d90a4 100644 --- a/docs/source/en/model_doc/qwen3_vl_moe.md +++ b/docs/source/en/model_doc/qwen3_vl_moe.md @@ -77,6 +77,7 @@ output_text = processor.batch_decode( ) print(output_text) ``` + diff --git a/docs/source/en/model_doc/recurrent_gemma.md b/docs/source/en/model_doc/recurrent_gemma.md index 1cd4e784a5b..2d7c940e00a 100644 --- a/docs/source/en/model_doc/recurrent_gemma.md +++ b/docs/source/en/model_doc/recurrent_gemma.md @@ -31,16 +31,14 @@ The abstract from the paper is the following: Tips: -- The original checkpoints can be converted using the conversion script [`src/transformers/models/recurrent_gemma/convert_recurrent_gemma_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/recurrent_gemma/convert_recurrent_gemma_to_hf.py). +- The original checkpoints can be converted using the conversion script [`src/transformers/models/recurrent_gemma/convert_recurrent_gemma_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/recurrent_gemma/convert_recurrent_gemma_to_hf.py). 
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/google-deepmind/recurrentgemma). - ## RecurrentGemmaConfig [[autodoc]] RecurrentGemmaConfig - ## RecurrentGemmaModel [[autodoc]] RecurrentGemmaModel @@ -50,4 +48,3 @@ This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). T [[autodoc]] RecurrentGemmaForCausalLM - forward - diff --git a/docs/source/en/model_doc/reformer.md b/docs/source/en/model_doc/reformer.md index f94134609d2..c48de93d47d 100644 --- a/docs/source/en/model_doc/reformer.md +++ b/docs/source/en/model_doc/reformer.md @@ -89,7 +89,6 @@ equal to `config.hidden_size` and `config.axial_pos_shape` is set to a tuple \\( product has to be equal to `config.max_embedding_size`, which during training has to be equal to the *sequence length* of the `input_ids`. - ### LSH Self Attention In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key @@ -122,7 +121,6 @@ Using LSH self attention, the memory and time complexity of the query-key matmul \\(\mathcal{O}(n_s \times n_s)\\) to \\(\mathcal{O}(n_s \times \log(n_s))\\), which usually represents the memory and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length. - ### Local Self Attention Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is @@ -134,7 +132,6 @@ Using Local self attention, the memory and time complexity of the query-key matm \\(\mathcal{O}(n_s \times n_s)\\) to \\(\mathcal{O}(n_s \times \log(n_s))\\), which usually represents the memory and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length. - ### Training During training, we must ensure that the sequence length is set to a value that can be divided by the least common diff --git a/docs/source/en/model_doc/retribert.md b/docs/source/en/model_doc/retribert.md index 871bdc6e8c8..829fed24215 100644 --- a/docs/source/en/model_doc/retribert.md +++ b/docs/source/en/model_doc/retribert.md @@ -39,7 +39,6 @@ pair of BERT encoders with lower-dimension projection for dense semantic indexin This model was contributed by [yjernite](https://huggingface.co/yjernite). Code to train and use the model can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research-projects/distillation). - ## RetriBertConfig [[autodoc]] RetriBertConfig diff --git a/docs/source/en/model_doc/roberta.md b/docs/source/en/model_doc/roberta.md index 580ff09e72c..896156520c5 100644 --- a/docs/source/en/model_doc/roberta.md +++ b/docs/source/en/model_doc/roberta.md @@ -28,7 +28,6 @@ rendered properly in your Markdown viewer. You can find all the original RoBERTa checkpoints under the [Facebook AI](https://huggingface.co/FacebookAI) organization. - > [!TIP] > Click on the RoBERTa models in the right sidebar for more examples of how to apply RoBERTa to different language tasks. diff --git a/docs/source/en/model_doc/rt_detr.md b/docs/source/en/model_doc/rt_detr.md index 02accfd6d9f..d4c85f63fc3 100644 --- a/docs/source/en/model_doc/rt_detr.md +++ b/docs/source/en/model_doc/rt_detr.md @@ -23,7 +23,6 @@ rendered properly in your Markdown viewer. ## Overview - The RT-DETR model was proposed in [DETRs Beat YOLOs on Real-time Object Detection](https://huggingface.co/papers/2304.08069) by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu. 
RT-DETR is an object detection model that stands for "Real-Time DEtection Transformer." This model is designed to perform object detection tasks with a focus on achieving real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them. @@ -39,7 +38,6 @@ alt="drawing" width="600"/> The model version was contributed by [rafaelpadilla](https://huggingface.co/rafaelpadilla) and [sangbumchoi](https://github.com/SangbumChoi). The original code can be found [here](https://github.com/lyuwenyu/RT-DETR/). - ## Usage tips Initially, an image is processed using a pre-trained convolutional neural network, specifically a Resnet-D variant as referenced in the original code. This network extracts features from the final three layers of the architecture. Following this, a hybrid encoder is employed to convert the multi-scale features into a sequential array of image features. Then, a decoder, equipped with auxiliary prediction heads is used to refine the object queries. This process facilitates the direct generation of bounding boxes, eliminating the need for any additional post-processing to acquire the logits and coordinates for the bounding boxes. diff --git a/docs/source/en/model_doc/rt_detr_v2.md b/docs/source/en/model_doc/rt_detr_v2.md index f5eb54625c8..3f814ce0d64 100644 --- a/docs/source/en/model_doc/rt_detr_v2.md +++ b/docs/source/en/model_doc/rt_detr_v2.md @@ -34,9 +34,9 @@ The abstract from the paper is the following: This model was contributed by [jadechoghari](https://huggingface.co/jadechoghari). The original code can be found [here](https://github.com/lyuwenyu/RT-DETR). -## Usage tips +## Usage tips -This second version of RT-DETR improves how the decoder finds objects in an image. +This second version of RT-DETR improves how the decoder finds objects in an image. - **better sampling** – adjusts offsets so the model looks at the right areas - **flexible attention** – can use smooth (bilinear) or fixed (discrete) sampling @@ -85,17 +85,15 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - See also: [Object detection task guide](../tasks/object_detection). - Notebooks for [inference](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_inference.ipynb) and [fine-tuning](https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_finetune_on_a_custom_dataset.ipynb) RT-DETRv2 on a custom dataset (🌎). 
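To complement the notebooks listed above, here is a minimal inference sketch under stated assumptions: the checkpoint id `PekingU/rtdetr_v2_r18vd` is assumed and may differ from the one you want to use; the `Auto` classes resolve to the RT-DETRv2 model and image processor.

```python
# Hedged RT-DETRv2 inference sketch (checkpoint id is an assumption).
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

checkpoint = "PekingU/rtdetr_v2_r18vd"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForObjectDetection.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into labelled detections above a confidence threshold.
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```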
- ## RTDetrV2Config [[autodoc]] RTDetrV2Config - ## RTDetrV2Model [[autodoc]] RTDetrV2Model - forward - + ## RTDetrV2ForObjectDetection [[autodoc]] RTDetrV2ForObjectDetection diff --git a/docs/source/en/model_doc/rwkv.md b/docs/source/en/model_doc/rwkv.md index 4d9d6bbb886..c0bd1273f61 100644 --- a/docs/source/en/model_doc/rwkv.md +++ b/docs/source/en/model_doc/rwkv.md @@ -58,7 +58,7 @@ torch.allclose(torch.cat([output_one, output_two], dim=1), output_whole, atol=1e If you want to make sure the model stops generating when `'\n\n'` is detected, we recommend using the following stopping criteria: -```python +```python from transformers import StoppingCriteria class RwkvStoppingCriteria(StoppingCriteria): diff --git a/docs/source/en/model_doc/sam.md b/docs/source/en/model_doc/sam.md index 49a58254630..65286eb8428 100644 --- a/docs/source/en/model_doc/sam.md +++ b/docs/source/en/model_doc/sam.md @@ -41,7 +41,6 @@ Tips: - Fine-tuning the model is not supported yet - According to the paper, textual input should be also supported. However, at this time of writing this seems not to be supported according to [the official repository](https://github.com/facebookresearch/segment-anything/issues/4#issuecomment-1497626844). - This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/facebookresearch/segment-anything). @@ -98,6 +97,7 @@ masks = processor.image_processor.post_process_masks( ) scores = outputs.iou_scores ``` + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SAM. diff --git a/docs/source/en/model_doc/sam_hq.md b/docs/source/en/model_doc/sam_hq.md index 2bd14229c37..9dea1de7a77 100644 --- a/docs/source/en/model_doc/sam_hq.md +++ b/docs/source/en/model_doc/sam_hq.md @@ -25,7 +25,6 @@ The model is an enhancement to the original SAM model that produces significantl ![example image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-output.png) - SAM-HQ introduces several key improvements over the original SAM model: 1. High-Quality Output Token: A learnable token injected into SAM's mask decoder for higher quality mask prediction @@ -105,7 +104,6 @@ masks = processor.image_processor.post_process_masks( scores = outputs.iou_scores ``` - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SAM-HQ: @@ -137,7 +135,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] SamHQVisionModel - ## SamHQModel [[autodoc]] SamHQModel diff --git a/docs/source/en/model_doc/seamless_m4t.md b/docs/source/en/model_doc/seamless_m4t.md index c6f3a56f9ba..e7fc00d047c 100644 --- a/docs/source/en/model_doc/seamless_m4t.md +++ b/docs/source/en/model_doc/seamless_m4t.md @@ -67,7 +67,6 @@ Here is how to use the processor to process text and audio: >>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt") ``` - ### Speech [`SeamlessM4TModel`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation: @@ -84,7 +83,7 @@ With basically the same code, I've translated English text and Arabic speech to Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`]. 
This time, let's translate to French. -```python +```python >>> # from audio >>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False) >>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True) @@ -96,11 +95,10 @@ This time, let's translate to French. ### Tips - #### 1. Use dedicated models [`SeamlessM4TModel`] is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. -For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code: +For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code: ```python >>> from transformers import SeamlessM4TForSpeechToSpeech @@ -130,7 +128,6 @@ Use `return_intermediate_token_ids=True` with [`SeamlessM4TModel`] to return bot ## Model architecture - SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text. Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://huggingface.co/papers/2010.05646) architecture is placed on top of the second seq2seq model. @@ -142,7 +139,6 @@ Here's how the generation process works: - If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens. - These unit tokens are then passed through the final vocoder to produce the actual speech. - This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication). ## SeamlessM4TModel @@ -150,19 +146,16 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o [[autodoc]] SeamlessM4TModel - generate - ## SeamlessM4TForTextToSpeech [[autodoc]] SeamlessM4TForTextToSpeech - generate - ## SeamlessM4TForSpeechToSpeech [[autodoc]] SeamlessM4TForSpeechToSpeech - generate - ## SeamlessM4TForTextToText [[autodoc]] transformers.SeamlessM4TForTextToText @@ -179,7 +172,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o [[autodoc]] SeamlessM4TConfig - ## SeamlessM4TTokenizer [[autodoc]] SeamlessM4TTokenizer @@ -189,7 +181,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o - create_token_type_ids_from_sequences - save_vocabulary - ## SeamlessM4TTokenizerFast [[autodoc]] SeamlessM4TTokenizerFast @@ -209,7 +200,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o [[autodoc]] SeamlessM4TCodeHifiGan - ## SeamlessM4THifiGan [[autodoc]] SeamlessM4THifiGan @@ -221,5 +211,3 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). 
The o ## SeamlessM4TTextToUnitForConditionalGeneration [[autodoc]] SeamlessM4TTextToUnitForConditionalGeneration - - diff --git a/docs/source/en/model_doc/seamless_m4t_v2.md b/docs/source/en/model_doc/seamless_m4t_v2.md index 8a4ab82d2e9..716718072a4 100644 --- a/docs/source/en/model_doc/seamless_m4t_v2.md +++ b/docs/source/en/model_doc/seamless_m4t_v2.md @@ -67,7 +67,6 @@ Here is how to use the processor to process text and audio: >>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt") ``` - ### Speech [`SeamlessM4Tv2Model`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation: @@ -84,7 +83,7 @@ With basically the same code, I've translated English text and Arabic speech to Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4Tv2Model.generate`]. This time, let's translate to French. -```python +```python >>> # from audio >>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False) >>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True) @@ -96,11 +95,10 @@ This time, let's translate to French. ### Tips - #### 1. Use dedicated models [`SeamlessM4Tv2Model`] is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. -For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code: +For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code: ```python >>> from transformers import SeamlessM4Tv2ForSpeechToSpeech @@ -161,7 +159,6 @@ Here's how the generation process works: - If speech generation is required, the second seq2seq model, generates unit tokens in an non auto-regressive way. - These unit tokens are then passed through the final vocoder to produce the actual speech. - This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication). ## SeamlessM4Tv2Model @@ -169,19 +166,16 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o [[autodoc]] SeamlessM4Tv2Model - generate - ## SeamlessM4Tv2ForTextToSpeech [[autodoc]] SeamlessM4Tv2ForTextToSpeech - generate - ## SeamlessM4Tv2ForSpeechToSpeech [[autodoc]] SeamlessM4Tv2ForSpeechToSpeech - generate - ## SeamlessM4Tv2ForTextToText [[autodoc]] transformers.SeamlessM4Tv2ForTextToText diff --git a/docs/source/en/model_doc/segformer.md b/docs/source/en/model_doc/segformer.md index 756c98d45f0..a6b407e5879 100644 --- a/docs/source/en/model_doc/segformer.md +++ b/docs/source/en/model_doc/segformer.md @@ -71,8 +71,6 @@ logits = outputs.logits # shape [batch, num_labels, height, width] - - ## Notes - SegFormer works with **any input size**, padding inputs to be divisible by `config.patch_sizes`. 
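Following on from the SegFormer logits snippet and notes above, a short, hedged sketch of turning the logits into a per-pixel label map with `post_process_semantic_segmentation`; the ADE20k checkpoint id is only an example.

```python
# Sketch: SegFormer semantic segmentation with post-processing back to the
# input resolution (checkpoint id is an example; other checkpoints work the same way).
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The raw logits are downsampled; post-processing upsamples them to the original
# image size and takes the argmax to produce a (height, width) label map.
label_map = processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
print(label_map.shape, model.config.id2label[int(label_map[0, 0])])
```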
diff --git a/docs/source/en/model_doc/seggpt.md b/docs/source/en/model_doc/seggpt.md index 9e8c08cf2d2..a5568d5c80e 100644 --- a/docs/source/en/model_doc/seggpt.md +++ b/docs/source/en/model_doc/seggpt.md @@ -74,7 +74,6 @@ mask = image_processor.post_process_semantic_segmentation(outputs, target_sizes, This model was contributed by [EduardoPacheco](https://huggingface.co/EduardoPacheco). The original code can be found [here]([(https://github.com/baaivision/Painter/tree/main)). - ## SegGptConfig [[autodoc]] SegGptConfig diff --git a/docs/source/en/model_doc/shieldgemma2.md b/docs/source/en/model_doc/shieldgemma2.md index 99ffde6288f..871cdd31db7 100644 --- a/docs/source/en/model_doc/shieldgemma2.md +++ b/docs/source/en/model_doc/shieldgemma2.md @@ -86,7 +86,6 @@ output = model(**inputs) print(output.probabilities) ``` - ## ShieldGemma2Processor [[autodoc]] ShieldGemma2Processor diff --git a/docs/source/en/model_doc/siglip.md b/docs/source/en/model_doc/siglip.md index c0eb9a8ac6b..bf9c0a46034 100644 --- a/docs/source/en/model_doc/siglip.md +++ b/docs/source/en/model_doc/siglip.md @@ -31,7 +31,6 @@ Unlike CLIP, SigLIP employs a pairwise sigmoid loss on image-text pairs during t You can find all the original SigLIP checkpoints under the [SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) collection. - > [!TIP] > Click on the SigLIP models in the right sidebar for more examples of how to apply SigLIP to different image and text tasks. @@ -107,12 +106,14 @@ logits_per_image = outputs.logits_per_image probs = torch.sigmoid(logits_per_image) print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") ``` + ## Notes - Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size. - When using the standalone [`SiglipTokenizer`] or [`SiglipProcessor`], make sure to pass `padding="max_length"` because that is how the model was trained. - To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor. - Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention. + ```py # pip install -U flash-attn --no-build-isolation @@ -126,7 +127,6 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") ) ``` - ## SiglipConfig [[autodoc]] SiglipConfig @@ -179,7 +179,6 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") [[autodoc]] SiglipVisionModel - forward - ## SiglipForImageClassification [[autodoc]] SiglipForImageClassification diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index f2684c6defc..6a058f8907a 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -32,7 +32,6 @@ rendered properly in your Markdown viewer. - NaFlex supports different resolutions and maintains the native image aspect ratio - FixRes supports fixed resolutions and is backwards compatible with [SigLIP](./siglip) - You can find all the original SigLIP2 checkpoints under the [SigLIP2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) collection. > [!TIP] @@ -157,6 +156,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") NaFlex resizes the input image so the height and width are multiples of the patch size after resizing. 
It keeps the aspect ratio distortion as low as possible and produces a sequence length of at most the desired target sequence length (`max_num_patches`). After resizing, the image is split into a sequence of patches and a mask with padding information is added. - Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention. + ```py # pip install -U flash-attn --no-build-isolation @@ -169,6 +169,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") device_map=device, ) ``` + ## Siglip2Config [[autodoc]] Siglip2Config diff --git a/docs/source/en/model_doc/smollm3.md b/docs/source/en/model_doc/smollm3.md index da98a15e33b..db2ddd33601 100644 --- a/docs/source/en/model_doc/smollm3.md +++ b/docs/source/en/model_doc/smollm3.md @@ -139,7 +139,6 @@ outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` - ## Notes - Ensure your Transformers library version is up-to-date. SmolLM3 requires Transformers>=4.53.0 for full support. diff --git a/docs/source/en/model_doc/smolvlm.md b/docs/source/en/model_doc/smolvlm.md index c9a886ac876..5f74fa60ba0 100644 --- a/docs/source/en/model_doc/smolvlm.md +++ b/docs/source/en/model_doc/smolvlm.md @@ -39,6 +39,7 @@ If `do_resize` is set to `True`, the model resizes images so that the longest ed The default resizing behavior can be customized by passing a dictionary to the `size` parameter. For example, `{"longest_edge": 4 * 512}` is the default, but you can change it to a different value if needed. Here’s how to control resizing and set a custom size: + ```python image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512) ``` @@ -47,8 +48,6 @@ Additionally, the `max_image_size` parameter, which controls the size of each sq This model was contributed by [orrzohar](https://huggingface.co/orrzohar). - - ## Usage example ### Single Media inference diff --git a/docs/source/en/model_doc/stablelm.md b/docs/source/en/model_doc/stablelm.md index 29f32a0004e..e47598a8f85 100644 --- a/docs/source/en/model_doc/stablelm.md +++ b/docs/source/en/model_doc/stablelm.md @@ -92,7 +92,6 @@ Now, to run the model with Flash Attention 2, refer to the snippet below: ['The weather is always wonderful in Costa Rica, which makes it a prime destination for retirees. That’s where the Pensionado program comes in, offering'] ``` - ## StableLmConfig [[autodoc]] StableLmConfig diff --git a/docs/source/en/model_doc/starcoder2.md b/docs/source/en/model_doc/starcoder2.md index 2d27aed399c..b67e5dedd2c 100644 --- a/docs/source/en/model_doc/starcoder2.md +++ b/docs/source/en/model_doc/starcoder2.md @@ -34,7 +34,7 @@ The abstract of the paper is the following: ## License The models are licensed under the [BigCode OpenRAIL-M v1 license agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement). - + ## Usage tips The StarCoder2 models can be found in the [HuggingFace hub](https://huggingface.co/collections/bigcode/starcoder2-65de6da6e87db3383572be1a). You can find some examples for inference and fine-tuning in StarCoder2's [GitHub repo](https://github.com/bigcode-project/starcoder2). 
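For a quick start without leaving `transformers`, a minimal generation sketch is shown below; the `bigcode/starcoder2-3b` checkpoint id is assumed, and the 7B and 15B variants load the same way.

```python
# Minimal code-completion sketch with StarCoder2 (checkpoint id assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```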
diff --git a/docs/source/en/model_doc/superglue.md b/docs/source/en/model_doc/superglue.md index 81bb91861de..d25ca822e4c 100644 --- a/docs/source/en/model_doc/superglue.md +++ b/docs/source/en/model_doc/superglue.md @@ -153,4 +153,3 @@ processed_outputs = processor.post_process_keypoint_matching(outputs, image_size [[autodoc]] SuperGlueForKeypointMatching - forward - diff --git a/docs/source/en/model_doc/superpoint.md b/docs/source/en/model_doc/superpoint.md index b86f7fd4aa7..26ffb2c8b4b 100644 --- a/docs/source/en/model_doc/superpoint.md +++ b/docs/source/en/model_doc/superpoint.md @@ -33,8 +33,6 @@ You can find all the original SuperPoint checkpoints under the [Magic Leap Commu > > Click on the SuperPoint models in the right sidebar for more examples of how to apply SuperPoint to different computer vision tasks. - - The example below demonstrates how to detect interest points in an image with the [`AutoModel`] class. @@ -101,6 +99,7 @@ processed_outputs = processor.post_process_keypoint_detection(outputs, [image_si ``` - You can then print the keypoints on the image of your choice to visualize the result: + ```py import matplotlib.pyplot as plt plt.axis("off") diff --git a/docs/source/en/model_doc/swin.md b/docs/source/en/model_doc/swin.md index f6a994ef69b..81142f6c411 100644 --- a/docs/source/en/model_doc/swin.md +++ b/docs/source/en/model_doc/swin.md @@ -47,6 +47,7 @@ pipeline = pipeline( ) pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg") ``` + @@ -79,6 +80,7 @@ class_labels = model.config.id2label predicted_class_label = class_labels[predicted_class_id] print(f"The predicted class label is: {predicted_class_label}") ``` + diff --git a/docs/source/en/model_doc/swinv2.md b/docs/source/en/model_doc/swinv2.md index 507b79fc7cf..0dc008767ac 100644 --- a/docs/source/en/model_doc/swinv2.md +++ b/docs/source/en/model_doc/swinv2.md @@ -81,7 +81,7 @@ print(f"The predicted class label is: {predicted_class_label}") ## Notes -- Swin Transformer V2 can pad the inputs for any input height and width divisible by `32`. +- Swin Transformer V2 can pad the inputs for any input height and width divisible by `32`. - Swin Transformer V2 can be used as a [backbone](../backbones). When `output_hidden_states = True`, it outputs both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`. ## Swinv2Config diff --git a/docs/source/en/model_doc/switch_transformers.md b/docs/source/en/model_doc/switch_transformers.md index efa6bd499db..5eb27a9e7d8 100644 --- a/docs/source/en/model_doc/switch_transformers.md +++ b/docs/source/en/model_doc/switch_transformers.md @@ -27,7 +27,6 @@ rendered properly in your Markdown viewer. You can find all the original Switch Transformers checkpoints under the [Switch Transformer](https://huggingface.co/collections/google/switch-transformers-release-6548c35c6507968374b56d1f) collection. - > [!TIP] > This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ). 
> @@ -99,7 +98,6 @@ outputs = model.generate(input_ids) print(tokenizer.decode(outputs[0])) ``` - ## SwitchTransformersConfig [[autodoc]] SwitchTransformersConfig diff --git a/docs/source/en/model_doc/t5gemma.md b/docs/source/en/model_doc/t5gemma.md index aa8d3b7880e..00dde7ab93a 100644 --- a/docs/source/en/model_doc/t5gemma.md +++ b/docs/source/en/model_doc/t5gemma.md @@ -39,7 +39,6 @@ The example below demonstrates how to chat with the model with [`Pipeline`] or t - ```python import torch from transformers import pipeline @@ -89,6 +88,7 @@ print(tokenizer.decode(outputs[0])) ``` echo -e "Write me a poem about Machine Learning. Answer:" | transformers run --task text2text-generation --model google/t5gemma-2b-2b-prefixlm --device 0 ``` + diff --git a/docs/source/en/model_doc/t5v1.1.md b/docs/source/en/model_doc/t5v1.1.md index 4ad072addcc..62787d5f9d6 100644 --- a/docs/source/en/model_doc/t5v1.1.md +++ b/docs/source/en/model_doc/t5v1.1.md @@ -68,7 +68,6 @@ Google has released the following variants: - [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl). - Refer to [T5's documentation page](t5) for all API reference, tips, code examples and notebooks. diff --git a/docs/source/en/model_doc/table-transformer.md b/docs/source/en/model_doc/table-transformer.md index b35df2aec31..c982d305907 100644 --- a/docs/source/en/model_doc/table-transformer.md +++ b/docs/source/en/model_doc/table-transformer.md @@ -43,8 +43,8 @@ alt="drawing" width="600"/> Table detection and table structure recognition clarified. Taken from the original paper. -The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in -documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) +The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in +documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) (the task of recognizing the individual rows, columns etc. in a table). This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be diff --git a/docs/source/en/model_doc/tapas.md b/docs/source/en/model_doc/tapas.md index 4dfac5edce3..c5144121df6 100644 --- a/docs/source/en/model_doc/tapas.md +++ b/docs/source/en/model_doc/tapas.md @@ -76,7 +76,6 @@ To summarize: | Weak supervision for aggregation | WTQ | Questions might involve aggregation, and the model must learn this given only the answer as supervision | | Strong supervision for aggregation | WikiSQL-supervised | Questions might involve aggregation, and the model must learn this given the gold aggregation operator | - Initializing a model with a pre-trained base and randomly initialized classification heads from the hub can be done as shown below. ```py @@ -105,7 +104,6 @@ Of course, you don't necessarily have to follow one of these three ways in which >>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config) ``` - What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See [here](https://github.com/google-research/tapas/issues/91#issuecomment-735719340) for more info. 
For a list of all pre-trained and fine-tuned TAPAS checkpoints available on HuggingFace's hub, see [here](https://huggingface.co/models?search=tapas). @@ -128,7 +126,6 @@ The tables themselves should be present in a folder, each table being a separate **STEP 3: Convert your data into tensors using TapasTokenizer** - Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then use [`TapasTokenizer`] to convert table-question pairs into `input_ids`, `attention_mask`, `token_type_ids` and so on. Again, based on which of the three cases you picked above, [`TapasForQuestionAnswering`] requires different inputs to be fine-tuned: @@ -214,13 +211,11 @@ Of course, this only shows how to encode a single training example. It is advise >>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32) ``` - Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. In case your dataset involves conversational questions (such as in SQA), then you should first group together the `queries`, `answer_coordinates` and `answer_text` per table (in the order of their `position` index) and batch encode each table with its questions. This will make sure that the `prev_labels` token types (see docs of [`TapasTokenizer`]) are set correctly. See [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) for more info. **STEP 4: Train (fine-tune) the model - You can then fine-tune [`TapasForQuestionAnswering`] as follows (shown here for the weak supervision for aggregation case): ```py @@ -272,10 +267,8 @@ You can then fine-tune [`TapasForQuestionAnswering`] as follows (shown here for ... optimizer.step() ``` - ## Usage: inference - Here we explain how you can use [`TapasForQuestionAnswering`] for inference (i.e. making predictions on new data). For inference, only `input_ids`, `attention_mask` and `token_type_ids` (which you can obtain using [`TapasTokenizer`]) have to be provided to the model to obtain the logits. Next, you can use the handy [`~models.tapas.tokenization_tapas.convert_logits_to_predictions`] method to convert these into predicted coordinates and optional aggregation indices. However, note that inference is **different** depending on whether or not the setup is conversational. In a non-conversational set-up, inference can be done in parallel on all table-question pairs of a batch. Here's an example of that: @@ -333,7 +326,6 @@ What is the total number of movies? Predicted answer: SUM > 87, 53, 69 ``` - In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such that the `prev_labels` token types can be overwritten by the predicted `labels` of the previous table-question pair. Again, more info can be found in [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb). ## Resources diff --git a/docs/source/en/model_doc/textnet.md b/docs/source/en/model_doc/textnet.md index 9c29a8b16be..c986b17dbff 100644 --- a/docs/source/en/model_doc/textnet.md +++ b/docs/source/en/model_doc/textnet.md @@ -34,7 +34,7 @@ This model was contributed by [Raghavan](https://huggingface.co/Raghavan), [jade ## Usage tips -TextNet is mainly used as a backbone network for the architecture search of text detection. 
Each stage of the backbone network is comprised of a stride-2 convolution and searchable blocks. +TextNet is mainly used as a backbone network for the architecture search of text detection. Each stage of the backbone network is comprised of a stride-2 convolution and searchable blocks. Specifically, we present a layer-level candidate set, defined as {conv3×3, conv1×3, conv3×1, identity}. As the 1×3 and 3×1 convolutions have asymmetric kernels and oriented structure priors, they may help to capture the features of extreme aspect-ratio and rotated text lines. TextNet is the backbone for Fast, but can also be used as an efficient text/image classification, we add a `TextNetForImageClassification` as is it would allow people to train an image classifier on top of the pre-trained textnet weights @@ -62,4 +62,3 @@ TextNet is the backbone for Fast, but can also be used as an efficient text/imag [[autodoc]] TextNetForImageClassification - forward - diff --git a/docs/source/en/model_doc/time_series_transformer.md b/docs/source/en/model_doc/time_series_transformer.md index c38671f00fb..921b7e01d4b 100644 --- a/docs/source/en/model_doc/time_series_transformer.md +++ b/docs/source/en/model_doc/time_series_transformer.md @@ -61,7 +61,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - Check out the Time Series Transformer blog-post in HuggingFace blog: [Probabilistic Time Series Forecasting with 🤗 Transformers](https://huggingface.co/blog/time-series-transformers) - ## TimeSeriesTransformerConfig [[autodoc]] TimeSeriesTransformerConfig diff --git a/docs/source/en/model_doc/timesfm.md b/docs/source/en/model_doc/timesfm.md index 83dee48e71b..e8938202ee9 100644 --- a/docs/source/en/model_doc/timesfm.md +++ b/docs/source/en/model_doc/timesfm.md @@ -25,16 +25,13 @@ rendered properly in your Markdown viewer. TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in [A decoder-only foundation model for time-series forecasting](https://huggingface.co/papers/2310.10688) by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder only model that uses non-overlapping patches of time-series data as input and outputs some output patch length prediction in an autoregressive fashion. - The abstract from the paper is the following: *Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.* - This model was contributed by [kashif](https://huggingface.co/kashif). The original code can be found [here](https://github.com/google-research/timesfm). - To use the model: ```python diff --git a/docs/source/en/model_doc/transfo-xl.md b/docs/source/en/model_doc/transfo-xl.md index 5d9b92f7946..0bd1b0f57e1 100644 --- a/docs/source/en/model_doc/transfo-xl.md +++ b/docs/source/en/model_doc/transfo-xl.md @@ -90,7 +90,6 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o - Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention scores. 
This allows the model to pay attention to information that was in the previous segment as well as the current one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments. - This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would give the same results in the current input and the current hidden state at a given position) and needs to make some adjustments in the way attention scores are computed. - TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035) diff --git a/docs/source/en/model_doc/trocr.md b/docs/source/en/model_doc/trocr.md index 6346977dafa..da5c71edde3 100644 --- a/docs/source/en/model_doc/trocr.md +++ b/docs/source/en/model_doc/trocr.md @@ -14,8 +14,6 @@ rendered properly in your Markdown viewer. specific language governing permissions and limitations under the License. --> *This model was released on 2021-09-21 and added to Hugging Face Transformers on 2021-10-13.* - -
PyTorch @@ -32,13 +30,11 @@ You can find all the original TrOCR checkpoints under the [Microsoft](https://hu alt="drawing" width="600"/> TrOCR architecture. Taken from the original paper. - > [!TIP] > This model was contributed by [nielsr](https://huggingface.co/nielsr). > > Click on the TrOCR models in the right sidebar for more examples of how to apply TrOCR to different image and text tasks. - The example below demonstrates how to perform optical character recognition (OCR) with the [`AutoModel`] class. @@ -113,7 +109,6 @@ print(generated_text) - A notebook on [inference with TrOCR](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Inference_with_TrOCR_%2B_Gradio_demo.ipynb) and Gradio demo. - A notebook on [evaluating TrOCR on the IAM test set](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Evaluating_TrOCR_base_handwritten_on_the_IAM_test_set.ipynb). - ## TrOCRConfig [[autodoc]] TrOCRConfig diff --git a/docs/source/en/model_doc/tvp.md b/docs/source/en/model_doc/tvp.md index 49a538ffa8c..2df4da02555 100644 --- a/docs/source/en/model_doc/tvp.md +++ b/docs/source/en/model_doc/tvp.md @@ -47,6 +47,7 @@ The [`TvpProcessor`] wraps [`BertTokenizer`] and [`TvpImageProcessor`] into a si encode the text and prepare the images respectively. The following example shows how to run temporal video grounding using [`TvpProcessor`] and [`TvpForVideoGrounding`]. + ```python import av import cv2 @@ -165,7 +166,6 @@ Tips: - Checkpoints for pre-trained [tvp-base](https://huggingface.co/Intel/tvp-base) is released. - Please refer to [Table 2](https://huggingface.co/papers/2303.04995) for TVP's performance on Temporal Video Grounding task. - ## TvpConfig [[autodoc]] TvpConfig diff --git a/docs/source/en/model_doc/umt5.md b/docs/source/en/model_doc/umt5.md index 349dcecf03c..784cc9974df 100644 --- a/docs/source/en/model_doc/umt5.md +++ b/docs/source/en/model_doc/umt5.md @@ -39,7 +39,7 @@ Google has released the following variants: This model was contributed by [agemagician](https://huggingface.co/agemagician) and [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/google-research/t5x). -## Usage tips +## Usage tips - UMT5 was only pre-trained on [mC4](https://huggingface.co/datasets/mc4) excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5 model. @@ -67,7 +67,7 @@ The conversion script is also different because the model was saved in t5x's lat ['nyone who drink a alcohol A A. This I'] ``` - + Refer to [T5's documentation page](t5) for more tips, code examples and notebooks. @@ -105,4 +105,3 @@ Refer to [T5's documentation page](t5) for more tips, code examples and notebook [[autodoc]] UMT5ForQuestionAnswering - forward - diff --git a/docs/source/en/model_doc/univnet.md b/docs/source/en/model_doc/univnet.md index e20bc5c405e..7a580692833 100644 --- a/docs/source/en/model_doc/univnet.md +++ b/docs/source/en/model_doc/univnet.md @@ -69,7 +69,6 @@ write("sample_audio.wav", feature_extractor.sampling_rate, audio) This model was contributed by [dg845](https://huggingface.co/dg845). To the best of my knowledge, there is no official code release, but an unofficial implementation can be found at [maum-ai/univnet](https://github.com/maum-ai/univnet) with pretrained checkpoints [here](https://github.com/maum-ai/univnet#pre-trained-model). 
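The TrOCR inference example referenced in this hunk is truncated by the diff context. For orientation, a minimal handwritten-OCR sketch with [`TrOCRProcessor`] and [`VisionEncoderDecoderModel`] could look as follows; the `microsoft/trocr-base-handwritten` checkpoint and the sample image URL are illustrative choices, not part of the patch:

```python
import requests
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# The processor bundles the image processor and tokenizer; the model pairs an image encoder with a text decoder
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Any RGB image containing a single line of text works; this one is a handwritten sample
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```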
- ## UnivNetConfig [[autodoc]] UnivNetConfig diff --git a/docs/source/en/model_doc/van.md b/docs/source/en/model_doc/van.md index 0e07e314bee..0a4ded43021 100644 --- a/docs/source/en/model_doc/van.md +++ b/docs/source/en/model_doc/van.md @@ -74,4 +74,3 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] VanForImageClassification - forward - diff --git a/docs/source/en/model_doc/vaultgemma.md b/docs/source/en/model_doc/vaultgemma.md index 94d28cc8afe..9d39a5eb7ee 100644 --- a/docs/source/en/model_doc/vaultgemma.md +++ b/docs/source/en/model_doc/vaultgemma.md @@ -12,7 +12,6 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. - ⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. --> @@ -45,7 +44,6 @@ command line. - ```python from transformers import pipeline diff --git a/docs/source/en/model_doc/video_llava.md b/docs/source/en/model_doc/video_llava.md index 6b09367f37c..5b792b33733 100644 --- a/docs/source/en/model_doc/video_llava.md +++ b/docs/source/en/model_doc/video_llava.md @@ -27,7 +27,6 @@ rendered properly in your Markdown viewer. Video-LLaVa is an open-source multimodal LLM trained by fine-tuning LlamA/Vicuna on multimodal instruction-following data generated by Llava1.5 and VideChat. It is an auto-regressive language model, based on the transformer architecture. Video-LLaVa unifies visual representations to the language feature space, and enables an LLM to perform visual reasoning capabilities on both images and videos simultaneously. - The Video-LLaVA model was proposed in [Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://huggingface.co/papers/2311.10122) by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munang Ning, Peng Jin, Li Yuan. The abstract from the paper is the following: @@ -55,18 +54,16 @@ for the LLM* - Note the model has not been explicitly trained to process multiple images/videos in the same prompt, although this is technically possible, you may experience inaccurate results. -- Note that the video inputs should have exactly 8 frames at the input, since the models were trained in that setting. +- Note that the video inputs should have exactly 8 frames at the input, since the models were trained in that setting. This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay). The original code can be found [here](https://github.com/PKU-YuanGroup/Video-LLaVA). - > [!NOTE] > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add the attributes to the processor if you own the model checkpoint, or open a PR if it is not owned by you. Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be failure when merging the embeddings. The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. 
The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches. - ## Usage example ### Single Media Mode @@ -126,7 +123,7 @@ For multiple turns conversation change the prompt format to: ### Mixed Media Mode -The model can also generate from an interleaved image-video inputs. However note, that it was not trained in interleaved image-video setting which might affect the performance. Below is an example usage for mixed media input, add the following lines to the above code snippet: +The model can also generate from interleaved image and video inputs. However, note that it was not trained in an interleaved image-video setting, which might affect the performance. Below is an example usage for mixed media input; add the following lines to the above code snippet: ```python from PIL import Image @@ -150,7 +147,7 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza ### Quantization using Bitsandbytes for memory efficiency -The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. his allows for efficient deployment on resource-constrained cases. +The model can be loaded in lower bits, significantly reducing the memory burden while maintaining the performance of the original model. This allows for efficient deployment in resource-constrained cases. First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a GPU/accelerator that is supported by the library. @@ -164,7 +161,6 @@ We value your feedback to help identify bugs before the full release! Check out Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below: - ```python from transformers import VideoLlavaForConditionalGeneration, BitsAndBytesConfig quantization_config = BitsAndBytesConfig( @@ -178,7 +174,6 @@ quantization_config = BitsAndBytesConfig( model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf", quantization_config=quantization_config, device_map="auto") ``` - ### Flash-Attention 2 to speed-up generation Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model. @@ -203,7 +198,6 @@ model = VideoLlavaForConditionalGeneration.from_pretrained( ).to(0) ``` - ## VideoLlavaConfig [[autodoc]] VideoLlavaConfig @@ -212,7 +206,6 @@ model = VideoLlavaForConditionalGeneration.from_pretrained( [[autodoc]] VideoLlavaImageProcessor - ## VideoLlavaVideoProcessor [[autodoc]] VideoLlavaVideoProcessor diff --git a/docs/source/en/model_doc/videomae.md b/docs/source/en/model_doc/videomae.md index e0ebbaa4288..44fc8b8b5be 100644 --- a/docs/source/en/model_doc/videomae.md +++ b/docs/source/en/model_doc/videomae.md @@ -42,13 +42,13 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE). ## Using Scaled Dot Product Attention (SDPA) -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) +PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`.
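Stepping back to the Video-LLaVA processor warning near the top of this hunk, here is a minimal sketch of the suggested fix. The attribute names come from the warning itself; the value `1` for `num_additional_image_tokens` assumes a vision backbone that adds a CLS token, so verify it against your own checkpoint:

```python
from transformers import VideoLlavaConfig, VideoLlavaProcessor

model_id = "LanguageBind/Video-LLaVA-7B-hf"
config = VideoLlavaConfig.from_pretrained(model_id)
processor = VideoLlavaProcessor.from_pretrained(model_id)

# Copy the values the processor needs from the model config, as the warning suggests
processor.patch_size = config.vision_config.patch_size
processor.vision_feature_select_strategy = config.vision_feature_select_strategy
processor.num_additional_image_tokens = 1  # assumption: 1 if the vision backbone adds a CLS token, else 0

# processor.save_pretrained(...) or processor.push_to_hub(...) persists the fix if you own the checkpoint
```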
This function +encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the +[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information. -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set +SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. ``` diff --git a/docs/source/en/model_doc/vipllava.md b/docs/source/en/model_doc/vipllava.md index 0d0a209c27a..fc4aec6ae9b 100644 --- a/docs/source/en/model_doc/vipllava.md +++ b/docs/source/en/model_doc/vipllava.md @@ -37,7 +37,6 @@ The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA). This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) - ## Usage tips: - The architecture is similar to the LLaVA architecture, except that the multi-modal projector takes a set of concatenated vision hidden states and has an additional layernorm layer on that module. @@ -51,7 +50,6 @@ This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure that the text is not truncated as otherwise there will be failure when merging the embeddings. The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches. - - For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. For that you need to construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities, as follows: ```python @@ -88,16 +86,17 @@ print(text_prompt) ``` - If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by VipLLaVa checkpoints: + ```bash A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: \n###Assistant: ``` For a multi-turn conversation: + ```bash A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: \n###Assistant: ###Human: ###Assistant: ``` - ## VipLlavaConfig [[autodoc]] VipLlavaConfig diff --git a/docs/source/en/model_doc/visual_bert.md b/docs/source/en/model_doc/visual_bert.md index 7a7ac24e4db..a9912144c4f 100644 --- a/docs/source/en/model_doc/visual_bert.md +++ b/docs/source/en/model_doc/visual_bert.md @@ -27,7 +27,6 @@ rendered properly in your Markdown viewer.
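The SDPA sections on this page and on the ViT Hybrid, ViT MSN, and ViViT pages below all reduce to the same loading pattern. A minimal sketch for VideoMAE, assuming the `MCG-NJU/videomae-base` checkpoint:

```python
import torch
from transformers import VideoMAEModel

# Explicitly request PyTorch's scaled dot-product attention; half precision gives the largest speedup
model = VideoMAEModel.from_pretrained(
    "MCG-NJU/videomae-base",
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
)
```

The same `attn_implementation="sdpa"` argument works for the other model classes covered below.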
You can find all the original VisualBERT checkpoints under the [UCLA NLP](https://huggingface.co/uclanlp/models?search=visualbert) organization. - > [!TIP] > This model was contributed by [gchhablani](https://huggingface.co/gchhablani). > Click on the VisualBERT models in the right sidebar for more examples of how to apply VisualBERT to different image and language tasks. diff --git a/docs/source/en/model_doc/vit_hybrid.md b/docs/source/en/model_doc/vit_hybrid.md index 86c2c7229f5..15fa6fad474 100644 --- a/docs/source/en/model_doc/vit_hybrid.md +++ b/docs/source/en/model_doc/vit_hybrid.md @@ -55,13 +55,13 @@ found [here](https://github.com/google-research/vision_transformer). ## Using Scaled Dot Product Attention (SDPA) -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) +PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function +encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the +[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information. -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set +SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. ``` diff --git a/docs/source/en/model_doc/vit_mae.md b/docs/source/en/model_doc/vit_mae.md index b8b9867e881..1099019a842 100644 --- a/docs/source/en/model_doc/vit_mae.md +++ b/docs/source/en/model_doc/vit_mae.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2021-11-11 and added to Hugging Face Transformers on 2022-01-18.* -
PyTorch diff --git a/docs/source/en/model_doc/vit_msn.md b/docs/source/en/model_doc/vit_msn.md index 5b727f34256..6d10dd59a99 100644 --- a/docs/source/en/model_doc/vit_msn.md +++ b/docs/source/en/model_doc/vit_msn.md @@ -40,11 +40,11 @@ while producing representations of a high semantic level that perform competitiv on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark.* -drawing +drawing MSN architecture. Taken from the original paper. -This model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/facebookresearch/msn). +This model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/facebookresearch/msn). ## Usage tips @@ -58,13 +58,13 @@ labels when fine-tuned. ### Using Scaled Dot Product Attention (SDPA) -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) +PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function +encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the +[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information. -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set +SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. ``` diff --git a/docs/source/en/model_doc/vits.md b/docs/source/en/model_doc/vits.md index 2c1777b77f1..664edcb92ae 100644 --- a/docs/source/en/model_doc/vits.md +++ b/docs/source/en/model_doc/vits.md @@ -156,4 +156,3 @@ Audio(waveform, rate=model.config.sampling_rate) [[autodoc]] VitsModel - forward - diff --git a/docs/source/en/model_doc/vivit.md b/docs/source/en/model_doc/vivit.md index 041f80f61ae..9ee5a10a19f 100644 --- a/docs/source/en/model_doc/vivit.md +++ b/docs/source/en/model_doc/vivit.md @@ -32,13 +32,13 @@ This model was contributed by [jegormeister](https://huggingface.co/jegormeister ### Using Scaled Dot Product Attention (SDPA) -PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function -encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the -[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) +PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function +encompasses several implementations that can be applied depending on the inputs and the hardware in use. 
See the +[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) page for more information. -SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set +SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used. ``` @@ -56,8 +56,6 @@ On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` |---------------------:|-------------:|----------:|--------------:|----------------------:|---------------------:|-----------------:| | 100 | 1 | True | 7.122 | 2575.28 | 5932.54 | 130.364 | - - ### Inference | num_batches | batch_size | is cuda | is half | Speedup (%) | Mem eager (MB) | Mem BT (MB) | Mem saved (%) | |---------------|--------------|-----------|-----------|---------------|------------------|---------------|-----------------| @@ -65,7 +63,6 @@ On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with `float32` | 20 | 2 | True | False | 17.146 | 1234.75 | 447.175 | 176.122 | | 20 | 4 | True | False | 18.093 | 2275.82 | 709.864 | 220.6 | | 20 | 8 | True | False | 19.284 | 4358.19 | 1233.24 | 253.393 | - ## VivitConfig diff --git a/docs/source/en/model_doc/vjepa2.md b/docs/source/en/model_doc/vjepa2.md index 93960f05189..049c7ff98f2 100644 --- a/docs/source/en/model_doc/vjepa2.md +++ b/docs/source/en/model_doc/vjepa2.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2025-06-11 and added to Hugging Face Transformers on 2025-06-11.* -
PyTorch @@ -34,7 +33,6 @@ rendered properly in your Markdown viewer. You can find all original V-JEPA2 checkpoints under the [V-JEPA 2](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6) collection. - This model was contributed by [koustuvs](https://huggingface.co/koustuvs), [yonigozlan](https://huggingface.co/yonigozlan) and [qubvel](https://huggingface.co/qubvel-hf). The original code can be found [here](https://github.com/facebookresearch/vjepa2). ## Usage example diff --git a/docs/source/en/model_doc/voxtral.md b/docs/source/en/model_doc/voxtral.md index 71f0661c827..56fc84d30d0 100644 --- a/docs/source/en/model_doc/voxtral.md +++ b/docs/source/en/model_doc/voxtral.md @@ -43,6 +43,7 @@ Voxtral builds on Ministral-3B by adding audio processing capabilities: The model supports audio-text instructions, including multi-turn and multi-audio interactions, all processed in batches. ➡️ audio + text instruction + ```python import torch from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device @@ -78,7 +79,8 @@ print(decoded_outputs[0]) print("=" * 80) ``` -➡️ multi-audio + text instruction +➡️ multi-audio + text instruction + ```python import torch from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device @@ -119,6 +121,7 @@ print("=" * 80) ``` ➡️ multi-turn: + ```python import torch from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device @@ -173,6 +176,7 @@ print("=" * 80) ``` ➡️ text only: + ```python import torch from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device @@ -208,6 +212,7 @@ print("=" * 80) ``` ➡️ audio only: + ```python import torch from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device @@ -243,6 +248,7 @@ print("=" * 80) ``` ➡️ batched inference! + ```python import torch from transformers import VoxtralForConditionalGeneration, AutoProcessor, infer_device() diff --git a/docs/source/en/model_doc/wav2vec2-bert.md b/docs/source/en/model_doc/wav2vec2-bert.md index 4edb67498aa..4a2c8de89c3 100644 --- a/docs/source/en/model_doc/wav2vec2-bert.md +++ b/docs/source/en/model_doc/wav2vec2-bert.md @@ -54,7 +54,6 @@ This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The o - [`Wav2Vec2BertForSequenceClassification`] can be used by adapting this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification). - See also: [Audio classification task guide](../tasks/audio_classification) - ## Wav2Vec2BertConfig [[autodoc]] Wav2Vec2BertConfig diff --git a/docs/source/en/model_doc/wav2vec2-conformer.md b/docs/source/en/model_doc/wav2vec2-conformer.md index e2a56b450df..663b6163011 100644 --- a/docs/source/en/model_doc/wav2vec2-conformer.md +++ b/docs/source/en/model_doc/wav2vec2-conformer.md @@ -38,7 +38,7 @@ Note: Meta (FAIR) released a new version of [Wav2Vec2-BERT 2.0](https://huggingf - Wav2Vec2-Conformer follows the same architecture as Wav2Vec2, but replaces the *Attention*-block with a *Conformer*-block as introduced in [Conformer: Convolution-augmented Transformer for Speech Recognition](https://huggingface.co/papers/2005.08100). -- For the same number of layers, Wav2Vec2-Conformer requires more parameters than Wav2Vec2, but also yields +- For the same number of layers, Wav2Vec2-Conformer requires more parameters than Wav2Vec2, but also yields an improved word error rate. 
- Wav2Vec2-Conformer uses the same tokenizer and feature extractor as Wav2Vec2. - Wav2Vec2-Conformer can use either no relative position embeddings, Transformer-XL-like position embeddings, or diff --git a/docs/source/en/model_doc/wav2vec2.md b/docs/source/en/model_doc/wav2vec2.md index 6c4772f90bc..1f5f4a90576 100644 --- a/docs/source/en/model_doc/wav2vec2.md +++ b/docs/source/en/model_doc/wav2vec2.md @@ -80,13 +80,10 @@ model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h-lv60-self", Below is an expected speedup diagram comparing the pure inference time between the native implementation in transformers of the `facebook/wav2vec2-large-960h-lv60-self` model and the flash-attention-2 and sdpa (scale-dot-product-attention) versions. . We show the average speedup obtained on the `librispeech_asr` `clean` validation split: -
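For context on what the speedup figures above are measuring, the underlying workload is a standard CTC transcription pass. A minimal sketch with the same `facebook/wav2vec2-large-960h-lv60-self` checkpoint; the dummy LibriSpeech split is used here only as a convenient source of a 16 kHz waveform:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # CTC logits over the character vocabulary

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```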
- - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Wav2Vec2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. diff --git a/docs/source/en/model_doc/wav2vec2_phoneme.md b/docs/source/en/model_doc/wav2vec2_phoneme.md index fe989def3bd..c2621f8924c 100644 --- a/docs/source/en/model_doc/wav2vec2_phoneme.md +++ b/docs/source/en/model_doc/wav2vec2_phoneme.md @@ -53,7 +53,6 @@ The original code can be found [here](https://github.com/pytorch/fairseq/tree/ma - By default, the model outputs a sequence of phonemes. In order to transform the phonemes to a sequence of words one should make use of a dictionary and language model. - Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model, for API reference, check out [`Wav2Vec2`](wav2vec2)'s documentation page diff --git a/docs/source/en/model_doc/whisper.md b/docs/source/en/model_doc/whisper.md index 673085ac3e7..5e19e870bdd 100644 --- a/docs/source/en/model_doc/whisper.md +++ b/docs/source/en/model_doc/whisper.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2022-12-06 and added to Hugging Face Transformers on 2022-10-05.* -
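To make the Wav2Vec2Phoneme tip concrete, a minimal sketch assuming the `facebook/wav2vec2-lv-60-espeak-cv-ft` checkpoint. The decoding mirrors the Wav2Vec2 sketch above, but the output stays at the phoneme level; mapping it to words requires the dictionary and language model mentioned in the tip:

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, Wav2Vec2ForCTC

checkpoint = "facebook/wav2vec2-lv-60-espeak-cv-ft"
processor = AutoProcessor.from_pretrained(checkpoint)  # wraps a phoneme-level CTC tokenizer
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])  # a string of phonemes, not words
```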
PyTorch diff --git a/docs/source/en/model_doc/xcodec.md b/docs/source/en/model_doc/xcodec.md index c4a0b92a26f..ca6d6e473fc 100644 --- a/docs/source/en/model_doc/xcodec.md +++ b/docs/source/en/model_doc/xcodec.md @@ -33,7 +33,7 @@ The X-Codec model is a neural audio codec that integrates semantic information f The abstract of the paper states the following: -*Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.* +*Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. 
Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.* Model cards: - [xcodec-hubert-librispeech](https://huggingface.co/hf-audio/xcodec-hubert-librispeech) (for speech) @@ -46,12 +46,11 @@ This model was contributed by [Manal El Aidouni](https://huggingface.co/Manel). Demos can be found on this [page](https://x-codec-audio.github.io/). - -## Usage example +## Usage example Here is a quick example of how to encode and decode an audio using this model: -```python +```python from datasets import load_dataset, Audio from transformers import XcodecModel, AutoFeatureExtractor dummy_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") @@ -75,6 +74,7 @@ audio_values = decoder_outputs.audio_values audio_values = model(inputs["input_values"]).audio_values ``` + To listen to the original and reconstructed audio, run the snippet below and then open the generated `original.wav` and `reconstruction.wav` files in your music player to compare. ```python @@ -88,12 +88,10 @@ sf.write("original.wav", original, sampling_rate) sf.write("reconstruction.wav", reconstruction.T, sampling_rate) ``` - ## XcodecConfig [[autodoc]] XcodecConfig - ## XcodecModel [[autodoc]] XcodecModel diff --git a/docs/source/en/model_doc/xglm.md b/docs/source/en/model_doc/xglm.md index 9a9170d29b7..370055c90ea 100644 --- a/docs/source/en/model_doc/xglm.md +++ b/docs/source/en/model_doc/xglm.md @@ -44,7 +44,6 @@ showing in particular that it enables cross-lingual in-context learning on some on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models in social value tasks such as hate speech detection in five languages and find it has limitations similar to comparable sized GPT-3 models.* - This model was contributed by [Suraj](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/xglm). ## Resources @@ -67,7 +66,6 @@ This model was contributed by [Suraj](https://huggingface.co/valhalla). The orig [[autodoc]] XGLMTokenizerFast - ## XGLMModel [[autodoc]] XGLMModel diff --git a/docs/source/en/model_doc/xlm-prophetnet.md b/docs/source/en/model_doc/xlm-prophetnet.md index 4dad4c0afa7..fbf47d8c422 100644 --- a/docs/source/en/model_doc/xlm-prophetnet.md +++ b/docs/source/en/model_doc/xlm-prophetnet.md @@ -41,7 +41,6 @@ You can do so by running the following command: `pip install -U transformers==4. 
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign @patrickvonplaten - ## Overview The XLM-ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](https://huggingface.co/papers/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei diff --git a/docs/source/en/model_doc/xlm-roberta-xl.md b/docs/source/en/model_doc/xlm-roberta-xl.md index 8ae33e8b286..5e1f0bbda28 100644 --- a/docs/source/en/model_doc/xlm-roberta-xl.md +++ b/docs/source/en/model_doc/xlm-roberta-xl.md @@ -77,6 +77,7 @@ predicted_token = tokenizer.decode(predicted_token_id) print(f"The predicted token is: {predicted_token}") ``` + @@ -84,6 +85,7 @@ print(f"The predicted token is: {predicted_token}") ```bash echo -e "Plants create through a process known as photosynthesis." | transformers run --task fill-mask --model facebook/xlm-roberta-xl --device 0 ``` + diff --git a/docs/source/en/model_doc/xlm-roberta.md b/docs/source/en/model_doc/xlm-roberta.md index 65468a786a0..0e986763689 100644 --- a/docs/source/en/model_doc/xlm-roberta.md +++ b/docs/source/en/model_doc/xlm-roberta.md @@ -87,6 +87,7 @@ print(f"The predicted token is: {predicted_token}") ```bash echo -e "Plants create through a process known as photosynthesis." | transformers run --task fill-mask --model FacebookAI/xlm-roberta-base --device 0 ``` + diff --git a/docs/source/en/model_doc/xlm.md b/docs/source/en/model_doc/xlm.md index b4d84c791f5..ff8f8c46024 100644 --- a/docs/source/en/model_doc/xlm.md +++ b/docs/source/en/model_doc/xlm.md @@ -79,6 +79,7 @@ print(f"Predicted token: {predicted_token}") ```bash echo -e "Plants create through a process known as photosynthesis." | transformers run --task fill-mask --model FacebookAI/xlm-mlm-en-2048 --device 0 ``` + diff --git a/docs/source/en/model_doc/xlstm.md b/docs/source/en/model_doc/xlstm.md index b239d631fbb..e1ba3195ecc 100644 --- a/docs/source/en/model_doc/xlstm.md +++ b/docs/source/en/model_doc/xlstm.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2024-05-07 and added to Hugging Face Transformers on 2025-07-25.* - # xLSTM ## Overview @@ -32,7 +31,6 @@ The abstract from the paper is the following: This model was contributed by [NX-AI](https://huggingface.co/NX-AI). The original code can be found [here](https://github.com/NX-AI/xlstm). - ## xLSTMConfig [[autodoc]] xLSTMConfig diff --git a/docs/source/en/model_doc/yolos.md b/docs/source/en/model_doc/yolos.md index 5c31b539e59..666f9674332 100644 --- a/docs/source/en/model_doc/yolos.md +++ b/docs/source/en/model_doc/yolos.md @@ -26,14 +26,12 @@ rendered properly in your Markdown viewer. [YOLOS](https://huggingface.co/papers/2106.00666) uses a [Vision Transformer (ViT)](./vit) for object detection with minimal modifications and region priors. It can achieve performance comparable to specialized object detection models and frameworks with knowledge about 2D spatial structures. - You can find all the original YOLOS checkpoints under the [HUST Vision Lab](https://huggingface.co/hustvl/models?search=yolos) organization. drawing YOLOS architecture. Taken from the original paper. - > [!TIP] > This model was contributed by [nielsr](https://huggingface.co/nielsr). > Click on the YOLOS models in the right sidebar for more examples of how to apply YOLOS to different object detection tasks.
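The YOLOS example in this hunk is truncated by the diff context. For orientation, a minimal detection sketch with the `hustvl/yolos-tiny` checkpoint; note that, unlike DETR, no `pixel_mask` is required:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, YolosForObjectDetection

image_processor = AutoImageProcessor.from_pretrained("hustvl/yolos-tiny")
model = YolosForObjectDetection.from_pretrained("hustvl/yolos-tiny")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the predicted boxes to the original image size and keep confident detections
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(c, 1) for c in box.tolist()])
```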
@@ -98,7 +96,6 @@ for score, label, box in zip(filtered_scores, filtered_labels, pixel_boxes): - ## Notes - Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](./detr), YOLOS doesn't require a `pixel_mask`. diff --git a/docs/source/en/model_doc/yoso.md b/docs/source/en/model_doc/yoso.md index f07e5aba082..8e121dd88cd 100644 --- a/docs/source/en/model_doc/yoso.md +++ b/docs/source/en/model_doc/yoso.md @@ -26,20 +26,20 @@ rendered properly in your Markdown viewer. The YOSO model was proposed in [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://huggingface.co/papers/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. YOSO approximates standard softmax self-attention via a Bernoulli sampling scheme based on Locality Sensitive Hashing (LSH). In principle, all the Bernoulli random variables can be sampled with -a single hash. +a single hash. The abstract from the paper is the following: -*Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is -the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically -on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling -attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity of such models to linear. -We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random -variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant). -This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of -LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence -length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, -for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable +*Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is +the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically +on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling +attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity of such models to linear. +We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random +variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant). +This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of +LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence +length where we see favorable performance relative to a standard pretrained Transformer. 
On the Long Range Arena (LRA) benchmark, +for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods. Our code is available at this https URL* This model was contributed by [novice03](https://huggingface.co/novice03). The original code can be found [here](https://github.com/mlpen/YOSO). @@ -50,12 +50,12 @@ This model was contributed by [novice03](https://huggingface.co/novice03). The o in parallel on a GPU. - The kernels provide a `fast_hash` function, which approximates the random projections of the queries and keys using the Fast Hadamard Transform. Using these hash codes, the `lsh_cumulation` function approximates self-attention via LSH-based Bernoulli sampling. -- To use the custom kernels, the user should set `config.use_expectation = False`. To ensure that the kernels are compiled successfully, -the user must install the correct version of PyTorch and cudatoolkit. By default, `config.use_expectation = True`, which uses YOSO-E and +- To use the custom kernels, the user should set `config.use_expectation = False`. To ensure that the kernels are compiled successfully, +the user must install the correct version of PyTorch and cudatoolkit. By default, `config.use_expectation = True`, which uses YOSO-E and does not require compiling CUDA kernels. +alt="drawing" width="600"/> YOSO Attention Algorithm. Taken from the original paper. diff --git a/docs/source/en/model_doc/zamba.md b/docs/source/en/model_doc/zamba.md index bb974080770..635bc76fb0c 100644 --- a/docs/source/en/model_doc/zamba.md +++ b/docs/source/en/model_doc/zamba.md @@ -24,7 +24,6 @@ rendered properly in your Markdown viewer. This model was contributed by [pglo](https://huggingface.co/pglo). - ## Model details Zamba-7B-v1 is a hybrid between state-space models (specifically [Mamba](https://github.com/state-spaces/mamba)) and transformer, and was trained using next-token prediction. Zamba uses a shared transformer layer after every 6 mamba blocks. It uses the [Mistral v0.1 tokenizer](https://huggingface.co/mistralai/Mistral-7B-v0.1). We came to this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data. @@ -33,23 +32,24 @@ Zamba-7B-v1 is a hybrid between state-space models (Specifically [Mamba](https:/ ## Quick start - ### Prerequisites Zamba requires you to use `transformers` version 4.46.0 or higher: + ```bash pip install transformers>=4.45.0 ``` In order to run optimized Mamba implementations, you first need to install `mamba-ssm` and `causal-conv1d`: + ```bash pip install mamba-ssm causal-conv1d>=1.2.0 ``` + The model also has to be on a CUDA device. You can run the model without the optimized Mamba kernels, but it is **not** recommended, as it will result in significantly higher latency. In order to do that, you'll need to specify `use_mamba_kernels=False` when loading the model. - ## Inference ```python @@ -66,39 +66,32 @@ outputs = model.generate(**input_ids, max_new_tokens=100) print(tokenizer.decode(outputs[0])) ``` - ## Model card The model cards can be found at: * [Zamba-7B](https://huggingface.co/Zyphra/Zamba-7B-v1) - ## Issues For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/Zyphra/Zamba-7B-v1/discussions) - ## License The model weights are open-sourced via an Apache 2.0 license.
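Before the Zamba API reference below, a minimal sketch of the kernel-free fallback mentioned in the quick start above; without the optimized kernels the model falls back to the pure PyTorch Mamba path and still runs, just noticeably slower:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba-7B-v1")
# use_mamba_kernels=False skips the optimized CUDA kernels (mamba-ssm / causal-conv1d)
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba-7B-v1",
    torch_dtype=torch.bfloat16,
    use_mamba_kernels=False,
)

input_ids = tokenizer("What factors contributed to the fall of the Roman Empire?", return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```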
- ## ZambaConfig [[autodoc]] ZambaConfig - ## ZambaModel [[autodoc]] ZambaModel - forward - ## ZambaForCausalLM [[autodoc]] ZambaForCausalLM - forward - ## ZambaForSequenceClassification [[autodoc]] transformers.ZambaForSequenceClassification diff --git a/docs/source/en/model_doc/zamba2.md b/docs/source/en/model_doc/zamba2.md index ba4324366a9..7296ef1b250 100644 --- a/docs/source/en/model_doc/zamba2.md +++ b/docs/source/en/model_doc/zamba2.md @@ -26,7 +26,6 @@ rendered properly in your Markdown viewer. This model was contributed by [pglo](https://huggingface.co/pglo). - ## Model details [Zamba2-1.2B](https://www.zyphra.com/post/zamba2-mini), [Zamba2-2.7B](https://www.zyphra.com/post/zamba2-small) and [Zamba2-7B](https://www.zyphra.com/post/zamba2-7b) are hybrid models combining state-space models (specifically [Mamba2](https://github.com/state-spaces/mamba)) and transformer, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 mamba blocks. It uses the [Mistral v0.1 tokenizer](https://huggingface.co/mistralai/Mistral-7B-v0.1). We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively. @@ -35,10 +34,10 @@ This model was contributed by [pglo](https://huggingface.co/pglo). ## Quick start - ### Prerequisites Zamba2 requires you to use `transformers` version 4.48.0 or higher: + ```bash pip install transformers>=4.48.0 ``` @@ -59,7 +58,6 @@ outputs = model.generate(**input_ids, max_new_tokens=100) print(tokenizer.decode(outputs[0])) ``` - ## Model card The model cards can be found at: @@ -67,33 +65,27 @@ The model cards can be found at: * [Zamba2-2.7B](https://huggingface.co/Zyphra/Zamba2-2.7B) * [Zamba2-7B](https://huggingface.co/Zyphra/Zamba2-7B) - ## Issues For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/Zyphra/Zamba2-7B/discussions) - ## License The model weights are open-sourced via an Apache 2.0 license. - ## Zamba2Config [[autodoc]] Zamba2Config - ## Zamba2Model [[autodoc]] Zamba2Model - forward - ## Zamba2ForCausalLM [[autodoc]] Zamba2ForCausalLM - forward - ## Zamba2ForSequenceClassification [[autodoc]] transformers.Zamba2ForSequenceClassification diff --git a/docs/source/en/model_doc/zoedepth.md b/docs/source/en/model_doc/zoedepth.md index 367c630a322..5252d2b4d36 100644 --- a/docs/source/en/model_doc/zoedepth.md +++ b/docs/source/en/model_doc/zoedepth.md @@ -15,7 +15,6 @@ rendered properly in your Markdown viewer. --> *This model was released on 2023-02-23 and added to Hugging Face Transformers on 2024-07-08.* -
PyTorch @@ -97,6 +96,7 @@ Image.fromarray(depth.astype("uint8")) ## Notes - In the [original implementation](https://github.com/isl-org/ZoeDepth/blob/edb6daf45458569e24f50250ef1ed08c015f17a7/zoedepth/models/depth_model.py#L131) ZoeDepth performs inference on both the original and flipped images and averages the results. The `post_process_depth_estimation` function handles this by passing the flipped outputs to the optional `outputs_flipped` argument as shown below. + ```py with torch.no_grad(): outputs = model(pixel_values) @@ -107,7 +107,7 @@ Image.fromarray(depth.astype("uint8")) outputs_flipped=outputs_flipped, ) ``` - + ## Resources - Refer to this [notebook](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ZoeDepth) for an inference example. diff --git a/docs/source/en/model_memory_anatomy.md b/docs/source/en/model_memory_anatomy.md index 2c6162ed1ca..9b2e4b4b622 100644 --- a/docs/source/en/model_memory_anatomy.md +++ b/docs/source/en/model_memory_anatomy.md @@ -16,24 +16,23 @@ limitations under the License. # Model training anatomy -To understand performance optimization techniques that one can apply to improve efficiency of model training -speed and memory utilization, it's helpful to get familiar with how GPU is utilized during training, and how compute +To understand performance optimization techniques that one can apply to improve efficiency of model training +speed and memory utilization, it's helpful to get familiar with how GPU is utilized during training, and how compute intensity varies depending on an operation performed. -Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration, -we'll need to install a few libraries: +Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration, +we'll need to install a few libraries: ```bash pip install transformers datasets accelerate nvidia-ml-py ``` -The `nvidia-ml-py` library allows us to monitor the memory usage of the models from within Python. You might be familiar +The `nvidia-ml-py` library allows us to monitor the memory usage of the models from within Python. You might be familiar with the `nvidia-smi` command in the terminal - this library allows to access the same information in Python directly. -Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. +Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. In total, we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format. - ```py >>> import numpy as np >>> from datasets import Dataset @@ -74,9 +73,9 @@ Let's verify that we start with a free GPU memory: GPU memory occupied: 0 MB. ``` -That looks good: the GPU memory is not occupied as we would expect before we load any models. If that's not the case on -your machine make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by -the user. When a model is loaded to the GPU the kernels are also loaded, which can take up 1-2GB of memory. To see how +That looks good: the GPU memory is not occupied as we would expect before we load any models. If that's not the case on +your machine make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by +the user. When a model is loaded to the GPU the kernels are also loaded, which can take up 1-2GB of memory. 
To see how much it is we load a tiny tensor into the GPU which triggers the kernels to be loaded as well. ```py @@ -92,10 +91,9 @@ We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how muc ## Load Model -First, we load the `google-bert/bert-large-uncased` model. We load the model weights directly to the GPU so that we can check +First, we load the `google-bert/bert-large-uncased` model. We load the model weights directly to the GPU so that we can check how much space just the weights use. - ```py >>> from transformers import AutoModelForSequenceClassification @@ -105,12 +103,11 @@ how much space just the weights use. GPU memory occupied: 2631 MB. ``` -We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific -GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an -optimized fashion that speeds up the usage of the model. Now we can also quickly check if we get the same result +We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific +GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an +optimized fashion that speeds up the usage of the model. Now we can also quickly check if we get the same result as with `nvidia-smi` CLI: - ```bash nvidia-smi ``` @@ -138,8 +135,8 @@ Tue Jan 11 08:58:05 2022 +-----------------------------------------------------------------------------+ ``` -We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. So now we can -start training the model and see how the GPU memory consumption changes. First, we set up a few standard training +We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. So now we can +start training the model and see how the GPU memory consumption changes. First, we set up a few standard training arguments: ```py @@ -154,7 +151,7 @@ default_args = { - If you plan to run multiple experiments, in order to properly clear the memory between experiments, restart the Python + If you plan to run multiple experiments, in order to properly clear the memory between experiments, restart the Python kernel between experiments. @@ -181,9 +178,9 @@ Samples/second: 8.86 GPU memory occupied: 14949 MB. ``` -We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size +We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our -model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model. +model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model. To understand a bit better why this is the case let's have a look at a model's operations and memory needs. ## Anatomy of Model's Operations @@ -206,10 +203,9 @@ This knowledge can be helpful to know when analyzing performance bottlenecks. This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://huggingface.co/papers/2007.00072) - ## Anatomy of Model's Memory -We've seen that training the model uses much more memory than just putting the model on the GPU. 
This is because there +We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there are many components during training that use GPU memory. The components on GPU memory are the following: 1. model weights @@ -219,8 +215,8 @@ are many components during training that use GPU memory. The components on GPU m 5. temporary buffers 6. functionality-specific memory -A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For -inference there are no optimizer states and gradients, so we can subtract those. And thus we end up with 6 bytes per +A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For +inference there are no optimizer states and gradients, so we can subtract those. And thus we end up with 6 bytes per model parameter for mixed precision inference, plus activation memory. Let's look at the details. @@ -244,29 +240,29 @@ Let's look at the details. - size depends on many factors, the key ones being sequence length, hidden size and batch size. -There are the input and output that are being passed and returned by the forward and the backward functions and the +There are the input and output that are being passed and returned by the forward and the backward functions and the forward activations saved for gradient computation. **Temporary Memory** -Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the -moment these could require additional memory and could push to OOM. Therefore, when coding it's crucial to think +Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the +moment these could require additional memory and could push to OOM. Therefore, when coding it's crucial to think strategically about such temporary variables and sometimes to explicitly free those as soon as they are no longer needed. **Functionality-specific memory** -Then, your software could have special memory needs. For example, when generating text using beam search, the software +Then, your software could have special memory needs. For example, when generating text using beam search, the software needs to maintain multiple copies of inputs and outputs. **`forward` vs `backward` Execution Speed** -For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates -into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually -bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward -(e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward, +For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates +into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually +bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward +(e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward, and writes once, gradInput). -As you can see, there are potentially a few places where we could save GPU memory or speed up operations. 
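To make the byte counts above concrete, here is a quick back-of-the-envelope estimate for the `bert-large-uncased` run from earlier in this guide, assuming roughly 340M parameters (consistent with the ~1.3GB of fp32 weights measured above):

```python
# Rough per-parameter memory budget from the breakdown above (mixed precision + AdamW), excluding activations
num_params = 340_000_000          # approximate parameter count of bert-large-uncased
bytes_training = 18               # weights, gradients and AdamW optimizer states in mixed precision
bytes_inference = 6               # no gradients or optimizer states at inference time

print(f"training (w/o activations):  {num_params * bytes_training / 1024**3:.1f} GB")   # ~5.7 GB
print(f"inference (w/o activations): {num_params * bytes_inference / 1024**3:.1f} GB")  # ~1.9 GB
# The gap to the ~15 GB observed during training is dominated by activations,
# temporary buffers and the CUDA kernels themselves.
```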
-Now that you understand what affects GPU utilization and computation speed, refer to -the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about -performance optimization techniques. +As you can see, there are potentially a few places where we could save GPU memory or speed up operations. +Now that you understand what affects GPU utilization and computation speed, refer to +the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about +performance optimization techniques. diff --git a/docs/source/en/models.md b/docs/source/en/models.md index fdfcfba6585..ae5572c0c77 100644 --- a/docs/source/en/models.md +++ b/docs/source/en/models.md @@ -45,7 +45,6 @@ There are two general types of models you can load: 1. A barebones model, like [`AutoModel`] or [`LlamaModel`], that outputs hidden states. 2. A model with a specific *head* attached, like [`AutoModelForCausalLM`] or [`LlamaForCausalLM`], for performing specific tasks. - ## Model classes To get a pretrained model, you need to load the weights into the model. This is done by calling [`~PreTrainedModel.from_pretrained`] which accepts weights from the Hugging Face Hub or a local directory. @@ -111,7 +110,6 @@ You need enough memory to hold two copies of the model weights (random and pretr Transformers reduces some of these memory-related challenges with fast initialization, sharded checkpoints, Accelerate's [Big Model Inference](https://hf.co/docs/accelerate/usage_guides/big_modeling) feature, and supporting lower bit data types. - ### Sharded checkpoints The [`~PreTrainedModel.save_pretrained`] method automatically shards checkpoints larger than 10GB. diff --git a/docs/source/en/perf_train_gaudi.md b/docs/source/en/perf_train_gaudi.md index 2ba792d484a..1ab8957f9d7 100644 --- a/docs/source/en/perf_train_gaudi.md +++ b/docs/source/en/perf_train_gaudi.md @@ -20,14 +20,17 @@ The Intel Gaudi AI accelerator family includes [Intel Gaudi 1](https://habana.ai [`TrainingArguments`], [`Trainer`] and [`Pipeline`] detect and set the backend device to `hpu` if an Intel Gaudi device is available. No additional changes are required to enable training and inference on your device. Some modeling code in Transformers is not optimized for HPU lazy mode. If you encounter any errors, set the environment variable below to use eager mode: + ``` PT_HPU_LAZY_MODE=0 ``` In some cases, you'll also need to enable int64 support to avoid casting issues with long integers: + ``` PT_ENABLE_INT64_SUPPORT=1 ``` + Refer to the [Gaudi docs](https://docs.habana.ai/en/latest/index.html) for more details. > [!TIP] diff --git a/docs/source/en/pipeline_webserver.md b/docs/source/en/pipeline_webserver.md index 0112d116c47..37d245483b9 100644 --- a/docs/source/en/pipeline_webserver.md +++ b/docs/source/en/pipeline_webserver.md @@ -82,6 +82,7 @@ Query the server with a POST request. ```bash curl -X POST -d "Paris is the [MASK] of France." http://localhost:8000/ ``` + This should return the output below. ```bash diff --git a/docs/source/en/pr_checks.md b/docs/source/en/pr_checks.md index a5634c29ee4..7056adf2149 100644 --- a/docs/source/en/pr_checks.md +++ b/docs/source/en/pr_checks.md @@ -52,7 +52,6 @@ or for an editable install: pip install -e .[quality] ``` - ## Tests All the jobs that begin with `ci/circleci: run_tests_` run parts of the Transformers testing suite. 
Each of those jobs focuses on a part of the library in a certain environment: for instance `ci/circleci: run_tests_pipelines` runs the pipeline tests in an environment where all pipeline-related requirements are installed.

diff --git a/docs/source/en/quantization/auto_round.md b/docs/source/en/quantization/auto_round.md
index 15abf9faa84..7526597ee86 100644
--- a/docs/source/en/quantization/auto_round.md
+++ b/docs/source/en/quantization/auto_round.md
@@ -11,18 +11,17 @@ rendered properly in your Markdown viewer.
 
 # AutoRound
 
-[AutoRound](https://github.com/intel/auto-round) is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision. 
-It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps. Designed for broad compatibility, it seamlessly supports a wide range of LLMs and is actively expanding to cover more VLMs as well. 
+[AutoRound](https://github.com/intel/auto-round) is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision.
+It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps. Designed for broad compatibility, it seamlessly supports a wide range of LLMs and is actively expanding to cover more VLMs as well.
 It also supports quantization and inference across multiple hardware platforms, including CPU, XPU, and CUDA.
 
-AutoRound also offers a variety of useful features, including mixed-bit tuning and inference, lm-head quantization, support for exporting to formats like GPTQ/AWQ/GGUF, and flexible tuning recipes. 
+AutoRound also offers a variety of useful features, including mixed-bit tuning and inference, lm-head quantization, support for exporting to formats like GPTQ/AWQ/GGUF, and flexible tuning recipes.
 For a comprehensive overview and the latest updates, check out the AutoRound [README](https://github.com/intel/auto-round).
 
-AutoRound was originally developed as part of the [Intel Neural Compressor](https://github.com/intel/neural-compressor), serving as a general-purpose model compression library for deep learning. 
-It has since evolved into a standalone library focused specifically on low-precision optimization for large language models (LLMs). 
+AutoRound was originally developed as part of the [Intel Neural Compressor](https://github.com/intel/neural-compressor), serving as a general-purpose model compression library for deep learning.
+It has since evolved into a standalone library focused specifically on low-precision optimization for large language models (LLMs).
 AutoRound remains fully integrated with the Intel Neural Compressor, and you can explore the repository for more details.
 
-
 ## Installation
 
 ```bash
@@ -51,6 +50,7 @@ Currently, only offline mode is supported to generate quantized models.
 
 ### Command Line Usage
 
+
 ```bash
 auto-round \
     --model facebook/opt-125m \
@@ -59,7 +59,7 @@ auto-round \
     --output_dir ./tmp_autoround
 ```
 
-AutoRound also offer another two recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively. 
+AutoRound also offers two additional recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively.
 For 2 bits, we recommend using `auto-round-best` or `auto-round`. 
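
Whichever recipe is used, the exported checkpoint can then be loaded back through `transformers` for a quick sanity check. A minimal sketch, assuming `auto-round` is installed and that the quantized files end up directly in the `--output_dir` from the CLI example above (the CLI may nest them in a model-specific subfolder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path assumed from the --output_dir flag above; adjust if the CLI nests the output.
quantized_path = "./tmp_autoround"

model = AutoModelForCausalLM.from_pretrained(quantized_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_path)

inputs = tokenizer("Explain weight quantization in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```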
@@ -99,6 +99,7 @@ autoround.quantize_and_save(output_dir, format='auto_round') ### AutoRoundBest recipe This setting provides the best accuracy in most scenarios but is 4–5× slower than the standard AutoRound recipe. It is especially recommended for 2-bit quantization and is a good choice if sufficient resources are available. + ```python from transformers import AutoModelForCausalLM, AutoTokenizer from auto_round import AutoRound @@ -121,6 +122,7 @@ autoround = AutoRound( output_dir = "./tmp_autoround" autoround.quantize_and_save(output_dir, format='auto_round') ``` + @@ -230,7 +232,7 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=Fal AutoRound automatically selects the backend for each layer based on compatibility. In general, the priority order is Marlin > ExLLaMAV2 > Triton, but the final choice depends on factors such as group size, bit width, packing format, hardware device, and other implementation details. For more details, please refer to [backends](https://github.com/intel/auto-round?tab=readme-ov-file#specify-backend), -The backend may not always be the most suitable for certain devices. +The backend may not always be the most suitable for certain devices. You can specify your preferred backend such as "ipex" for CPU, "ipex/triton" for XPU, "marlin/exllamav2/triton" for CUDA, according to your needs or hardware compatibility. Please note that additional corresponding libraries may be required. ```python @@ -247,7 +249,6 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=Fal - ### Convert GPTQ/AWQ to AutoRound @@ -277,7 +278,6 @@ the [transformers](https://github.com/huggingface/transformers/issues) repositor If you encounter any issues with auto-round, please open an issue on the [AutoRound](https://github.com/intel/auto-round/issues) repository. - ## Acknowledgement Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound. diff --git a/docs/source/en/quantization/awq.md b/docs/source/en/quantization/awq.md index b6437e2588a..b2cf4b9ecdf 100644 --- a/docs/source/en/quantization/awq.md +++ b/docs/source/en/quantization/awq.md @@ -25,6 +25,7 @@ Run the command below to install autoawq ```bash pip install autoawq ``` + > [!WARNING] > AutoAWQ downgrades Transformers to version 4.47.1. If you want to do inference with AutoAWQ, you may need to reinstall your Transformers' version after installing AutoAWQ. diff --git a/docs/source/en/quantization/bitsandbytes.md b/docs/source/en/quantization/bitsandbytes.md index 60c3c2dfebf..9cdbbe5af39 100644 --- a/docs/source/en/quantization/bitsandbytes.md +++ b/docs/source/en/quantization/bitsandbytes.md @@ -32,12 +32,12 @@ bitsandbytes offers two main quantization features: > **Note:** For a user-friendly quantization experience, you can use the `bitsandbytes` [community space](https://huggingface.co/spaces/bnb-community/bnb-my-repo). - Run the command below to install bitsandbytes. ```bash pip install --upgrade transformers accelerate bitsandbytes ``` + To compile from source, follow the instructions in the [bitsandbytes installation guide](https://huggingface.co/docs/bitsandbytes/main/en/installation). ## Hardware Compatibility @@ -116,6 +116,7 @@ model = AutoModelForCausalLM.from_pretrained( model.push_to_hub("bloom-560m-8bit") ``` +
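
The hunk above only shows the tail of the 8-bit example (the `push_to_hub` call), so for context, here is a minimal sketch of how such an 8-bit model is typically produced with `BitsAndBytesConfig`; the `bigscience/bloom-560m` base checkpoint is assumed from the `bloom-560m-8bit` repo name:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 8-bit with LLM.int8() quantization.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=quantization_config,
    device_map="auto",
)

# Matches the hunk above: the quantized model can then be shared on the Hub.
model.push_to_hub("bloom-560m-8bit")
```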
@@ -166,6 +167,7 @@ model = AutoModelForCausalLM.from_pretrained( model.push_to_hub("bloom-560m-4bit") ``` +
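
The 4-bit hunk ends the same way, with `push_to_hub("bloom-560m-4bit")`. A minimal sketch of the corresponding load, assuming the commonly used NF4 settings rather than whatever the unchanged lines around this hunk actually configure:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute and nested (double) quantization.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=quantization_config,
    device_map="auto",
)
```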
diff --git a/docs/source/en/quantization/compressed_tensors.md b/docs/source/en/quantization/compressed_tensors.md index a3b01a1b448..3c047d0af98 100644 --- a/docs/source/en/quantization/compressed_tensors.md +++ b/docs/source/en/quantization/compressed_tensors.md @@ -99,29 +99,29 @@ For a more detailed look at the model weights, use the [safetensors viewer](http | Tensors | Shape | Precision | | ------- | ----- | --------- | -model.layers.0.input_layernorm.weight | [4 096] | BF16 -model.layers.0.mlp.down_proj.input_scale | [1] | BF16 -model.layers.0.mlp.down_proj.weight | [4 096, 14 336] | F8_E4M3 -model.layers.0.mlp.down_proj.weight_scale | [1] | BF16 -model.layers.0.mlp.gate_proj.input_scale | [1] | BF16 -model.layers.0.mlp.gate_proj.weight | [14 336, 4 096] | F8_E4M3 -model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16 -model.layers.0.mlp.up_proj.input_scale| [1] |BF16 -model.layers.0.mlp.up_proj.weight | [14 336, 4 096] | F8_E4M3 -model.layers.0.mlp.up_proj.weight_scale | [1] | BF16 -model.layers.0.post_attention_layernorm.weight | [4 096] |BF16 +model.layers.0.input_layernorm.weight | [4 096] | BF16 +model.layers.0.mlp.down_proj.input_scale | [1] | BF16 +model.layers.0.mlp.down_proj.weight | [4 096, 14 336] | F8_E4M3 +model.layers.0.mlp.down_proj.weight_scale | [1] | BF16 +model.layers.0.mlp.gate_proj.input_scale | [1] | BF16 +model.layers.0.mlp.gate_proj.weight | [14 336, 4 096] | F8_E4M3 +model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16 +model.layers.0.mlp.up_proj.input_scale| [1] |BF16 +model.layers.0.mlp.up_proj.weight | [14 336, 4 096] | F8_E4M3 +model.layers.0.mlp.up_proj.weight_scale | [1] | BF16 +model.layers.0.post_attention_layernorm.weight | [4 096] |BF16 model.layers.0.self_attn.k_proj.input_scale | [1] | BF16 model.layers.0.self_attn.k_proj.weight | [1 024, 4 096]| F8_E4M3 -model.layers.0.self_attn.k_proj.weight_scale |[1] | BF16 +model.layers.0.self_attn.k_proj.weight_scale |[1] | BF16 model.layers.0.self_attn.o_proj.input_scale | [1] | BF16 -model.layers.0.self_attn.o_proj.weight | [4 096, 4 096] | F8_E4M3 -model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16 -model.layers.0.self_attn.q_proj.input_scale | [1] | BF16 -model.layers.0.self_attn.q_proj.weight | [4 096, 4 096] | F8_E4M3 -model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16 -model.layers.0.self_attn.v_proj.input_scale | [1] | BF16 -model.layers.0.self_attn.v_proj.weight | [1 024, 4 096] | F8_E4M3 -model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16 +model.layers.0.self_attn.o_proj.weight | [4 096, 4 096] | F8_E4M3 +model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16 +model.layers.0.self_attn.q_proj.input_scale | [1] | BF16 +model.layers.0.self_attn.q_proj.weight | [4 096, 4 096] | F8_E4M3 +model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16 +model.layers.0.self_attn.v_proj.input_scale | [1] | BF16 +model.layers.0.self_attn.v_proj.weight | [1 024, 4 096] | F8_E4M3 +model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16 When loading a compressed-tensors model with the [`~quantizers.HFQuantizer`] integration, all the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules specified in the quantization config are replaced by [CompressedLinear](https://github.com/neuralmagic/compressed-tensors/blob/975cb223b19fcac2b98a4271d17668462d4d6e1d/src/compressed_tensors/linear/compressed_linear.py#L30) modules that manage the compressed weights and forward pass for inference. The `lm_head` module is still kept as an unquantized nn.Linear module. 
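
Because the swap to `CompressedLinear` happens at load time, it can be verified directly on the instantiated model. A minimal sketch, assuming the `compressed-tensors` package is installed, a hypothetical repo id, and a Llama-style attribute path matching the tensor names in the table above:

```python
from transformers import AutoModelForCausalLM

# Hypothetical repo id; substitute any checkpoint stored in the compressed-tensors format.
model = AutoModelForCausalLM.from_pretrained("my-org/llama-3-8b-fp8-dynamic", device_map="auto")

# Quantized projections are replaced by CompressedLinear at load time...
print(type(model.model.layers[0].self_attn.q_proj).__name__)
# ...while the lm_head stays a regular nn.Linear.
print(type(model.lm_head).__name__)
```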
diff --git a/docs/source/en/quantization/concept_guide.md b/docs/source/en/quantization/concept_guide.md index ff300b9d48a..e9d3b451484 100644 --- a/docs/source/en/quantization/concept_guide.md +++ b/docs/source/en/quantization/concept_guide.md @@ -18,7 +18,6 @@ rendered properly in your Markdown viewer. Quantization reduces the memory footprint and computational cost of large machine learning models like those found in the Transformers library. It achieves this by representing the model's weights and or activations with lower-precision data types (like 8-bit integers or int8) instead of the standard 32-bit floating-point (float32). - Reducing a model's precision offers several significant benefits: - Smaller model size: Lower-precision data types require less storage space. An int8 model, for example, is roughly 4 times smaller than its float32 counterpart. @@ -46,8 +45,7 @@ The most common method is *affine quantization*. For a given float32 tensor (lik There are two main ways to perform this mapping, *symmetric* and *asymmetric*. The choice between symmetric and asymmetric quantization determines how the float32 range is mapped to the int8 range. - Symmetric: This method assumes the original float32 range is symmetric around zero ( \\([ -a, a ]\\) ). This range is mapped symmetrically to the int8 range, for example, \\([-127, 127]\\). A key characteristic is that the float32 value \\(0.0\\) maps directly to the int8 value \\(0\\). This only requires one parameter, the **scale ( \\(S\\) )**, to define the mapping. It can simplify computations, but it might be less accurate if the original data distribution isn't naturally centered around zero. -- Asymmetric (Affine): This method does not assume the data is centered around zero. It maps the exact range \\([val_{min}, val_{max}]\\) from float32 to the full int8 range, like \\([-128, 127]\\). This requires two parameters, a **scale ( \\(S\\) )** and a **zero-point ( \\(Z\\) )**. - +- Asymmetric (Affine): This method does not assume the data is centered around zero. It maps the exact range \\([val_{min}, val_{max}]\\) from float32 to the full int8 range, like \\([-128, 127]\\). This requires two parameters, a **scale ( \\(S\\) )** and a **zero-point ( \\(Z\\) )**. scale ( \\(S\\) ): A positive float32 number representing the ratio between the float32 and the int8 range. @@ -134,8 +132,7 @@ There are two main types of quantization techniques. ## Quantization in Transformers -Transformers integrates several quantization backends such as bitsandbytes, torchao, compressed-tensors, and more (refer to the quantization [overview](./overview) for more backends). - +Transformers integrates several quantization backends such as bitsandbytes, torchao, compressed-tensors, and more (refer to the quantization [overview](./overview) for more backends). All backends are unified under the [`HfQuantizer`] API and associated [`QuantizationConfig`] classes. You can integrate your own custom quantization backends by implementing a custom [`HfQuantizer`] and [`QuantizationConfig`], as shown in the [Contribution](./contribute) guide. @@ -165,7 +162,6 @@ model = AutoModelForCausalLM.from_pretrained( ) ``` - ## Resources To explore quantization and related performance optimization concepts more deeply, check out the following resources. 
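
To complement the concept guide above, here is a small, self-contained sketch of the asymmetric (affine) int8 mapping it describes, written in plain PyTorch rather than any Transformers API:

```python
import torch

def asymmetric_quantize(x, qmin=-128, qmax=127):
    # The scale maps the observed float range [min, max] onto the int8 range [qmin, qmax].
    scale = (x.max() - x.min()) / (qmax - qmin)
    # The zero-point is the integer that represents the float value 0.0.
    zero_point = (qmin - torch.round(x.min() / scale)).to(torch.int32)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Approximate reconstruction of the original float values.
    return scale * (q.to(torch.float32) - zero_point)

x = torch.randn(4, 4)
q, scale, zero_point = asymmetric_quantize(x)
print(q)                                                   # int8 values in [-128, 127]
print((x - dequantize(q, scale, zero_point)).abs().max())  # worst-case quantization error
```

The symmetric variant is the special case where the zero-point is fixed at 0 and only the scale is stored.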
diff --git a/docs/source/en/quantization/mxfp4.md b/docs/source/en/quantization/mxfp4.md index a2b9f7634c8..dd313c5555e 100644 --- a/docs/source/en/quantization/mxfp4.md +++ b/docs/source/en/quantization/mxfp4.md @@ -16,7 +16,7 @@ rendered properly in your Markdown viewer. # MXFP4 -Note: MXFP4 quantisation currently only works for OpenAI GPT-OSS 120b and 20b. +Note: MXFP4 quantisation currently only works for OpenAI GPT-OSS 120b and 20b. MXFP4 is a 4-bit floating point format that dramatically reduces the memory requirements of large models. Large models (GPT-OSS-120B) can fit on a single 80GB GPU and smaller models (GPT-OSS-20B) only require 16GB of memory. It uses blockwise scaling to preserve it's range and accuracy, which typically becomes degraded at lower precisions. @@ -25,7 +25,6 @@ To use MXPF4, make sure your hardware meets the following requirements. - Install Accelerate, kernels, and Triton ≥ 3.4. Only manually install Triton ≥ 3.4 if you're using PyTorch 2.7 because it is already supported in PyTorch 2.8. - NVIDIA GPU Compute Capability ≥ 7.5 which includes Tesla GPUs and newer. Use [get_device_capability](https://docs.pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html) to check Compute Capability. - ```python from torch import cuda cuda.get_device_capability() @@ -54,7 +53,6 @@ print(cfg.quantization_config) # } ``` - ## MXFP4 kernels Transformers automatically pulls the MXFP4-aware Triton kernels from the community repository when you load a model that needs them. The kernels are stored in your local cache and used during the forward pass. @@ -67,7 +65,6 @@ You can use [hf cache scan](https://huggingface.co/docs/huggingface_hub/en/guide hf cache scan ``` - ```shell REPO ID REPO TYPE SIZE ON DISK -------------------------------- --------- ------------ diff --git a/docs/source/en/quantization/overview.md b/docs/source/en/quantization/overview.md index ceab195b2b5..d607ae44660 100644 --- a/docs/source/en/quantization/overview.md +++ b/docs/source/en/quantization/overview.md @@ -34,7 +34,7 @@ Use the Space below to help you pick a quantization method depending on your har | [GGUF / GGML (llama.cpp)](../gguf) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🔴 | 1/8 | 🔴 | [See Notes](../gguf) | [See Notes](../gguf) | https://github.com/ggerganov/llama.cpp | | [GPTQModel](./gptq) | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/ModelCloud/GPTQModel | | [AutoGPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ | -| [HIGGS](./higgs) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute | +| [HIGGS](./higgs) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute | | [HQQ](./hqq) | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 1/8 | 🟢 | 🔴 | 🟢 | https://github.com/mobiusml/hqq/ | | [optimum-quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 2/4/8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/optimum-quanto | | [FBGEMM_FP8](./fbgemm_fp8) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM | @@ -53,7 +53,7 @@ If you are new to quantization, we recommend checking out these beginner-friendl ## User-Friendly Quantization Tools -If you are looking for a user-friendly quantization experience, you can use the following community spaces and notebooks: +If you are looking for a user-friendly quantization experience, you can use the following community spaces and notebooks: * [Bitsandbytes Space](https://huggingface.co/spaces/bnb-community/bnb-my-repo) * [GGUF 
Space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) diff --git a/docs/source/en/quantization/selecting.md b/docs/source/en/quantization/selecting.md index 7653e946dd8..69b989bca88 100644 --- a/docs/source/en/quantization/selecting.md +++ b/docs/source/en/quantization/selecting.md @@ -118,7 +118,7 @@ Consider the quantization method below during fine-tuning to save memory. Other methods offer PEFT compatibility, though bitsandbytes is the most established and straightforward path for QLoRA. -See the [bitsandbytes documentation](./bitsandbytes#qlora) and [PEFT Docs](https://huggingface.co/docs/peft/developer_guides/quantization#aqlm-quantization) for more details. +See the [bitsandbytes documentation](./bitsandbytes#qlora) and [PEFT Docs](https://huggingface.co/docs/peft/developer_guides/quantization#aqlm-quantization) for more details. ## Research diff --git a/docs/source/en/quantization/torchao.md b/docs/source/en/quantization/torchao.md index 6427866d022..8778f9f3e5e 100644 --- a/docs/source/en/quantization/torchao.md +++ b/docs/source/en/quantization/torchao.md @@ -30,7 +30,6 @@ See the table below for additional torchao features. > [!TIP] > Refer to the torchao [README.md](https://github.com/pytorch/ao#torchao-pytorch-architecture-optimization) for more details about the library. - torchao supports the [quantization techniques](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md) below. - A16W8 Float8 Dynamic Quantization @@ -43,7 +42,6 @@ torchao supports the [quantization techniques](https://github.com/pytorch/ao/blo torchao also supports module level configuration by specifying a dictionary from fully qualified name of module and its corresponding quantization config. This allows skip quantizing certain layers and using different quantization config for different modules. - Check the table below to see if your hardware is compatible. | Component | Compatibility | @@ -52,8 +50,6 @@ Check the table below to see if your hardware is compatible. | XPU Versions | ✅ pytorch2.8 | | CPU | ✅ change `device_map="cpu"` (see examples below) | - - Install torchao from PyPi or the PyTorch index with the following commands. @@ -64,13 +60,15 @@ Install torchao from PyPi or the PyTorch index with the following commands. 
# Stable release from Pypi which will default to CUDA 12.6 pip install --upgrade torchao transformers ``` + Stable Release from the PyTorch index - + ```bash pip install torchao --index-url https://download.pytorch.org/whl/cu126 # options are cpu/cu118/cu126/cu128 ``` + @@ -118,6 +116,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + @@ -146,6 +145,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + @@ -177,13 +177,14 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + ### A100 GPU - + ```py import torch from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer @@ -210,6 +211,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + @@ -245,6 +247,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + @@ -276,13 +279,14 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + ### Intel XPU - + ```py import torch from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer @@ -309,6 +313,7 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + @@ -340,14 +345,14 @@ input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + - ### CPU - + ```py import torch from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer @@ -373,6 +378,7 @@ input_ids = tokenizer(input_text, return_tensors="pt") output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + @@ -404,12 +410,14 @@ input_ids = tokenizer(input_text, return_tensors="pt") output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + ### Per Module Quantization #### 1. Skip quantization for certain layers With `ModuleFqnToConfig` we can specify a default configuration for all layers while skipping quantization for certain layers. + ```py import torch from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig @@ -438,6 +446,7 @@ print(output_text) ``` #### 2. 
Quantizing different layers with different quantization configs + ```py import torch from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig @@ -485,7 +494,6 @@ Note: autoquant is for GPU only right now. Create a [`TorchAoConfig`] and set to `"autoquant"`. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method. Finally, call `finalize_autoquant` on the quantized model to finalize the quantization and log the input shapes. - ```py import torch from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer @@ -509,7 +517,6 @@ quantized_model.finalize_autoquant() print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` - ## Serialization torchao implements [torch.Tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) for maximum flexibility in supporting new quantized torch.Tensor formats. [Safetensors](https://huggingface.co/docs/safetensors/en/index) serialization and deserialization does not work with torchao. @@ -518,15 +525,16 @@ To avoid arbitrary user code execution, torchao sets `weights_only=True` in [tor - + ```py # don't serialize model with Safetensors output_dir = "llama3-8b-int4wo-128" quantized_model.save_pretrained("llama3-8b-int4wo-128", safe_serialization=False) ``` + - + ```py # don't serialize model with Safetensors USER_ID = "your_huggingface_user_id" @@ -534,13 +542,14 @@ REPO_ID = "llama3-8b-int4wo-128" quantized_model.push_to_hub(f"{USER_ID}/llama3-8b-int4wo-128", safe_serialization=False) tokenizer.push_to_hub(f"{USER_ID}/llama3-8b-int4wo-128") ``` + - ## Loading quantized models Loading a quantized model depends on the quantization scheme. For quantization schemes, like int8 and float8, you can quantize the model on any device and also load it on any device. The example below demonstrates quantizing a model on the CPU and then loading it on CUDA or XPU. + ```py import torch from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer @@ -574,6 +583,7 @@ output = reloaded_model.generate(**input_ids, max_new_tokens=10) print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + For int4, the model can only be loaded on the same device it was quantized on because the layout is specific to the device. The example below demonstrates quantizing and loading a model on the CPU. ```py @@ -641,8 +651,6 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) > > All configuration objects accept parameters for customization (e.g., `group_size`, `scheme`, `layout`). - - ## Resources For a better sense of expected performance, view the [benchmarks](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks) for various models with CUDA and XPU backends. You can also run the code below to benchmark a model yourself. diff --git a/docs/source/en/run_scripts.md b/docs/source/en/run_scripts.md index ef32bf26ee0..594eb84b02a 100644 --- a/docs/source/en/run_scripts.md +++ b/docs/source/en/run_scripts.md @@ -52,6 +52,7 @@ Start with a smaller dataset by including the `max_train_samples`, `max_eval_sam > [!WARNING] > Not all example scripts support the `max_predict_samples` parameter. Run the command below to check whether a script supports it or not. 
+> > ```bash > examples/pytorch/summarization/run_summarization.py -h > ``` diff --git a/docs/source/en/serialization.md b/docs/source/en/serialization.md index 831f163bed1..cf9160f5b33 100644 --- a/docs/source/en/serialization.md +++ b/docs/source/en/serialization.md @@ -38,6 +38,7 @@ pip install optimum[exporters] > [!TIP] > Refer to the [Export a model to ONNX with optimum.exporters.onnx](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) guide for all available arguments or with the command below. +> > ```bash > optimum-cli export onnx --help > ``` diff --git a/docs/source/en/serving.md b/docs/source/en/serving.md index f421a284950..6237b09bb49 100644 --- a/docs/source/en/serving.md +++ b/docs/source/en/serving.md @@ -356,7 +356,6 @@ ResponseCompletedEvent(response=Response(id='resp_req_0', created_at=1754060400. - ## MCP integration The `transformers serve` server is also an MCP client, so it can interact with MCP tools in agentic use cases. This, of course, requires the use of an LLM that is designed to use tools. @@ -382,7 +381,6 @@ transformers serve \ --attn_implementation sdpa_paged ``` - ### Performance tips - Use an efficient attention backend when available: @@ -401,5 +399,3 @@ transformers serve \ - `--load_in_4bit`/`--load_in_8bit` can reduce memory footprint for LoRA setups - `--force-model ` avoids per-request model hints and helps produce stable, repeatable runs - - diff --git a/docs/source/en/tasks/audio_classification.md b/docs/source/en/tasks/audio_classification.md index 52e2f965ee2..250b980be19 100644 --- a/docs/source/en/tasks/audio_classification.md +++ b/docs/source/en/tasks/audio_classification.md @@ -210,7 +210,6 @@ At this point, only three steps remain: 2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function. 3. Call [`~Trainer.train`] to fine-tune your model. - ```py >>> training_args = TrainingArguments( ... output_dir="my_awesome_mind_model", diff --git a/docs/source/en/tasks/document_question_answering.md b/docs/source/en/tasks/document_question_answering.md index d83e025c409..902a948307f 100644 --- a/docs/source/en/tasks/document_question_answering.md +++ b/docs/source/en/tasks/document_question_answering.md @@ -439,6 +439,7 @@ Now that you have finetuned a LayoutLMv2 model, and uploaded it to the 🤗 Hub, way to try out your finetuned model for inference is to use it in a [`Pipeline`]. Let's take an example: + ```py >>> example = dataset["test"][2] >>> question = example["query"]["en"] diff --git a/docs/source/en/tasks/idefics.md b/docs/source/en/tasks/idefics.md index 3f8915f3cc9..5fef5953d5b 100644 --- a/docs/source/en/tasks/idefics.md +++ b/docs/source/en/tasks/idefics.md @@ -18,26 +18,26 @@ rendered properly in your Markdown viewer. [[open-in-colab]] -While individual tasks can be tackled by fine-tuning specialized models, an alternative approach -that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. -For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. -This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can -solve image-text tasks with a large multimodal model called IDEFICS. 
+While individual tasks can be tackled by fine-tuning specialized models, an alternative approach +that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. +For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. +This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can +solve image-text tasks with a large multimodal model called IDEFICS. -[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198), -a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image -and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, -create stories grounded in multiple images, and so on. IDEFICS comes in two variants - [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b) -and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed +[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198), +a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image +and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, +create stories grounded in multiple images, and so on. IDEFICS comes in two variants - [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b) +and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases. -This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, -being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether -this approach suits your use case better than fine-tuning specialized models for each individual task. +This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, +being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether +this approach suits your use case better than fine-tuning specialized models for each individual task. -In this guide, you'll learn how to: +In this guide, you'll learn how to: - [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model) -- Use IDEFICS for: +- Use IDEFICS for: - [Image captioning](#image-captioning) - [Prompted image captioning](#prompted-image-captioning) - [Few-shot prompting](#few-shot-prompting) @@ -47,7 +47,7 @@ In this guide, you'll learn how to: - [Run inference in batch mode](#running-inference-in-batch-mode) - [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use) -Before you begin, make sure you have all the necessary libraries installed. +Before you begin, make sure you have all the necessary libraries installed. 
```bash pip install -q bitsandbytes sentencepiece accelerate transformers @@ -59,14 +59,14 @@ To run the following examples with a non-quantized version of the model checkpoi ## Loading the model -Let's start by loading the model's 9 billion parameters checkpoint: +Let's start by loading the model's 9 billion parameters checkpoint: ```py >>> checkpoint = "HuggingFaceM4/idefics-9b" ``` -Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. -The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of +Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. +The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of preparing text and image inputs for the model. ```py @@ -79,13 +79,13 @@ preparing text and image inputs for the model. >>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, dtype=torch.bfloat16, device_map="auto") ``` -Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized +Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized manner given existing devices. ### Quantized model -If high-memory device availability is an issue, you can load the quantized version of the model. To load the model and the -processor in 4bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed +If high-memory device availability is an issue, you can load the quantized version of the model. To load the model and the +processor in 4bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed on the fly while loading. ```py @@ -109,8 +109,8 @@ on the fly while loading. Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for. ## Image captioning -Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired -people navigate through different situations, for instance, explore image content online. +Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired +people navigate through different situations, for instance, explore image content online. To illustrate the task, get an image to be captioned, e.g.: @@ -118,10 +118,10 @@ To illustrate the task, get an image to be captioned, e.g.: Image of a puppy in a flower bed
-Photo by [Hendo Wang](https://unsplash.com/@hendoo). +Photo by [Hendo Wang](https://unsplash.com/@hendoo). -IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the -model, only the preprocessed input image. Without a text prompt, the model will start generating text from the +IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the +model, only the preprocessed input image. Without a text prompt, the model will start generating text from the BOS (beginning-of-sequence) token thus creating a caption. As image input to the model, you can use either an image object (`PIL.Image`) or a url from which the image can be retrieved. @@ -142,15 +142,15 @@ A puppy in a flower bed -It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing -the `max_new_tokens`: the model will want to generate a new `` or `` token when there +It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing +the `max_new_tokens`: the model will want to generate a new `` or `` token when there is no image being generated by the model. You can set it on-the-fly as in this guide, or store in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide. ## Prompted image captioning -You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take +You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:
@@ -158,7 +158,7 @@ another image to illustrate:
Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai). - + Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs. ```py @@ -178,12 +178,12 @@ This is an image of the Eiffel Tower in Paris, France. ## Few-shot prompting -While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with +While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with other restrictions or requirements that increase task's complexity. Few-shot prompting can be used to enable in-context learning. -By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples. +By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples. -Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model -that in addition to learning what the object in an image is, we would also like to get some interesting information about it. +Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model +that in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see, if we can get the same response format for an image of the Statue of Liberty:
@@ -213,24 +213,24 @@ User: Describe this image. Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall. ``` -Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, +Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.). ## Visual question answering -Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image -captioning it can be used in accessibility applications, but also in education (reasoning about visual materials), customer +Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image +captioning it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval. -Let's get a new image for this task: +Let's get a new image for this task:
Image of a couple having a picnic
-Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos). +Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos). -You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: +You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: ```py >>> prompt = [ @@ -251,11 +251,11 @@ Instruction: Provide an answer to the question. Use the image to answer. ## Image classification -IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing -labeled examples from those specific categories. Given a list of categories and using its image and text understanding -capabilities, the model can infer which category the image likely belongs to. +IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing +labeled examples from those specific categories. Given a list of categories and using its image and text understanding +capabilities, the model can infer which category the image likely belongs to. -Say, we have this image of a vegetable stand: +Say, we have this image of a vegetable stand:
Image of a vegetable stand @@ -286,10 +286,10 @@ In the example above we instruct the model to classify the image into a single c ## Image-guided text generation -For more creative applications, you can use image-guided text generation to generate text based on an image. This can be -useful to create descriptions of products, ads, descriptions of a scene, etc. +For more creative applications, you can use image-guided text generation to generate text based on an image. This can be +useful to create descriptions of products, ads, descriptions of a scene, etc. -Let's prompt IDEFICS to write a story based on a simple image of a red door: +Let's prompt IDEFICS to write a story based on a simple image of a red door:
Image of a red door with a pumpkin on the steps @@ -333,14 +333,14 @@ Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Ha -For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help -you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) -to learn more. +For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help +you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) +to learn more. ## Running inference in batch mode -All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference +All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts: ```py @@ -375,13 +375,13 @@ This is an image of a vegetable stand. ## IDEFICS instruct for conversational use -For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: +For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: `HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`. -These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction +These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings. -The use and prompting for the conversational use is very similar to using the base models: +The use and prompting for the conversational use is very similar to using the base models: ```py >>> import torch diff --git a/docs/source/en/tasks/image_captioning.md b/docs/source/en/tasks/image_captioning.md index f9716f29a20..89c35a50b55 100644 --- a/docs/source/en/tasks/image_captioning.md +++ b/docs/source/en/tasks/image_captioning.md @@ -14,7 +14,6 @@ rendered properly in your Markdown viewer. --> - # Image captioning [[open-in-colab]] @@ -26,7 +25,7 @@ helps to improve content accessibility for people by describing images to them. This guide will show you how to: * Fine-tune an image captioning model. -* Use the fine-tuned model for inference. +* Use the fine-tuned model for inference. Before you begin, make sure you have all the necessary libraries installed: @@ -37,7 +36,6 @@ pip install jiwer -q We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: - ```python from huggingface_hub import notebook_login @@ -47,8 +45,7 @@ notebook_login() ## Load the Pokémon BLIP captions dataset Use the 🤗 Dataset library to load a dataset that consists of {image-caption} pairs. To create your own image captioning dataset -in PyTorch, you can follow [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb). - +in PyTorch, you can follow [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb). 
```python from datasets import load_dataset @@ -56,6 +53,7 @@ from datasets import load_dataset ds = load_dataset("lambdalabs/pokemon-blip-captions") ds ``` + ```bash DatasetDict({ train: Dataset({ @@ -69,21 +67,19 @@ The dataset has two features, `image` and `text`. -Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample a caption amongst the available ones during training. +Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample a caption amongst the available ones during training. Split the dataset’s train split into a train and test set with the [`~datasets.Dataset.train_test_split`] method: - ```python ds = ds["train"].train_test_split(test_size=0.1) train_ds = ds["train"] test_ds = ds["test"] ``` -Let's visualize a couple of samples from the training set. - +Let's visualize a couple of samples from the training set. ```python from textwrap import wrap @@ -106,7 +102,7 @@ sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)] sample_captions = [train_ds[i]["text"] for i in range(5)] plot_images(sample_images_to_visualize, sample_captions) ``` - +
Sample training images
@@ -115,7 +111,7 @@ plot_images(sample_images_to_visualize, sample_captions) Since the dataset has two modalities (image and text), the pre-processing pipeline will preprocess images and the captions. -To do so, load the processor class associated with the model you are about to fine-tune. +To do so, load the processor class associated with the model you are about to fine-tune. ```python from transformers import AutoProcessor @@ -124,7 +120,7 @@ checkpoint = "microsoft/git-base" processor = AutoProcessor.from_pretrained(checkpoint) ``` -The processor will internally pre-process the image (which includes resizing, and pixel scaling) and tokenize the caption. +The processor will internally pre-process the image (which includes resizing, and pixel scaling) and tokenize the caption. ```python def transforms(example_batch): @@ -139,13 +135,12 @@ train_ds.set_transform(transforms) test_ds.set_transform(transforms) ``` -With the dataset ready, you can now set up the model for fine-tuning. +With the dataset ready, you can now set up the model for fine-tuning. ## Load a base model Load the ["microsoft/git-base"](https://huggingface.co/microsoft/git-base) into a [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) object. - ```python from transformers import AutoModelForCausalLM @@ -154,10 +149,9 @@ model = AutoModelForCausalLM.from_pretrained(checkpoint) ## Evaluate -Image captioning models are typically evaluated with the [Rouge Score](https://huggingface.co/spaces/evaluate-metric/rouge) or [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer). For this guide, you will use the Word Error Rate (WER). - -We use the 🤗 Evaluate library to do so. For potential limitations and other gotchas of the WER, refer to [this guide](https://huggingface.co/spaces/evaluate-metric/wer). +Image captioning models are typically evaluated with the [Rouge Score](https://huggingface.co/spaces/evaluate-metric/rouge) or [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer). For this guide, you will use the Word Error Rate (WER). +We use the 🤗 Evaluate library to do so. For potential limitations and other gotchas of the WER, refer to [this guide](https://huggingface.co/spaces/evaluate-metric/wer). ```python from evaluate import load @@ -177,11 +171,10 @@ def compute_metrics(eval_pred): ## Train! -Now, you are ready to start fine-tuning the model. You will use the 🤗 [`Trainer`] for this. +Now, you are ready to start fine-tuning the model. You will use the 🤗 [`Trainer`] for this. First, define the training arguments using [`TrainingArguments`]. - ```python from transformers import TrainingArguments, Trainer @@ -208,7 +201,7 @@ training_args = TrainingArguments( ) ``` -Then pass them along with the datasets and the model to 🤗 Trainer. +Then pass them along with the datasets and the model to 🤗 Trainer. ```python trainer = Trainer( @@ -222,7 +215,7 @@ trainer = Trainer( To start training, simply call [`~Trainer.train`] on the [`Trainer`] object. -```python +```python trainer.train() ``` @@ -230,7 +223,6 @@ You should see the training loss drop smoothly as training progresses. Once training is completed, share your model to the Hub with the [`~Trainer.push_to_hub`] method so everyone can use your model: - ```python trainer.push_to_hub() ``` @@ -239,7 +231,6 @@ trainer.push_to_hub() Take a sample image from `test_ds` to test the model. - ```python from PIL import Image import requests @@ -252,7 +243,7 @@ image
Test image
- + Prepare image for the model. ```python @@ -263,13 +254,14 @@ inputs = processor(images=image, return_tensors="pt").to(device) pixel_values = inputs.pixel_values ``` -Call [`generate`] and decode the predictions. +Call [`generate`] and decode the predictions. ```python generated_ids = model.generate(pixel_values=pixel_values, max_length=50) generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] print(generated_caption) ``` + ```bash a drawing of a pink and blue pokemon ``` diff --git a/docs/source/en/tasks/image_classification.md b/docs/source/en/tasks/image_classification.md index 39b013f129c..4754a91bd48 100644 --- a/docs/source/en/tasks/image_classification.md +++ b/docs/source/en/tasks/image_classification.md @@ -175,7 +175,6 @@ Your `compute_metrics` function is ready to go now, and you'll return to it when ## Train - If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! @@ -238,7 +237,6 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - For a more in-depth example of how to finetune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). diff --git a/docs/source/en/tasks/image_feature_extraction.md b/docs/source/en/tasks/image_feature_extraction.md index 455a2b425d4..e08ba89e4dd 100644 --- a/docs/source/en/tasks/image_feature_extraction.md +++ b/docs/source/en/tasks/image_feature_extraction.md @@ -27,7 +27,7 @@ In this guide, you will: ## Image Similarity using `image-feature-extraction` Pipeline -We have two images of cats sitting on top of fish nets, one of them is generated. +We have two images of cats sitting on top of fish nets, one of them is generated. ```python from PIL import Image @@ -66,7 +66,7 @@ print(outputs) # [[[-0.03909236937761307, 0.43381670117378235, -0.06913255900144577, ``` -To get the similarity score, we need to pass them to a similarity function. +To get the similarity score, we need to pass them to a similarity function. ```python from torch.nn.functional import cosine_similarity @@ -131,4 +131,3 @@ print(similarity_score) # tensor([0.6061], device='cuda:0', grad_fn=) ``` - diff --git a/docs/source/en/tasks/image_text_to_text.md b/docs/source/en/tasks/image_text_to_text.md index b34f4edf90f..5412882b59f 100644 --- a/docs/source/en/tasks/image_text_to_text.md +++ b/docs/source/en/tasks/image_text_to_text.md @@ -63,7 +63,6 @@ The image inputs look like the following. A bee on a pink flower
- ```python from PIL import Image import requests @@ -76,7 +75,6 @@ images = [Image.open(requests.get(img_urls[0], stream=True).raw), Below is an example of the chat template. We can feed conversation turns and the last message as an input by appending it at the end of the template. - ```python messages = [ { @@ -207,7 +205,6 @@ We can use [text streaming](./generation_strategies#streaming) for a better gene Assume we have an application that keeps chat history and takes in the new user input. We will preprocess the inputs as usual and initialize [`TextIteratorStreamer`] to handle the generation in a separate thread. This allows you to stream the generated text tokens in real-time. Any generation arguments can be passed to [`TextIteratorStreamer`]. - ```python import time from transformers import TextIteratorStreamer diff --git a/docs/source/en/tasks/image_to_image.md b/docs/source/en/tasks/image_to_image.md index da6a57ac9aa..6c4cdf585f0 100644 --- a/docs/source/en/tasks/image_to_image.md +++ b/docs/source/en/tasks/image_to_image.md @@ -18,7 +18,7 @@ rendered properly in your Markdown viewer. [[open-in-colab]] -Image-to-Image task is the task where an application receives an image and outputs another image. This has various subtasks, including image enhancement (super resolution, low light enhancement, deraining and so on), image inpainting, and more. +Image-to-Image task is the task where an application receives an image and outputs another image. This has various subtasks, including image enhancement (super resolution, low light enhancement, deraining and so on), image inpainting, and more. This guide will show you how to: - Use an image-to-image pipeline for super resolution task, @@ -32,7 +32,7 @@ Let's begin by installing the necessary libraries. pip install transformers ``` -We can now initialize the pipeline with a [Swin2SR model](https://huggingface.co/caidas/swin2SR-lightweight-x2-64). We can then infer with the pipeline by calling it with an image. As of now, only [Swin2SR models](https://huggingface.co/models?sort=trending&search=swin2sr) are supported in this pipeline. +We can now initialize the pipeline with a [Swin2SR model](https://huggingface.co/caidas/swin2SR-lightweight-x2-64). We can then infer with the pipeline by calling it with an image. As of now, only [Swin2SR models](https://huggingface.co/models?sort=trending&search=swin2sr) are supported in this pipeline. ```python from transformers import pipeline, infer_device @@ -53,19 +53,22 @@ image = Image.open(requests.get(url, stream=True).raw) print(image.size) ``` + ```bash # (532, 432) ``` +
Photo of a cat
-We can now do inference with the pipeline. We will get an upscaled version of the cat image. +We can now do inference with the pipeline. We will get an upscaled version of the cat image. ```python upscaled = pipe(image) print(upscaled.size) ``` + ```bash # (1072, 880) ``` @@ -79,7 +82,7 @@ model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-lightweig processor = Swin2SRImageProcessor("caidas/swin2SR-lightweight-x2-64") ``` -`pipeline` abstracts away the preprocessing and postprocessing steps that we have to do ourselves, so let's preprocess the image. We will pass the image to the processor and then move the pixel values to GPU. +`pipeline` abstracts away the preprocessing and postprocessing steps that we have to do ourselves, so let's preprocess the image. We will pass the image to the processor and then move the pixel values to GPU. ```python pixel_values = processor(image, return_tensors="pt").pixel_values @@ -96,7 +99,8 @@ import torch with torch.no_grad(): outputs = model(pixel_values) ``` -Output is an object of type `ImageSuperResolutionOutput` that looks like below 👇 + +Output is an object of type `ImageSuperResolutionOutput` that looks like below 👇 ``` (loss=None, reconstruction=tensor([[[[0.8270, 0.8269, 0.8275, ..., 0.7463, 0.7446, 0.7453], @@ -108,6 +112,7 @@ Output is an object of type `ImageSuperResolutionOutput` that looks like below [0.5927, 0.5914, 0.5922, ..., 0.0664, 0.0694, 0.0718]]]], device='cuda:0'), hidden_states=None, attentions=None) ``` + We need to get the `reconstruction` and post-process it for visualization. Let's see how it looks like. ```python @@ -128,6 +133,7 @@ output = np.moveaxis(output, source=0, destination=-1) output = (output * 255.0).round().astype(np.uint8) Image.fromarray(output) ``` +
Upscaled photo of a cat
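Putting the manual steps above together, here is a minimal sketch of a reusable helper. The function name and the `device` argument are ours; the checkpoint and post-processing mirror the guide, so treat it as illustrative rather than canonical.

```python
import numpy as np
import torch
from PIL import Image
from transformers import Swin2SRForImageSuperResolution, Swin2SRImageProcessor


def upscale(image: Image.Image, device: str = "cuda") -> Image.Image:
    # Same Swin2SR checkpoint used throughout this guide
    model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-lightweight-x2-64").to(device)
    processor = Swin2SRImageProcessor.from_pretrained("caidas/swin2SR-lightweight-x2-64")

    # Preprocess, run the model, and post-process back to an 8-bit RGB image
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        outputs = model(pixel_values)
    output = outputs.reconstruction.data.squeeze().float().cpu().clamp_(0, 1).numpy()
    output = np.moveaxis(output, source=0, destination=-1)
    return Image.fromarray((output * 255.0).round().astype(np.uint8))
```

Calling `upscale(image)` on the cat photo should return roughly the same 2x-larger image as the step-by-step version above.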
diff --git a/docs/source/en/tasks/keypoint_detection.md b/docs/source/en/tasks/keypoint_detection.md index 3a5871d01a2..c850c67ae15 100644 --- a/docs/source/en/tasks/keypoint_detection.md +++ b/docs/source/en/tasks/keypoint_detection.md @@ -18,7 +18,7 @@ rendered properly in your Markdown viewer. [[open-in-colab]] -Keypoint detection identifies and locates specific points of interest within an image. These keypoints, also known as landmarks, represent meaningful features of objects, such as facial features or object parts. These models take an image input and return the following outputs: +Keypoint detection identifies and locates specific points of interest within an image. These keypoints, also known as landmarks, represent meaningful features of objects, such as facial features or object parts. These models take an image input and return the following outputs: - **Keypoints and Scores**: Points of interest and their confidence scores. - **Descriptors**: A representation of the image region surrounding each keypoint, capturing its texture, gradient, orientation and other properties. @@ -36,15 +36,14 @@ model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/sup Let's test the model on the images below.
- Bee - Cats
- ```python import torch from PIL import Image @@ -93,7 +92,7 @@ image_sizes = [(image.size[1], image.size[0]) for image in images] outputs = processor.post_process_keypoint_detection(outputs, image_sizes) ``` -The outputs are now a list of dictionaries where each dictionary is a processed output of keypoints, scores and descriptors. +The outputs are now a list of dictionaries where each dictionary is a processed output of keypoints, scores and descriptors. ```python [{'keypoints': tensor([[ 226, 57], @@ -144,11 +143,10 @@ for i in range(len(images)): Below you can see the outputs.
- Bee - Cats
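As a small illustration of that output structure, here is a sketch that keeps only high-confidence keypoints from the post-processed `outputs`; the 0.1 threshold is an arbitrary choice, not a value from the guide.

```python
# `outputs` is the list returned by post_process_keypoint_detection above;
# each entry holds "keypoints", "scores" and "descriptors" tensors for one image
for image, output in zip(images, outputs):
    keep = output["scores"] > 0.1  # arbitrary confidence threshold
    confident_keypoints = output["keypoints"][keep]
    confident_descriptors = output["descriptors"][keep]
    print(f"{confident_keypoints.shape[0]} of {len(output['scores'])} keypoints kept for an image of size {image.size}")
```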
- diff --git a/docs/source/en/tasks/keypoint_matching.md b/docs/source/en/tasks/keypoint_matching.md index f7065f31521..aff16a937d7 100644 --- a/docs/source/en/tasks/keypoint_matching.md +++ b/docs/source/en/tasks/keypoint_matching.md @@ -34,15 +34,15 @@ model = AutoModelForKeypointMatching.from_pretrained("zju-community/matchanythin Load two images that have the same object of interest. The second photo was taken a second later; its colors are edited, and it is further cropped and rotated.
- Bee - Bee edited
-```python +```python from transformers.image_utils import load_image image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg") image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee_edited.jpg") @@ -82,16 +82,16 @@ Here's the outputs. [1521, 2560]], dtype=torch.int32), 'matching_scores': tensor([0.2189, 0.2073, 0.2414, ... ])}] -``` +``` We have trimmed the output but there's 401 matches! ```python len(outputs[0]["keypoints0"]) # 401 -``` +``` -We can visualize them using the processor's [`~EfficientLoFTRImageProcessor.visualize_keypoint_matching`] method. +We can visualize them using the processor's [`~EfficientLoFTRImageProcessor.visualize_keypoint_matching`] method. ```python plot_images = processor.visualize_keypoint_matching(images, outputs) @@ -100,7 +100,7 @@ plot_images ![Matched Image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/matched_bees.png) -Optionally, you can use the [`Pipeline`] API and set the task to `keypoint-matching`. +Optionally, you can use the [`Pipeline`] API and set the task to `keypoint-matching`. ```python from transformers import pipeline diff --git a/docs/source/en/tasks/knowledge_distillation_for_image_classification.md b/docs/source/en/tasks/knowledge_distillation_for_image_classification.md index 7c4a684d3c0..d4b3dd8511d 100644 --- a/docs/source/en/tasks/knowledge_distillation_for_image_classification.md +++ b/docs/source/en/tasks/knowledge_distillation_for_image_classification.md @@ -52,7 +52,6 @@ processed_datasets = dataset.map(process, batched=True) Essentially, we want the student model (a randomly initialized MobileNet) to mimic the teacher model (fine-tuned vision transformer). To achieve this, we first get the logits output from the teacher and the student. Then, we divide each of them by the parameter `temperature` which controls the importance of each soft target. A parameter called `lambda` weighs the importance of the distillation loss. In this example, we will use `temperature=5` and `lambda=0.5`. We will use the Kullback-Leibler Divergence loss to compute the divergence between the student and teacher. Given two data P and Q, KL Divergence explains how much extra information we need to represent P using Q. If two are identical, their KL divergence is zero, as there's no other information needed to explain P from Q. Thus, in the context of knowledge distillation, KL divergence is useful. - ```python from transformers import TrainingArguments, Trainer, infer_device import torch diff --git a/docs/source/en/tasks/mask_generation.md b/docs/source/en/tasks/mask_generation.md index 5f66e68c245..06ba26ea123 100644 --- a/docs/source/en/tasks/mask_generation.md +++ b/docs/source/en/tasks/mask_generation.md @@ -16,22 +16,22 @@ rendered properly in your Markdown viewer. # Mask Generation -Mask generation is the task of generating semantically meaningful masks for an image. -This task is very similar to [image segmentation](semantic_segmentation), but many differences exist. Image segmentation models are trained on labeled datasets and are limited to the classes they have seen during training; they return a set of masks and corresponding classes, given an image. +Mask generation is the task of generating semantically meaningful masks for an image. +This task is very similar to [image segmentation](semantic_segmentation), but many differences exist. 
Image segmentation models are trained on labeled datasets and are limited to the classes they have seen during training; they return a set of masks and corresponding classes, given an image. -Mask generation models are trained on large amounts of data and operate in two modes. -- Prompting mode: In this mode, the model takes in an image and a prompt, where a prompt can be a 2D point location (XY coordinates) in the image within an object or a bounding box surrounding an object. In prompting mode, the model only returns the mask over the object -that the prompt is pointing out. -- Segment Everything mode: In segment everything, given an image, the model generates every mask in the image. To do so, a grid of points is generated and overlaid on the image for inference. +Mask generation models are trained on large amounts of data and operate in two modes. +- Prompting mode: In this mode, the model takes in an image and a prompt, where a prompt can be a 2D point location (XY coordinates) in the image within an object or a bounding box surrounding an object. In prompting mode, the model only returns the mask over the object +that the prompt is pointing out. +- Segment Everything mode: In segment everything, given an image, the model generates every mask in the image. To do so, a grid of points is generated and overlaid on the image for inference. -Mask generation task is supported by [Segment Anything Model (SAM)](model_doc/sam). It's a powerful model that consists of a Vision Transformer-based image encoder, a prompt encoder, and a two-way transformer mask decoder. Images and prompts are encoded, and the decoder takes these embeddings and generates valid masks. +Mask generation task is supported by [Segment Anything Model (SAM)](model_doc/sam). It's a powerful model that consists of a Vision Transformer-based image encoder, a prompt encoder, and a two-way transformer mask decoder. Images and prompts are encoded, and the decoder takes these embeddings and generates valid masks.
SAM Architecture
-SAM serves as a powerful foundation model for segmentation as it has large data coverage. It is trained on -[SA-1B](https://ai.meta.com/datasets/segment-anything/), a dataset with 1 million images and 1.1 billion masks. +SAM serves as a powerful foundation model for segmentation as it has large data coverage. It is trained on +[SA-1B](https://ai.meta.com/datasets/segment-anything/), a dataset with 1 million images and 1.1 billion masks. In this guide, you will learn how to: - Infer in segment everything mode with batching, @@ -114,7 +114,6 @@ Below is the original image in grayscale with colorful maps overlaid. Very impre Visualized
- ## Model Inference ### Point Prompting @@ -132,7 +131,7 @@ processor = SamProcessor.from_pretrained("facebook/sam-vit-base") To do point prompting, pass the input point to the processor, then take the processor output and pass it to the model for inference. To post-process the model output, pass the outputs and -`original_sizes` and `reshaped_input_sizes` we take from the processor's initial output. We need to pass these +`original_sizes` and `reshaped_input_sizes` we take from the processor's initial output. We need to pass these since the processor resizes the image, and the output needs to be extrapolated. ```python @@ -143,6 +142,7 @@ with torch.no_grad(): outputs = model(**inputs) masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()) ``` + We can visualize the three masks in the `masks` output. ```python @@ -177,10 +177,9 @@ plt.show() ### Box Prompting You can also do box prompting in a similar fashion to point prompting. You can simply pass the input box in the format of a list -`[x_min, y_min, x_max, y_max]` format along with the image to the `processor`. Take the processor output and directly pass it +`[x_min, y_min, x_max, y_max]` format along with the image to the `processor`. Take the processor output and directly pass it to the model, then post-process the output again. - ```python # bounding box around the bee box = [2350, 1600, 2850, 2100] @@ -219,7 +218,7 @@ plt.show() Visualized Bbox
-You can see the inference output below. +You can see the inference output below. ```python fig, ax = plt.subplots() @@ -233,4 +232,3 @@ plt.show()
Visualized Inference
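As a rough sketch building on the point-prompting example above (the tensor shapes are assumed, so treat the indexing as illustrative), you can also let the model choose among its three candidate masks using the predicted IoU scores it returns.

```python
# SAM returns three candidate masks per prompt along with predicted IoU scores;
# pick the candidate the model itself rates highest
scores = outputs.iou_scores.squeeze()     # expected shape: (3,) for a single point prompt
best_idx = scores.argmax().item()
best_mask = masks[0].squeeze()[best_idx]  # masks[0] holds the post-processed masks for our image
print(f"Best mask index: {best_idx}, predicted IoU: {scores[best_idx]:.3f}")
```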
- diff --git a/docs/source/en/tasks/monocular_depth_estimation.md b/docs/source/en/tasks/monocular_depth_estimation.md index c90abce1cd5..aef9bd22c4d 100644 --- a/docs/source/en/tasks/monocular_depth_estimation.md +++ b/docs/source/en/tasks/monocular_depth_estimation.md @@ -23,7 +23,7 @@ a single camera viewpoint. Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving, and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions, -occlusion, and texture. +occlusion, and texture. There are two main depth estimation categories: @@ -143,7 +143,7 @@ Let's post-process the results to remove any padding and resize the depth map to

In the original implementation, the ZoeDepth model performs inference on both the original and the flipped image and averages the results. The post_process_depth_estimation function can handle this for us by passing the flipped outputs to the optional outputs_flipped argument:

-
>>> with torch.no_grad():   
+
>>> with torch.no_grad():
 ...     outputs = model(pixel_values)
 ...     outputs_flipped = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))
 >>> post_processed_output = image_processor.post_process_depth_estimation(
diff --git a/docs/source/en/tasks/multiple_choice.md b/docs/source/en/tasks/multiple_choice.md
index 3f4c9d4637f..d35f108ecce 100644
--- a/docs/source/en/tasks/multiple_choice.md
+++ b/docs/source/en/tasks/multiple_choice.md
@@ -113,6 +113,7 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [
 ```
 
 To create a batch of examples, it's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. [`DataCollatorForMultipleChoice`] flattens all the model inputs, applies padding, and then unflattens the results.
+
 ```py
 >>> from transformers import DataCollatorForMultipleChoice
 >>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
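To make the "flatten, pad, unflatten" behaviour concrete, here is a sketch of calling the collator on two tokenized examples; `tokenized_swag` is assumed to be the tokenized dataset from earlier in that guide, and the printed shape is indicative.

```python
# Each example carries one sequence per answer candidate; the collator pads them all
# to the longest length in the batch and returns (batch_size, num_choices, seq_len) tensors
examples = [
    {k: tokenized_swag["train"][i][k] for k in ("input_ids", "attention_mask", "label")}
    for i in range(2)
]
batch = collator(examples)
print(batch["input_ids"].shape)  # e.g. torch.Size([2, 4, <longest length in this batch>])
```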
@@ -197,7 +198,6 @@ Once training is completed, share your model to the Hub with the [`~transformers
 >>> trainer.push_to_hub()
 ```
 
-
 
 
 For a more in-depth example of how to finetune a model for multiple choice, take a look at the corresponding
diff --git a/docs/source/en/tasks/object_detection.md b/docs/source/en/tasks/object_detection.md
index 394e77104b7..093644b662f 100644
--- a/docs/source/en/tasks/object_detection.md
+++ b/docs/source/en/tasks/object_detection.md
@@ -171,11 +171,11 @@ To get an even better understanding of the data, visualize an example in the dat
 
 >>> image
 ```
+
 
CPPE-5 Image Example
- To visualize the bounding boxes with associated labels, you can get the labels from the dataset's metadata, specifically the `category` field. You'll also want to create dictionaries that map a label id to a label class (`id2label`) and the other way around (`label2id`). @@ -576,6 +576,7 @@ Finally, bring everything together, and call [`~transformers.Trainer.train`]: >>> trainer.train() ``` +
@@ -1487,6 +1488,7 @@ Now that you have finetuned a model, evaluated it, and uploaded it to the Huggin ``` Load model and image processor from the Hugging Face Hub (skip to use already trained in this session): + ```py >>> from transformers import infer_device diff --git a/docs/source/en/tasks/prompting.md b/docs/source/en/tasks/prompting.md index eb8e61d67aa..2d115d4e544 100644 --- a/docs/source/en/tasks/prompting.md +++ b/docs/source/en/tasks/prompting.md @@ -127,7 +127,6 @@ for output in outputs: print(f"Result: {output['generated_text']}") ``` - While the basic few-shot prompting approach embedded examples within a single text string, the chat template format offers the following benefits. - The model may have a potentially improved understanding because it can better recognize the pattern and the expected roles of user input and assistant output. diff --git a/docs/source/en/tasks/semantic_segmentation.md b/docs/source/en/tasks/semantic_segmentation.md index 5d3c8e70aa1..08d68047dc6 100644 --- a/docs/source/en/tasks/semantic_segmentation.md +++ b/docs/source/en/tasks/semantic_segmentation.md @@ -69,6 +69,7 @@ results ``` The segmentation pipeline output includes a mask for every predicted class. + ```bash [{'score': None, 'label': 'road', @@ -107,6 +108,7 @@ Taking a look at the mask for the car class, we can see every car is classified ```python results[-1]["mask"] ``` +
Semantic Segmentation Output
@@ -135,11 +137,13 @@ As you can see below, there are multiple cars classified, and there's no classif 'label': 'person', 'mask': }] ``` + Checking out one of the car masks below. ```python results[2]["mask"] ``` +
Semantic Segmentation Output
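Since each returned mask is a PIL image, a quick sketch like the following can overlay it on the original photo for inspection; `image` is assumed to be the street scene loaded earlier in that guide, and the red tint is an arbitrary choice.

```python
from PIL import Image

# Paint the car mask shown above in red on top of the original image
mask = results[2]["mask"].convert("L")
red = Image.new("RGB", image.size, (255, 0, 0))
overlay = Image.composite(red, image, mask)
overlay  # displays inline in a notebook; use overlay.save("overlay.png") otherwise
```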
@@ -151,6 +155,7 @@ panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swi results = panoptic_segmentation(image) results ``` + As you can see below, we have more classes. We will later illustrate to see that every pixel is classified into one of the classes. ```bash @@ -206,7 +211,6 @@ To see all architectures and checkpoints compatible with this task, we recommend - ### Load SceneParse150 dataset Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset. @@ -473,7 +477,6 @@ Reload the dataset and load an image for inference. Image of bedroom
- We will now see how to infer without a pipeline. Process the image with an image processor and place the `pixel_values` on a GPU: ```py @@ -503,7 +506,6 @@ Next, rescale the logits to the original image size: >>> pred_seg = upsampled_logits.argmax(dim=1)[0] ``` - To visualize the results, load the [dataset color palette](https://github.com/tensorflow/models/blob/3f1ca33afe3c1631b733ea7e40c294273b9e406d/research/deeplab/utils/get_dataset_colormap.py#L51) as `ade_palette()` that maps each class to their RGB values. ```py diff --git a/docs/source/en/tasks/summarization.md b/docs/source/en/tasks/summarization.md index c57097421fb..b2f2beebc80 100644 --- a/docs/source/en/tasks/summarization.md +++ b/docs/source/en/tasks/summarization.md @@ -213,7 +213,6 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - For a more in-depth example of how to finetune a model for summarization, take a look at the corresponding diff --git a/docs/source/en/tasks/token_classification.md b/docs/source/en/tasks/token_classification.md index 49b0fcf216b..5096298affd 100644 --- a/docs/source/en/tasks/token_classification.md +++ b/docs/source/en/tasks/token_classification.md @@ -242,7 +242,6 @@ Before you start training your model, create a map of the expected ids to their ... } ``` - If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#train-with-pytorch-trainer)! @@ -298,7 +297,6 @@ Once training is completed, share your model to the Hub with the [`~transformers >>> trainer.push_to_hub() ``` - For a more in-depth example of how to finetune a model for token classification, take a look at the corresponding diff --git a/docs/source/en/tasks/video_classification.md b/docs/source/en/tasks/video_classification.md index b387a8320df..bae638bd84e 100644 --- a/docs/source/en/tasks/video_classification.md +++ b/docs/source/en/tasks/video_classification.md @@ -363,7 +363,6 @@ Leverage [`Trainer`](https://huggingface.co/docs/transformers/main_classes/train Most of the training arguments are self-explanatory, but one that is quite important here is `remove_unused_columns=False`. This one will drop any features not used by the model's call function. By default it's `True` because usually it's ideal to drop unused feature columns, making it easier to unpack inputs into the model's call function. But, in this case, you need the unused features ('video' in particular) in order to create `pixel_values` (which is a mandatory key our model expects in its inputs). - ```py >>> from transformers import TrainingArguments, Trainer @@ -477,7 +476,6 @@ The simplest way to try out your fine-tuned model for inference is to use it in You can also manually replicate the results of the `pipeline` if you'd like. - ```py >>> def run_inference(model, video): ... # (num_frames, num_channels, height, width) diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md index 0e0191af588..b0f698f039e 100644 --- a/docs/source/en/tasks/video_text_to_text.md +++ b/docs/source/en/tasks/video_text_to_text.md @@ -18,9 +18,9 @@ rendered properly in your Markdown viewer. [[open-in-colab]] -Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning. 
+Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning. -These models have nearly the same architecture as [image-text-to-text](../image_text_to_text) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `
Pass the image and the candidate object labels to look for to the pipeline. -Here we pass the image directly; other suitable options include a local path to an image or an image url. We also pass text descriptions for all items we want to query the image for. +Here we pass the image directly; other suitable options include a local path to an image or an image url. We also pass text descriptions for all items we want to query the image for. ```py >>> predictions = detector( diff --git a/docs/source/en/testing.md b/docs/source/en/testing.md index 497c6b01931..78c32a58097 100644 --- a/docs/source/en/testing.md +++ b/docs/source/en/testing.md @@ -16,7 +16,6 @@ rendered properly in your Markdown viewer. # Testing - Let's take a look at how 🤗 Transformers models are tested and how you can write new tests and improve the existing ones. There are 2 test suites in the repository: @@ -51,12 +50,8 @@ RUN_SLOW=1 pytest examples/ The results can be observed [here](https://github.com/huggingface/transformers/actions). - - ## Running tests - - ### Choosing which tests to run This document goes into many details of how tests can be run. If after reading everything, you need even more details @@ -89,8 +84,6 @@ which tells pytest to: - do not capture output - run in verbose mode - - ### Getting the list of all tests All tests of the test suite: @@ -187,7 +180,6 @@ Sometimes you need to run `accelerate` tests on your models. For that you can ju RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py ``` - ### Run documentation tests In order to test whether the documentation examples are correct, you should check that the `doctests` are passing. @@ -217,9 +209,11 @@ Example: ``` Just run the following line to automatically test every docstring example in the desired file: + ```bash pytest --doctest-modules ``` + If the file has a markdown extension, you should add the `--doctest-glob="*.md"` argument. ### Run only modified tests @@ -271,7 +265,6 @@ directory. [pytest-watch](https://github.com/joeyespo/pytest-watch) is an alternative implementation of this functionality. - ### Skip a test module If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For @@ -307,7 +300,6 @@ It's good to repeat the tests several times, in sequence, randomly, or in sets, inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect some problems that get uncovered by randomness of DL. - #### Repeat tests - [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder): @@ -403,8 +395,6 @@ pytest -p no:sugar or uninstall it. 
- - #### Report each sub-test name and its progress For a single or a group of tests via `pytest` (after `pip install pytest-pspec`): @@ -457,7 +447,6 @@ decorators are used to set the requirements of tests CPU/GPU/XPU/TPU-wise: Let's depict the GPU requirements in the following table: - | n gpus | decorator | |--------|--------------------------------| | `>= 0` | `@require_torch` | @@ -466,7 +455,6 @@ Let's depict the GPU requirements in the following table: | `< 2` | `@require_torch_non_multi_gpu` | | `< 3` | `@require_torch_up_to_2_gpus` | - For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed: ```python no-style @@ -520,6 +508,7 @@ Certain devices will require an additional import after importing `torch` for th ```bash TRANSFORMERS_TEST_BACKEND="torch_npu" pytest tests/utils/test_logging.py ``` + Alternative backends may also require the replacement of device-specific functions. For example `torch.cuda.manual_seed` may need to be replaced with a device-specific seed setter like `torch.npu.manual_seed` or `torch.xpu.manual_seed` to correctly set a random seed on the device. To specify a new backend with backend-specific device functions when running the test suite, create a Python device specification file `spec.py` in the format: ```python @@ -536,6 +525,7 @@ MANUAL_SEED_FN = torch.npu.manual_seed EMPTY_CACHE_FN = torch.npu.empty_cache DEVICE_COUNT_FN = torch.npu.device_count ``` + This format also allows for specification of any additional imports required. To use this file to replace equivalent methods in the test suite, set the environment variable `TRANSFORMERS_TEST_DEVICE_SPEC` to the path of the spec file, e.g. `TRANSFORMERS_TEST_DEVICE_SPEC=spec.py`. Currently, only `MANUAL_SEED_FN`, `EMPTY_CACHE_FN` and `DEVICE_COUNT_FN` are supported for device-specific dispatch. @@ -610,7 +600,6 @@ You can read [here](https://docs.pytest.org/en/stable/unittest.html) which featu thing to remember is that most `pytest` fixtures don't work. Neither parametrization, but we use the module `parameterized` that works in a similar way. - ### Parametrization Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within @@ -719,8 +708,6 @@ pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[i as in the previous example. - - ### Files and directories In tests often we need to know where things are relative to the current test file, and it's not trivial since the test @@ -843,7 +830,6 @@ otherwise. If you need to temporary override `sys.path` to import from another test for example, you can use the `ExtendSysPath` context manager. Example: - ```python import os from transformers.testing_utils import ExtendSysPath @@ -893,7 +879,6 @@ or the `xfail` way: def test_feature_x(): ``` - Here's how to skip a test based on internal checks within the test: ```python @@ -1018,7 +1003,6 @@ That report is also useful to find slow outliers that aren't marked as such, or If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest tests. - ### Testing the stdout/stderr output In order to test functions that write to `stdout` and/or `stderr`, the test can access those streams using the @@ -1141,7 +1125,6 @@ print(cs.err, cs.out) Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit from the context. 
- ### Capturing logger stream If you need to validate the output of a logger, you can use `CaptureLogger`: @@ -1193,7 +1176,6 @@ called if anything. This helper method creates a copy of the `os.environ` object, so the original remains intact. - ### Getting reproducible results In some situations you may want to remove randomness for your tests. To get identical reproducible results set, you @@ -1241,9 +1223,6 @@ To trigger a self-push workflow CI job, you must: 4. Then you can see the job appear [here](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). It may not run right away if there is a backlog. - - - ## Testing Experimental CI Features Testing CI features can be potentially problematic as it can interfere with the normal CI functioning. Therefore if a diff --git a/docs/source/en/tiny_agents.md b/docs/source/en/tiny_agents.md index dc53d05a4bf..7266f0236a6 100644 --- a/docs/source/en/tiny_agents.md +++ b/docs/source/en/tiny_agents.md @@ -42,4 +42,3 @@ Image URL: https://evalstate-flux1-schnell.hf.space/gradio_api/file=/tmp/gradio/ I have generated an image of a cat on the moon using the Flux 1 Schnell Image Generator. The image is 1024x1024 pixels and was created with 4 inference steps. Let me know if you would like to make any changes or need further assistance! ``` - diff --git a/docs/source/en/trainer.md b/docs/source/en/trainer.md index 48325da6893..32f14bc41da 100644 --- a/docs/source/en/trainer.md +++ b/docs/source/en/trainer.md @@ -346,7 +346,6 @@ use_cpu: false - Run [accelerate_launch](https://hf.co/docs/accelerate/package_reference/cli#accelerate-launch) to start training with the configurations set in `config_file.yaml`. This file is saved to the Accelerate cache folder and automatically loaded when you run `accelerate_launch`. The example below launches the [run_glue.py](../../../examples/pytorch/text-classification/run_glue) script with the FSDP configuration shown earlier. Parameters from the `config_file.yaml` file can also be directly set in the command line. diff --git a/docs/source/en/training.md b/docs/source/en/training.md index ed992e8152d..ccee25704fa 100644 --- a/docs/source/en/training.md +++ b/docs/source/en/training.md @@ -52,6 +52,7 @@ dataset = dataset.map(tokenize, batched=True) > [!TIP] > Fine-tune on a smaller subset of the full dataset to reduce the time it takes. The results won't be as good compared to fine-tuning on the full dataset, but it is useful to make sure everything works as expected first before committing to training on the full dataset. +> > ```py > small_train = dataset["train"].shuffle(seed=42).select(range(1000)) > small_eval = dataset["test"].shuffle(seed=42).select(range(1000)) diff --git a/docs/source/en/transformers_as_backend.md b/docs/source/en/transformers_as_backend.md index 422cc4a121e..d1070acea6f 100644 --- a/docs/source/en/transformers_as_backend.md +++ b/docs/source/en/transformers_as_backend.md @@ -32,6 +32,7 @@ vLLM automatically selects the best backend, and if a model isn’t natively sup from vllm import LLM llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers") ``` + Add `--model-impl transformers` to `vllm serve` to launch a server with a Transformers' model. ```bash @@ -42,7 +43,6 @@ vllm serve meta-llama/Llama-3.2-1B \ Refer to the [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) for more usage examples and tips on using a Transformers as the backend. 
- ## SGLang [SGLang](https://github.com/InternLM/sglang) is a high-performance, OpenAI-compatible server and runtime designed for chat-based LLMs. It offers fast inference, role-based conversation handling, and support for custom pipelines, making it great for building real-world LLM apps. @@ -57,12 +57,6 @@ print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0]) ``` Add `impl transformers` to `sglang.launch_server` to launch a server with a Transformers' model. - - - - - - ```bash python3 -m sglang.launch_server \ @@ -133,7 +127,7 @@ class MyModel(PreTrainedModel): 3. This step is optional, but if you want to support tensor parallel and/or pipeline parallel features, add the following keys to the config. * `base_model_tp_plan` enables [tensor parallelism](./perf_infer_gpu_multi) by mapping fully qualified layer name patterns to tensor parallel styles. Only the `"colwise"` and `"rowwise"` partitioning strategies are currently supported. * `base_model_pp_plan` enables pipeline parallelism by mapping direct child layer names to tuples of lists of strings. The list in the first element of the tuple contains the names of the input arguments. The list in the last element of the tuple contains the names of the variables the layer outputs to in the modeling code. - + Expand the code below for an example.
@@ -158,6 +152,7 @@ class MyConfig(PretrainedConfig): "norm": (["hidden_states"], ["hidden_states"]), } ``` +
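Once those plans are part of the config, loading the model with a tensor-parallel plan is enough to shard it across GPUs. A minimal sketch is shown below; `my-org/my-model` is a placeholder for a checkpoint whose config defines `base_model_tp_plan`.

```python
# Launched with e.g. `torchrun --nproc-per-node 4 demo.py`
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("my-org/my-model", tp_plan="auto")
tokenizer = AutoTokenizer.from_pretrained("my-org/my-model")

inputs = tokenizer("Tensor parallelism shards each layer across the available GPUs.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```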
### Multimodal models @@ -200,8 +195,8 @@ class MyMultimodalModelForConditionalGeneration(MyMultimodalPreTrainedModel, Gen self.model = MyMultimodalModel(config) self.lm_head = nn.Linear(hidden_dim, vocab_size) ``` -
+ 2. A multimodal model config must be nested with the following fields. * text_config: decoder language model config @@ -246,6 +241,7 @@ class MyMultimodalProcessor(ProcessorMixin): vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches}) return MultiModalData(**vision_data) ``` + ## Resources diff --git a/docs/source/en/troubleshooting.md b/docs/source/en/troubleshooting.md index 7998881d364..cfc51966893 100644 --- a/docs/source/en/troubleshooting.md +++ b/docs/source/en/troubleshooting.md @@ -34,7 +34,6 @@ Sometimes errors occur, but we are here to help! This guide covers some of the m For more details about troubleshooting and getting help, take a look at [Chapter 8](https://huggingface.co/course/chapter8/1?fw=pt) of the Hugging Face course. - ## Firewalled environments Some GPU instances on cloud and intranet setups are firewalled to external connections, resulting in a connection error. When your script attempts to download model weights or datasets, the download will hang and then timeout with the following message: diff --git a/notebooks/README.md b/notebooks/README.md index 4d31797104f..aed43587880 100644 --- a/notebooks/README.md +++ b/notebooks/README.md @@ -22,7 +22,6 @@ Also, we would like to list here interesting content created by the community. If you wrote some notebook(s) leveraging 🤗 Transformers and would like to be listed here, please open a Pull Request so it can be included under the Community notebooks. - ## Hugging Face's notebooks 🤗 ### Documentation notebooks @@ -38,7 +37,6 @@ You can open any page of the documentation as a notebook in Colab (there is a bu | [Summary of the tokenizers](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb) | The differences between the tokenizers algorithm |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb)| | [Multilingual models](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb) | How to use the multilingual models of the library |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/multilingual.ipynb)| - ### PyTorch Examples #### Natural Language Processing[[pytorch-nlp]] @@ -88,7 +86,6 @@ You can open any page of the documentation as a notebook in Colab (there is a bu | [How to fine-tune a Nucleotide Transformer model](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | See how to tokenize DNA and fine-tune a large pre-trained DNA "language" model | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | [![Open in AWS 
Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | | [Fine-tune a Nucleotide Transformer model with LoRA](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | Train even larger DNA models in a memory-efficient way | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | - #### Other modalities[[pytorch-other]] | Notebook | Description | | | @@ -101,7 +98,6 @@ You can open any page of the documentation as a notebook in Colab (there is a bu |:----------|:-------------|:-------------|------:| | [How to export model to ONNX](https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| Highlight how to export and run inference workloads through ONNX | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| - ### Optimum notebooks 🤗 [Optimum](https://github.com/huggingface/optimum) is an extension of 🤗 Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardwares.