* docs: ko: main_classes/peft.md
* feat: nmt draft
* docs: add missing TOC to documentation for `PeftAdapterMixin` section
Added a table of contents (TOC) to the documentation, specifically for the `transformers.integrations.PeftAdapterMixin` section, following the structure and content outlined in [this link](https://huggingface.co/docs/transformers/main/en/main_classes/peft#transformers.integrations.PeftAdapterMixin).
* fix: Improve naturalness of purpose expression in Korean
Changed '관리하기 위한' to '관리할 수 있도록' for more natural Korean expression when describing the purpose of providing functions.
* fix: Simplify plural form and make expression more concise
Changed '~할 수 없기 때문에' to '~할 수 없어' for more concise expression while maintaining clarity.
* fix: Replace technical term '주입' with more natural '적용'
Changed '주입할 수 없어' to '적용할 수 없어' for better readability.
Considered alternatives:
'삽입': Too literal translation of 'inject'
'입력': Could be misunderstood as data input
'통합': Implies merging two systems
'추가': Simple but less precise
'적용' was chosen as it's the most natural and widely used term in Korean technical documentation for this context.
* fix: update toctree path for PEFT to lowercase
Changed the toctree path from 'PEFT' (uppercase) to 'peft' (lowercase) to match the correct directory naming convention and prevent broken links.
* docs: update as per reviewer feedback after rebase
* Add Fast Segformer Processor
* Modified the params according to segformer model
* modified test_image_processing_Segformer_fast args
- removed redundant params like do_center_crop,center_crop which aren't present in the original segformer class
* added segmentation_maps processing logic form the slow segformer processing module with references from beitimageprocessing fast
* fixed code_quality
* added recommended fixes and tests to make sure everything processess smoothly
* Fixed SegmentationMapsLogic
- modified the preprocessing of segmentation maps to use tensors
- added batch support
* fixed some mismatched files
* modified the tolerance for tests
* use modular
* fix ci
---------
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* feat: superpoint fast image processor
* fix: reran fast cli command to generate fast config
* feat: updated test cases
* fix: removed old model add
* fix: format fix
* Update src/transformers/models/superpoint/image_processing_superpoint_fast.py
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* fix: ported to torch and made requested changes
* fix: removed changes to init
* fix: init fix
* fix: init format fix
* fixed testcases and ported to torch
* fix: format fixes
* failed
test case fix
* fix superpoint fast
* fix docstring
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* Add missing cache_position argument.
* Pass cache_position to language model.
* Overwrite prepare_inputs_for_generation.
* Set model to half precision for Flash Attention test.
* Cast model to bfloat16.
* add tests for helpers
* duplicate test for each model
* why llava next video has no helper
* oops must have been in the commit
* fix test after rebase
* add copy from
* support `typing.Literal` as type of tool parameters
* validate the `args` of `typing.Literal` roughly
* add test to get json schema for `typing.Literal` type hint
* fix: add `"type"` attribute to the parsed result of `typing.Literal`
* test: add argument `booleanish` to test multi-type literal
* style: auto fixup
* EP + updates
Co-authored-by: Nouamane Tazi <NouamaneTazi@users.noreply.github.com>
Co-authored-by: drbh <drbh@users.noreply.github.com>
* remove unrelated change
* not working yet but let's see where it goes!
* update the api a bit
* udpate
* where I am at for now
* fix ep
* refactor the API
* yups
* fix
* fixup
* clean modeling
* just support llama4 for now!
* properly avoid
* fix
* nits
* Update src/transformers/models/llama4/modeling_llama4.py
* Update src/transformers/integrations/tensor_parallel.py
* style
* ,,,,
* update
---------
Co-authored-by: Nouamane Tazi <NouamaneTazi@users.noreply.github.com>
Co-authored-by: drbh <drbh@users.noreply.github.com>
* upload initial code
* update deepseek-vl adaptor
* update hierarchy of vision model classes
* udpate aligner model
* add text model
* Added Image Processor
* Added Image Processor
* Added Image Processor
* apply masks
* remove projection; add aligner
* remove interpolate_pos_encoding
* remove unused params in config
* cleaning
* Add the __init__ file
* added processing deepseek_vl class
* modified the deepseek-vl processor
* modified the deepseek-vl processor
* update __init__
* Update the image processor class name
* Added Deepseek to src/transformers/__init__.py file
* Added Deepseek to image_processing_auto.py
* update the __init__ file
* update deepseek_vl image processor
* Update Deepseek Processor
* upload fast image processor
* Revert "upload fast image processor"
This reverts commit 68c8fd50bafbb9770ac70c9de02448e2519219b4.
* update image processor
* flatten heirarchy
* remove DeepseekVLModel
* major update (complete modeling)
* auto modeling and other files
* formatting
* fix quality
* replace torchvision in modeling
* set default do_normalize to False
* add fast image processor template using tool
* update image processors
* add fast image processor to other files
* update liscense
* Added deepseek image testcases
* update image test
* update processor
* write CHAT_TEMPLATE
* update model for processor
* fix processor
* minor fixes and formatting
* fix image processing and tests
* fix interpolation in sam
* fix output_attentions in DeepseekVLModel
* upload test_modeling
* fix tests because of vocab size
* set use_high_res_vision=False in tests
* fix all modeling tests
* fix styling
* remove explicit background_color from image processors
* added test_processor
* added test_processor
* fix processor tests
* update docs
* update docs
* update docs
* update conversion script
* Fixed typos
* minor fixes from review
- remove model_id comments in examples
- remove from pre-trained auto mapping
- move to image-text-to-text from vision-to-seq in auto mapping
- add image_token_index to __init__ for config
- remove outdated temporary config in conversion script
- update example to use chat_template in docstring example
- update liscense 2021->2025
* fix type in config docstring
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
* update get_image_features
* fix config
* improve DeepseekVLImageProcessor.preprocess
* return image_hidden_states
* use AutoTokenizer and AutoImageProcessor in Processor
* fix model outputs
* make num_image_tokens configurable
* fix docstring of processor
* move system prompt to chat template
* fix repo consistency
* fix return_dict
* replace SamVisionEncoder with SamVisionModel
* update to remove deepcopy
* 🛠️ Major Architectural Changes (Adds DeepseekVLHybrid)
* fix quality checks
* add missing hybrid in auto modeling
* run make style
* update sam_hq
* update high_res_size in test
* update docs following #36979
* update code with auto_docstring
* update conversion scripts
* fix style
* fix failing test because of tuple
* set weights_only=True in conversion script
* use safetensors.torch.load_file instead of torch.load in conversion script
* make output_dir optional in conversion script
* fix code snippets in docs (now the examples work fine)
* integration tests for DeepseekVL
* update expected texts
* make style
* integration tests for DeepseekVLHybrid
* fix class name
* update expected texts for hybrid
* run "make style"
* update since changes in main
* run make-style
* nits since changes in main
* undo changes in sam
* fix tests
* fix tests; update with main
* update with main: output_attention/output_hidden_states
* fix copied part in deepseek_vl
* run fix-copies
* fix output_hidden_states
* sam: fix _init_weigths
* use modular for DeepseekVL
* make image processor more modular
* modular: use JanusPreTrainedModel
* janus: provide kwargs in loss
* update processors in conversion script
* Revert "sam: fix _init_weigths"
This reverts commit db625d0c68956c0dad45edd7a469b6a074905c27.
* run fix-copies
---------
Co-authored-by: Shakib-IO <shakib.khan17@northsouth.edu>
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
* init
* Force qwen2VL image proc to fast
* refactor qwen2 vl fast
* fix copies
* Update after PR review and update tests to use return_tensors="pt"
* fix processor tests
* add BC for min pixels/max pixels
* fix most tests
* skip a few more tests
* address comments
* fix chameleon tests
* forgot to uncomment
* qwen has its own tests with images, rename it as well
* add owlv2 fast image processor
* add Owlv2ImageProcessorFast to Owlv2Processor image_processor_class
* add Owlv2ImageProcessorFast to Owlv2Processor image_processor_class
* change references to owlVit to owlv2 in docstrings for post process methods
* change type hints from List, Dict, Tuple to list, dict, tuple
* remove unused typing imports
* add disable grouping argument to group images by shape
* run make quality and repo-consistency
* use modular
* fix auto_docstring
---------
Co-authored-by: Lewis Marshall <lewism@elderda.co.uk>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* docs: Standardize OPT model card with enhanced details
* Remove incorrect link from OPT model card
* Address review feedback on OPT model card
* Update opt.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
- Fix Cyrillic 'Р' to Latin 'P' in Portuguese language link (README.md)
- Fix 'meanginful' to 'meaningful' in training documentation
- Fix duplicate 'Cohere' reference in modular transformers documentation
- Fix duplicate 'the the' in trainer and chat command comments
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
* First attempt
* fix
* fix
* Enhance TrackioCallback to log GPU memory usage and allocation
* Enhance Trackio integration in callbacks and training arguments documentation
* re order
* remove unused lines
* fix torch optional
* use partial to wrap around `transformers` utils!
* try to refactor?
* revert one wrong change
* just a nit
* push
* reverter watever was wrong!
* some nits
* fixes when there is no attention mask
* bring the licence back
* some fixes
* nit
* style
* remove prints
* correct dtype
* fa flags for testing
* update
* use paged attention if requested!
* updates
* a clone was needed, not sure why
* automatically create cu seq lens when input is flash, this at least makes sure layers don't re-compute
* simplify and improve?
* flash attention is kinda broken on recent cuda version so allow the opportunity to use something else
* fix!
* protect kernels import
* update
* properly parse generation config being passed
* revert and update
* add two tests
* some fixes
* fix test FA2
* takes comment into account
* fixup
* revert changes
* revert the clone, it is only needed because the metal kernel is not doing it?
* [docs] update attention implementation and cache docs (#39547)
* update docs
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* applu suggestions
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix mps on our side for now
* Update src/transformers/integrations/flash_paged.py
* no qa
---------
Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* feat: add support for gradient checkpointing in TimmWrapperModel and TimmWrapperForImageClassification
* ruff fix
* refactor + add test for not supported model
* ruff
* Update src/transformers/models/timm_wrapper/modeling_timm_wrapper.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/models/timm_wrapper/modeling_timm_wrapper.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/models/timm_wrapper/modeling_timm_wrapper.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/models/timm_wrapper/modeling_timm_wrapper.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* initial commit
* Apply suggestions from code review
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix: various typos, typehints, refactors from suggestions
* fix: fine_matching method
* Added EfficientLoFTRModel and AutoModelForKeypointMatching class
* fix: got rid of compilation breaking instructions
* docs: added todo for plot
* fix: used correct hub repo
* docs: added comments
* fix: run modular
* doc: added PyTorch badge
* fix: model repo typo in config
* fix: make modular
* fix: removed mask values from outputs
* feat: added plot_keypoint_matching to EfficientLoFTRImageProcessor
* feat: added SuperGlueForKeypointMatching to AutoModelForKeypointMatching list
* fix: reformat
* refactor: renamed aggregation_sizes config parameter into q, kv aggregation kernel size and stride
* doc: added q, kv aggregation kernel size and stride doc to config
* refactor: converted efficientloftr implementation from modular to copied from mechanism
* tests: overwrote batching_equivalence for "keypoints" specific tests
* fix: changed EfficientLoFTRConfig import in test_modeling_rope_utils
* fix: make fix-copies
* fix: make style
* fix: update rope function to make meta tests pass
* fix: rename plot_keypoint_matching to visualize_output for clarity
* refactor: optimize image pair processing by removing redundant target size calculations
* feat: add EfficientLoFTRImageProcessor to image processor mapping
* refactor: removed logger and updated attention forward
* refactor: added auto_docstring and can_return_tuple decorators
* refactor: update type imports
* refactor: update type hints from List/Dict to list/dict for consistency
* refactor: update MODEL_MAPPING_NAMES and __all__ to include LightGlue and AutoModelForKeypointMatching
* fix: change type hint for size parameter in EfficientLoFTRImageProcessor to Optional[dict]
* fix typing
* fix some typing issues
* nit
* a few more typehint fixes
* Remove output_attentions and output_hidden_states from modeling code
* else -> elif to support efficientloftr
* nit
* tests: added EfficientLoFTR image processor tests
* refactor: reorder functions
* chore: update copyright year in EfficientLoFTR test file
* Use default rope
* Add docs
* Update visualization method
* fix doc order
* remove 2d rope test
* Update src/transformers/models/efficientloftr/modeling_efficientloftr.py
* fix docs
* Update src/transformers/models/efficientloftr/image_processing_efficientloftr.py
* update gradient
* refactor: removed unused codepath
* Add motivation to keep postprocessing in modeling code
* refactor: removed unnecessary variable declarations
* docs: use load_image from image_utils
* refactor: moved stage in and out channels computation to configuration
* refactor: set an intermediate_size parameter to be more explicit
* refactor: removed all mentions of attention masks as they are not used
* refactor: moved position_embeddings to be computed once in the model instead of every layer
* refactor: removed unnecessary hidden expansion parameter from config
* refactor: removed completely hidden expansions
* refactor: removed position embeddings slice function
* tests: fixed broken tests because of previous commit
* fix is_grayscale typehint
* not refactoring
* not renaming
* move h/w to embeddings class
* Precompute embeddings in init
* fix: replaced cuda device in convert script to accelerate device
* fix: replaced stevenbucaille repo to zju-community
* Remove accelerator.device from conversion script
* refactor: moved parameter computation in configuration instead of figuring it out when instantiating a Module
* fix: removed unused attributes in configuration
* fix: missing self
* fix: refactoring and tests
* fix: make style
---------
Co-authored-by: steven <steven.bucaille@buawei.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* improve handlike of other image-like inputs in fast image processors
* fix issues with _prepare_images_structure
* update sam image processor fast
* use dict update
* init
* copied from remote
* add proper structure and llama like structure
* fixup
* revert to state that works
* get closer to llama
* slow and steady
* some removal
* masks work
* it is indeed the rope implementation, how dafuq does it mesh with the cache now hmm
* nice
* getting closer
* closer to transformers style
* let's simplify this, batching works now
* simplified
* working version with modular
* it is indeed the rotation per weights, make it complete llama style
* cleanup conversion, next to look at -> tokenizer
* remove llama artefacts
* fix modeling tests (common ones)
* style
* integration test + first look into tokenization (will need more work, focussing on modeling other models first)
* style
* working moe version, based on remote
* lets keep it simple and go step by step - transformers annotations for modular and transformers style rope (complex view)
* more cleanup
* refactor namings and remove addition forXXX classes
* our moe won't cut it it seems, correction bias seems to be missing in remote code version
* tokenization change (remote)
* our moe version works when adding normalization :D
* cleanup moe
* nits
* cleanup modeling -> let's get to modular next
* style
* modular v1
* minor things + attempt at conversion (which doesn't work)
* no conversion follow glm, fixup modular and other nits
* modular cleanup
* fixes
* tests, tests, tests + some moe dtype forcing
* simplify modular, fix fatal fa2 bug, remaining tests
* fix import issue?
* some initial docs, fix bnb faulty behavior --> needs to fix some tests because of gate needing to be float
* fix sdpa test, load on init dtype only
* fixup post merge
* style
* fix doc links
* tokenization cleanup beginnings
* simplify tokenizer by a lot as its basically llama
* tokenizer is full llama with different defaults + extra special tokens
* sync og special tokens of ernie
* fix decoding with numbers (also in remote done what a timing), begin of tok tests
* align with remote and preserve special tokens, adjust tests to ernie legacy behavior, warning for questionable behavior (also in llama)
* nits
* docs
* my daily post merge it is
* check
* tokenization update with explanations and conversion script
* review on modular (til), revert some tokenizer things i did prior, remove mtp comment (low prio)
* post merge fixes
* fixup tokenization, llama fast is the way to go
* more fixups
* check
* import fixes
* correction bias following the paddle code
* fix
* fix TP plan, fix correction bias sharding during forward
* style
* whoops
* fix tied weights
* docs and last nit
* license
* flasky tests
* move repo id, update when merged on the hub
* simplify common get/set
* remove some noise
* change some 5 years old modeling utils
* update examples
* fix copies
* revert some changes
* fixes, gah
* format
* move to Mixin
* remove smolvlm specific require grad
* skip
* force defaults
* remodularise some stuff
* remodularise more stuff
* add safety for audio models
* style
* have a correct fallback, you daft donkey
* remove this argh
* change heuristic for audio models
* fixup
* revert
* this works
* revert again
* 🧠
* aaah ESM has two modelings aaah
* add informative but short comment
* add `input_embed_layer` mixin attribute
* style
* walrus has low precedence
* modular fix
* this was breaking parser
Enable average_tokens_across_devices by default in TrainingArguments
Fixes#39392
This change improves loss calculation correctness for multi-GPU training by enabling proper token averaging across devices by default.
Co-authored-by: Krishnan Vignesh <krishnanvignesh@Krishnans-MacBook-Air.local>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* fix qwen2 vl packing in FA2
* why? delete!
* qwen2-5-vl seems to work now
* update
* fix tests
* start by adapting FA2 tests
* add similar tests for sdpa/eager
* address comments
* why is this even in conditional model and not base model?
* fix type order
* change all Union[str, dict] to Union[dict, str]
* add hf_parser test && fix test order
* add deepspeed dependency
* replace deepspeed with accelerator
* Scaffolding
* Explicit content
* Naïve Responses API streaming implementation
* Cleanup
* Scaffolding
* Explicit content
* Naïve Responses API streaming implementation
* Cleanup
* use openai
* validate request, including detecting unused fields
* dict indexing
* dict var access
* tmp commit (tests failing)
* add slow
* use oai output type in completions
* (little rebase errors)
* working spec?
* guard type hint
* type hints. fix state (CB can now load different models)
* type hints; fn names; error type
* add docstrings
* responses + kv cache
* metadata support; fix kv cache; error event
* add output_index and content_index
* docstrings
* add test_build_response_event
* docs/comments
* gate test requirements; terminate cb manager on model switch
* nasty type hints
* more type hints
* disable validation by default; enable force models
* todo
* experiment: base model from typed dict
* audio working
* fix bad rebase
* load audio with librosa
* implement timed models
* almost working
* make fixup
* fix tests
* transcription request type
* tokenizer -> processor
* add example in docs
---------
Co-authored-by: Lysandre <hi@lysand.re>
* Add the `device` option for `generate()`
* Add device for default tensors to avoid tensor mismatch
* [test] Enable test_static_cache_exportability for torch_device
* infer device from the prompt_token_ids
* Add device for generated tensor
* [Test] Make `test_export_static_cache` tests to run on devices rather than only CPU
* fix format
* infer device from the model
* wip: adding first version of the IJEPA model card.
* refactor based on the @stevhliu feedbacks
* refactor:
- revert the accidental removal of the autodoc api description and the image reerece architecture
- general context updation.
* - changes of model for example quantization.
- merging the quantization content.
Fix indentation bug in Idefics3 image processor
- Fix KeyError when do_image_splitting=False
- Move split_images_grouped assignment inside loop
- Ensures all image shapes are stored, not just the last one
- This fixes the bug in both Idefics3 and generated SmolVLM processors
cc @yonigozlan
Co-authored-by: Krishnan Vignesh <krishnanvignesh@Krishnans-MacBook-Air.local>
* Fix typo in generation configuration for Janus model weight conversion
* Fix typo
* Update Janus model generation configuration
* Update Janus model to use generation_kwargs
* dump
* push other models
* fix simple greedy generation
* xmod
* add fmst and clean up some mentions of old cache format
* gpt-bigcode now follows standards
* delete tuple cache reference in generation
* fix some models
* fix some models
* fix mambas and support cache in tapas
* fix some more tests
* fix copies
* delete `_reorder_cache`
* another fix copies
* fix typos and delete unnecessary test
* fix rag generate, needs special cache reordering
* fix tapas and superglue
* reformer create special cache
* recurrent gemma `reorder_cache` was a no-op, delete
* fix-copies
* fix blio and musicgen pipeline tests
* fix reformer
* fix reformer, again...
* delete `_supports_cache_class`
* delete `supports_quantized_cache`
* fix failing tests
* fix copies
* some minor clean up
* style
* style
* fix copies
* fix tests
* fix copies
* create causal mask now needs positions?
* fixc copies
* style
* Update tests/test_modeling_common.py
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* clean-up of non-generative model after merging main
* check `is_decoder` for cache
* delete transpose for scores
* remove tuple cache from docs everywhere
* fix tests
* fix copies
* fix copies once more
* properly deprecate `encoder_attention_mask` in Bert-like models
* import `deprecate_kwarg` where needed
* fix copies again
* fix copies
* delete `nex_decoder_cache`
* fix copies asks to update for PLM
* fix copies
* rebasing had a few new models, fix them and merge asap!
* fix copies once more
* fix slow tests
* fix tests and updare PLM checkpoint
* add read token and revert accidentally removed line
* oh com -on, style
* just skip it, read token has no access to PLM yet
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* Added StableAdamW as an optimizer option for Trainer. Also wrote tests to verify its behaviour.
* Fixed issue with
* Added docs for StableAdamW. Also fixed a typo in schedule free optimizers
---------
Co-authored-by: Gautham Krithiwas <gauthamkrithiwas2003@gmail.com>
* add test scanner
* add doc + license
* refactor for only 1 tree traversal
* add back test of only one method
* document single method scan
* format
* fixup generate tests
* minor fix
* fixup
* fixup doc
* add cosine_with_min_lr_schedule_with_warmup_lr_rate scheduler in trainer
* Update src/transformers/optimization.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update optimization.py
fix the error of the unclosed "("
* Update optimization.py
remove whitespace in line 402 in order to pass the quality test
* Update src/transformers/optimization.py
* Update src/transformers/optimization.py
* Apply style fixes
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
fix: 🐛 Fixed a bug in calculating Cross Entropy loss in JetMoeForCausalLM
In the original code, we shift the logits and pass shift_logits into the self.loss_function, but in self.loss_function, the shift_logits will be shifted again, so we are actually doing "next next token prediction", which is incorrect. I have removed the logits shifting before calling self.loss_function.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix vlm with retrieval
* we can't use AutoModel because new ColQwen was released after refactor
* no need for colqwen
* tied weight keys are necessary, if using IMageTextToText
* need to apply renaming in tied weights, only for ColPali
* overwrite tied keys in ColPali
* fix copies, modular can't handle if-statements
* working locally; need to style and test
* added docs and initial tests; need to debug and flesh out
* fixed tests
* working long context; batches
* working fa2 and eager
* update tests
* add missing confnigs
* remove default autoset
* fix spacing
* fix most tests
* fixed tests
* fix to init
* refactor to match new transformers updates
* remove static cache option
* fa2 fix
* fix docs
* in progress
* working on tests
* fixed issue with attn outputs
* remove debug
* fix local config attr
* update doc string
* fix docstring
* add docs to toc
* correct typo in toc
* add new updates from main w.r.t. ModernBERT RoPE
* fix local param
---------
Co-authored-by: oweller2 <oweller2@dsailogin.mgmt.ai.cluster>
Co-authored-by: oweller2 <oweller2@l07.mgmt.ai.cluster>
Co-authored-by: oweller2 <oweller2@n02.mgmt.ai.cluster>
Co-authored-by: oweller2 <oweller2@l08.mgmt.ai.cluster>
Co-authored-by: oweller2 <oweller2@l01.mgmt.ai.cluster>
Co-authored-by: oweller2 <oweller2@l02.mgmt.ai.cluster>
* Update modeling_qwen2_5_vl.py
### 🐛 Bug Description
When using Unsloth’s Qwen2.5-VL vision models (both 3B and 7B) with the latest HuggingFace Transformers (commit: 520b9dcb42cef21662c304583368ff6645116a45), the model crashes due to a type mismatch in the attention mask handling.
---
### 🔥 Error Traceback
* Fix dtype compatibility in attention mask processing
Replace hardcoded torch.finfo() usage with dtype-aware function selection to handle both integer and floating-point attention mask tensors.
Technical Details:
Problem: Line 1292 assumes floating-point dtype for attention_mask_tensor
Solution: Add dtype check to use torch.iinfo() for integer types and torch.finfo() for float types
Files Modified: transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py
* Update modeling_qwen2_5_vl.py
* Update modeling_qwen2_5_vl.py
* Fix: Cast to float before applying torch.finfo
* # Fix: Use appropriate function based on dtype
* Update modular_qwen2_5_vl.py
* Fix: Cast to float before applying torch.finfo
* Fix: Use appropriate function based on dtype
* Fix: Use appropriate function based on dtype
* Updatet modeling_glm4v.py
* Only apply conversion for floating point tensors (inverted masks)
* corrected the format issue
reformatted modeling_glm4v.py
All done! ✨🍰✨
1 file reformatted
* Fix: Cast to float before applying torch.finfo
Corrected the format issue
* Fix torch.finfo() for integer attention mask
#39333
* Run make fix-copies and make style for CI compliance
- Updated dependency versions table
- Fixed code formatting and style issues
- Sorted auto mappings
- Updated documentation TOC
* Fix torch.finfo() TypeError for
Fix torch.finfo() TypeError for integer attention_mask_tensor #39333
* Fix torch.finfo() TypeError for integer
* Updated CamemBERT model card to new standardized format
* Applied review suggestions for CamemBERT: restored API refs, added examples, badges, and attribution
* Updated CamemBERT usage examples, quantization, badges, and format
* Updated CamemBERT badges
* Fixed CLI Section
* fix ast deprecations for python 3.14: replace node.n by node.value and use `ast.Constant`
More verbose exceptions in `fix_docstring` on docstring formatting issues.
* plm template
* A working plm with fixed image features
* hacked processor
* First version that reproduced PLM output using PE from timm.
* Simplify and fix tie_word_embeddings
* Use PIL resize. Simplify converstion.
* First version that works with video input.
* simplifed image preprocessing (not batched)
* Minor fixes after rebasing on main.
* Video processor based on new API.
* Revert to use _preprocess for image processor.
* refactor with modular
* fix tie_word_embedding
* Testing with timm PE
* check in missed converstion from modular to model.py
* First working version of PLM with Eva PE. PLM-1B and 3B outputs are exactly the same as before. PLM-8B output has some differences.
* address review comments
* Fixed batching if video and image examples mixed.
* Simplify PE configuration.
* Enable AutoModel for PerceptionEncoder.
* Update PE config style.
* update all headers
* Minor fixes.
* Move lm_head to PerceptionLMForConditionalGeneration.
Fix vit_G model specification.
* Fix for testing_modeling_perception_lm.py
* Image processing refactoring to use more common parts.
* Fix processor test.
* update tests to use model from hub
* More test fixes.
* integration test GT update after rebasing; probably due to video preprocessing
* update test media path to hub
* Stop tracking local scripts
* address some review comments
* refactor image processing.
* small fixes
* update documentation and minor fixes
* remove scripts
* Minor fix for CI
* Fix image processing
* CI and doc fix
* CI formatting fix
* ruff fix
* ruff formatting
* ran utils/sort_auto_mappings.py
* update docstring
* more docstring udpates
* add vision_input_type default fallback for image processing
* more verbose variable naming
* test update
* Remove PE and PEConfig use AutoModel(TimmWrapper) instead
* Minor cleanup.
* Minor Fix: remove any ref to PE. Ruff format and check.
* fix docstring
* Fix modular/model consistency.Improvex docstringfor .
* Fix PerceptionLMForConditionalGenerationModelTest
* ruff fix
* fix for check_repo
* minor formatting
* dummy size arg to fix for processor test.
* Update docstring for PerceptionLMConfig
* Minor fixes from review feedback.
* Revert some minor changes per reviewer feedback.
* update base_model_prefix
* address reviewer feedback
* fix comment in modeling file
* address reviewer feedback
* ruff format
* Pre-merge test update.
* reapply modular and fix checkpoint name
* processor test path
* use modular a bit more
* remove dead code
* add token decorator
---------
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* Updated Switch Transformers model card with standardized format (Issue #36979)
* Apply reviewer suggestions to the new standardised Switch Transformer's model card
* Update switch_transformers.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* changes for video
* update modular
* change get_video_features
* update video token replacement
* update modular
* add test and fix typo
* lint
* fix order
* lint
* fix
* remove dependency
* lint
* lint
* remove todo
* resize video for test
* lint..
* fix test
* new a processor for video_test
* fix test
Also add notes asking users to set `TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1`
or call `torch._dynamo.config.capture_scalar_outputs = True`, as currently
this will cause a graph break.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* ensure the query is updated during training
avoid unused parameters that DDP does not like
* avoid a crash when `kwargs` contain `padding=True`
trainers often pass this argument automatically
* minor
* Remove mel_spec lazy init, and rename to mel_filters.
this ensures save_pretrained will not crash when saving the processor during training
d5d007a1a0/src/transformers/feature_extraction_utils.py (L595)
* minor - most feature extractors has a `sampling_rate` property
* speedup relative position embeddings
* fix several issues in model saving/loading:
- avoid modifying `self._hf_peft_config_loaded` when saving
- adapter_config automatically points to the original base model - a finetuned version should point to the model save dir.
- fixing model weights names, that are changed by adding an adapter.
* minor
* minor
* minor
* fixing a crash without peft active
* add todo to replace einsum
* granite speech speedups:
1. register attention_dist to avoid cpu-to-gpu transfer every layer.
2. pad_sequence is much faster than per-sample-padding + concat.
3. avoid returning audio back to cpu when using a compute device.
* support audio.shape=(1,L)
* add initial structure
* doc fixes, add model base logic
* update init files
* some fixes to config and modular
* some improvements for attention
* format
* remove unused attn
* some fixes for moe layer and for decoder
* adapt _compute_yarn_parameters for deepseek
* format
* small fix
* fix for decoder forward
* add tests, small refactoring
* fix dummies
* fix init
* fix doc
* fix config docs
* add sequce doc, fix init for gate
* fix issues in tests
* fix config doc
* remove unused args
* some fixes and refactoring after review
* fix doc for config
* small fixes for config args
* revert config refactoring
* small refactoring
* minor fixes after rebase
* small fix after merge
* fix modular
* remove rotaryembd from public init
* small test fix
* some rotary pos calculation improvement
* fix format
* some improvements and fixes
* fix config
* some refactoring
* adjust some unit tests
* skip test
* small fixes and tests adjustment
* reapply modular
* fix all tests except Integration
* fix integration testzs
* cleanup BC stuff
* rope
* fix integrations tests based on a10
* style
---------
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* Add Doge Model
* Fix code quality
* Rollback an error commit
* Fix config for open-source weights
* Revert "Fix config for open-source weights"
This reverts commit 229cdcac10a6a4274d1dd13b729bc14c98eb0c76.
* Add modular_doge
* Update Doge inherits from Llama
* Fix import bug
* [docs] Add usage of doge model
* Fix Doge import pretrainedconfig from modeling_utils to configuration_utils
* [docs] remove trust remote code from doge
* Fix dynamo bug in doge model
* Update docstrings
* Import apply_rotary_pos_emb and repeat_kv from Llama
* Fix all nits
* Fix code quality
* Fix some bugs
* Fix code quality
* Remove inherited `_update_causal_mask` from Llama
This leads to incorrect weight initialization.
* Fix the wrong tensor orderings in DogeCDMoE
* Fix attention mask bug
We have to provide attention_mask for dynamic mask computation
* Modify most implementations to inherit from Llama
But there are two problems:
1. `flex_attention_forward` is not updated properly
2. `Example` error in the forward method of DogeForCausalLM
* Modify CDMoE for batch efficient implementation
* Uniform MoE configuration names, just like QwenMoE
* Fix code quality
* Fix code quality
* Fix code quality
* Add tp plan of CDMoE Module
* Hybird DMA with sliding window
* Update valid tokens greater than window size
* Fix code quality
* Add `convert_doge_weights_to_hf`
* Fix STATE_DICT_MAPPING in convert_doge_weights_to_hf.py
* Fix nits in modular_doge
* Fix code quality
* Fix all nits
* Fix all nits
* Make sure the attention function is updated inside the class
* Fix code quality issues in the Doge model and add a test for it
* Fix `test_generate`
* Fix code quality
* Fix nits fllowing suggestions
* Fix code quality
* Fix code quality issues
* Fix nits
* Fix code quality nits
* Fix the missing parameters in the configuration.
* Fix the missing parameters in the configuration.
* Fix nits
* Add initialization of attention
* Fix last nits
* Simplify dynamic mask generation logic
* Rename router_logits to gate_logits for matching latest changes of MixtralModel
* Rename typings for matching latest changes of MixtralModel
* Fixes typo in comment
* Update src/transformers/models/doge/modular_doge.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Fix code quality issues to match other modular
* Fix code quality issues to match other modular
* Fix the static compilation errors
* Update model weights link
* Fix code quality issues to match other modular
* reapply modular and support for new outputs
* style
* simplify a lot
* fix import location
* reapply modular
* fix
* fix integration test
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* Fix errors when use verl to train GLM4.1v model
* Support glm4v load from AutoModelForVision2Seq
* Set glm4v model _checkpoint_conversion_mapping attr from None to {}
* Update modeling_auto.py
* fix(decoding): stop beam search per-instance when heuristic satisfied
Previously, when early_stopping is set to `False`, the early-stopping heuristic only halted generation when **all** batch instances reached the criterion. This caused instances that are impossible (suggested by the heuristic) to improve keep generating, leading to inconsistent and overlong outputs across the batch.
Now we apply the heuristic **per-instance**: once a certain instance of batch has its all beams impossibe to improve, we mark that instance finished while letting others continue. This restores expected behavior and ensures consistency in batched generation.
* Add test case GenerationIntegrationTests.test_beam_search_early_stop_heuristic
* Update naming improvement_possibility -> is_early_stop_heuristic_unsatisfied
* Add comments for early stop heuristic
* Update src/transformers/generation/utils.py
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
- Complete Apache License text in Italian documentation
- Remove duplicate variable assignment in Perceiver converter
- Fix typo in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES constant
* chameleon xpu bnb groundtruth update on bnb triton backend since we are
deprecating ipex backend
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* enable hqq uts on XPU, all passed
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* fix comment
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
---------
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* update the glm4 model readme
* update test
* update GLM-4.1V model
* update as format
* update
* fix some tests
* fix the rest
* fix on a10, not t4
* nit: dummy import
---------
Co-authored-by: raushan <raushan@huggingface.co>
* [video processors] Support float fps for precise frame sampling
Enable fractional fps values (e.g., 1.5, 29.97) in video processors
for more precise frame sampling control.
- Change fps type from int to float across all video processors
- Maintain backward compatibility with integer values
Extends: #38105
* [video processors] Refine fps typing to Union[int, float]
Change fps type from Optional[float] to Optional[Union[int, float]]
for more explicit type information about supporting both integer
and floating-point frame rates.
- Update type hints and docstrings across 8 files
- Maintain backward compatibility
- Clarify support for both int and float values
Extends: #38105
* Revert "[video processors] Support float fps for precise frame sampling"
This reverts commit 7360d6e661b413ca0239e5ef61f9b1abbeab8e65.
* just update 2 files
* update other models as well just making fix-copies
* also add the changes needed to modeling utils
* put this on the pretrained model instead
* nits and fixes
* update generic, fix to use config value
* update other modelings
* use transformers kwargs instead
* update
* update
* update other models
* update
* updates
* update
* update
* update
* fix
* finally
* very small nits
* this fixes more tests
* fix other models as well!
* update modularqwen2
* update models based on qwen2
* update
* update
* remove the **flash stuff in favor of noraml kwargs
* update
* propagate gemma?
* remove output attentions
* propagate
* support cross attention edge case
* same
* test this
* fixes
* more fix
* update
* update
* fix conflicts
* update
* fix emu3
* fix emu3
* move the fix a bit
* quel enfer
* some fixes, loss_kwargs should never had been
* finish fixing gemma3n
* fix small lm3
* fix another one
* fix csm now
* fux csm and mistral
* fix mistral now
* small fixes
* fix janusss
* only for some models
* fixup
* phix phi3
* more fixes?
* dose this fix it?
* update
* holy shit it was just graph breaks
* protect torch
* updates
* fix samhq?
* fix moonshine
* more moonshine fixes, 3 failures left!
* nits
* generic needs to support more
* more fixes to moonshine!
* fix cross attention outputs!
* fix csm!
* nits
* fix stupid kosmos2
* current updates
* fixes
* use output recorder?
* nicer!
* a little bit of magic
* update
* fix protect
* fix
* small fixes
* protect import
* fix a bunch of more models
* fix fixups
* fix some of the last ones
* nit
* partly fix phi
* update
* fix import path
* make something that is fullgraph compatible just to be sure
* typing was wrong on llama so the rest was wrong as well
* fucking ugly but at least it is still exportable
* syle
* supposed to fix moonshine, it still breaks
* fix some default
* fix the last bits of sam
* update samhq
* more fixes to am hq
* nit
* fix all output+hidden states and output_attentions!
* fix?
* fix diffllama
* updates to fix initialization on the sam pips
* ups there was a bug
* fix the last sam hq test
* fix gotocr
* fix gotocr2!
* fixes
* skip stupid tests
* there was one left :)
* fixup
* fix fix copies issues with this test file
* fix copies for sam_hq
* rm some comments
* skip 2 more failing tests
* fix
* fix everything
* Apply suggestions from code review
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* add more doc!
* fix public init
* fix modular qwen3
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* more torch.hpu patches
* increase top_k because it results in flaky behavior when Tempreture, TopP and TopK are used together, which ends up killing beams early.
* remove temporal fix
* fix scatter operation when input and src are the same
* trigger
* fix and reduce
* skip finding batch size as it makes the hpu go loco
* fix fsdp (yay all are passing)
* fix checking equal nan values
* style
* remove models list
* order
* rename to cuda_extensions
* Update src/transformers/trainer.py
* Expectations for llava_next_video
* Updated image src in aria
* Fix test_small_model_integration_test
* Fix small model integration llama
* Fix a bunch of tests
* Style
* Shortened generation in test from 900 to 90
* Fix index out of bounds exception on wrong kv reuse
* Prevent loading same model twice
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
* Fixed some devices errors
* Fixed other device issues and more expectations
* Reverted support flags
* style
* More granular support
* Fixed some rebase stuff
* add a not None check before .to
* fix FA2
* update is causal flag and remove mask for FA2
* update for FA2 with varlen path
* how the tests were passing with different devices?
* add comment and ref to the PR
* move mask preparation to base pretrained model
* seq len is the first dim, not second
* fix copies to fix GLM4V
* deprecate for 1 version
* style
* fix some tests
* fix esm
* skip for now, GC requires positional args but we have keyword args
* remove transpose for scores in modified models only
* skip fx trace tests
* remove the skips
* fix the epsilon to a small value (does not make sense otherwise)
* safeguard
* overload test_eager_matches_sdpa
* Update test_modeling_common.py
* skip appropriate tests
* correct no_split_layer
* fix all devices issue
* fix backward
* fix
Updating Gemma 3n docs and docstrings to clarify the relationship
between the newly trained audio encoder used in Gemma 3n and the USM
model from the original paper.
TST Fix PEFT integration test bitsandbytes config
The PEFT integration tests still used load_in_{4,8}_bit, which is
deprecated, moving to properly setting BitsAndBytesConfig. For 4bit,
also ensure that nf4 is being used to prevent
> RuntimeError: quant_type must be nf4 on CPU, got fp4
* Add Fast Image Processor for Chameleon
* add warning to resize and move blend_rgba to convert_to_rgb
* Remove unrelated files
* Update image_processing_chameleon_fast to use auto_docstring
* fix equivalence test
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* add fast image processor nougat
* test fixes
* docstring white space
* last fixes
* docstring_type
* tolerance unit test
* fix tolerance
* fix rtol
* remove traling white space
* remove white space
* note for tolerance unit test
* fix tests
* remove print
---------
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Some PEFT integration tests involving text generation pipelines were
failing since #38129 because the base model is too small to generate
longer sequences. Setting max_new_tokens fixes this.
* timestamp token is end of token time !!!
* ensure correct alignment between tokens and timestamp tokens
* ignore input tokens for DTW computation
* use num_frames to avoid token timestamp hallucinations
* token timestamps test updates !
* num_frames: deprecate and use attention_mask instead
* avoid breaking change
* fix the pipeline usage for chunk approach
* make style
* better logging
* better logging
* make style
* update tests with correct values
* Update PEGASUS-X model card
* Add cache_implementation argument in quantization code example
* Update CLI example
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Remove TensorFlow and Flax badges
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs: first draft to more standard SuperPoint documentation
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs: reverted changes on Auto classes
* docs: addressed the rest of the comments
* docs: remove outdated reference to keypoint detection task guide in SuperPoint documentation
* Update superpoint.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* remove compile on mask creation, ensure kv blocks do not explode on indices
* trigger ci
* switch dynamic compilation to false
* patch new masking functions as well
* add len check
* i was wrong
* last comment
* Gemma 3n
* initial commit of Gemma 3n scaffold
* Fixing param pass through on Gemm3p5RMSNorm
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Adds AltUp to Gemma 3n
* Adding Gemma3p5 overall and text config with vision and audio config placeholders (#3)
* Adding gemma3p5 text configs
* Adding audio config placeholders
* Adding a placeholder for vision configs
* Updating MobileNetVisionConfig, inheriting TimmWrapperConfig
* Updating text configs
* Update src/transformers/models/gemma3p5/modular_gemma3p5.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Removing altup configs to accept the suggested configs
* Update src/transformers/models/gemma3p5/modular_gemma3p5.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating altup config
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Addressing review comments and updating text configs
* Adding a config for activation sparsity
* Updating configs to pass through options to super class init and adjust some name prefixes
* Updating laurel and altup with corrected config values
* Normalizing sub_config initializers
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating MLP with activation sparsity (#2)
* Updating DecoderBlock for Gemma 3n (#3)
* Initial Gemm3nTextModel (#4)
NOTE: This implementation WILL CHANGE in the coming weeks, however, changes will be strictly additive and this will remain a suitable baseline for downstream implementations to reference.
* Adding KV Cache Sharing
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Refactored kv cache sharing in attention
* Adding KVStore for cache sharing
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update src/transformers/cache_utils.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Updating KV Cache Sharing implementation
* Updating the q and k norm definitions in the attention module
* Fixing name error for q,k,v RMS norm to use the right 3n module
* Updating MLP with activation sparsity
* Updating DecoderBlock for Gemma 3.5
* Updating kv cache sharing implementation with the use of a cache buffer and refactoring some lines of code
* Isolating KV Cache logic to relevant components
* Fixing logic error in Gemma3nAttention.forward
* Refactoring caching contributions and fixing kv_store initialization
* Simplifying Configs
* Remove errant self from super init call
* Bug fix in the Attention module - changing self.head_dim to config.head_dim
* Bug fixes in the LaurelBlock and RMS Norm super init call
* removing redundant code from a merge
* Adding per_layer_inputs to TextModel
* Adding preprocess embeddings with altup
* Adds per-layer-to-single output and a host of TODOs
* Integrating altup predict with the model workflow and other minor bug fixes
* Using nn.Embedding temporarily for text model
* It goes forward
* Minor refactor of attention sparsity and RoPE initialization
* Fixing duplicate rope_scaling param bug when loading from pretrained
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Normalizing on altup_num_inputs config option
* regenerating modeling file after syncing to HEAD
* Use torch.std(..., unbiased=False) for activation sparsity (#8)
* Refactoring to a single QVK Norm (#13)
* AltUp: support scale_corrected_output (#14)
* Converts einsums to nn.Linear (#7)
* Converts einsums to nn.Linear
* Removing unused variables
* Aligning SharedKVCache with HybridCache (#11)
* Alinging SharedKVStore with HybridCache
* Remove KVStore. Refactor apply_rotary_pos_emb for sharing
* Addressing review comments
* Supporting split modality embeddings in Gemma3n (#10)
* Adding the Embedder class
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Addressing review comments, adding audio embedding layers, integrating embedder with the remaining architecture, adding a forward method for conditional generation
* Apply suggestions from code review
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Addressing review comments, prop drilling audio and vision configs to the text config
* Removing TODO's that have been addressed
* Simplify Embedder init and add audio embeddings
* Embeddings refactor. Adds Gemma3nAudioEmbedder and Gemma3nVisionEmbedder
* Refactoring vision and audio embeddings into ConditionalGeneration model
---------
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating attention mask for Gemma 3.5 (#15)
* xxx_token_index to xxx_token_id
* remvoing deprecated last_cache_position
* Removing references to SigLIP
* Always init per-layer inputs
* Using torch.finfo().min for epsilon_tensor
* Gemma3nDecoderLayer inherits from Gemma3DecoderLayer. Remove gating lambdas
* fix modular GEMMA3N_INPUTS_DOCSTRING
* Gemma3nAttention inherits from Gemma3Attention
* Modular inheritance fixes
* CausalLM conversion script for 4B model (#16)
* Add Gemma3n Audio Encoder (#6)
* initial commit of Gemma 3.5 scaffold
* Fixing param pass through on Gemm3nRMSNorm
* Adds Einsum layer to Gemma 3.5
* Updating EinsumLayer API
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Adds AltUp to Gemma 3n
* Adding Gemma3n overall and text config with vision and audio config placeholders (#3)
* Adding gemma3n text configs
* Adding audio config placeholders
* Adding a placeholder for vision configs
* Updating MobileNetVisionConfig, inheriting TimmWrapperConfig
* Updating text configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Removing altup configs to accept the suggested configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating altup config
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Addressing review comments and updating text configs
* Adding a config for activation sparsity
* Updating configs to pass through options to super class init and adjust some name prefixes
* Updating laurel and altup with corrected config values
* Normalizing sub_config initializers
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating MLP with activation sparsity (#2)
* Updating DecoderBlock for Gemma 3.5 (#3)
* Initial Gemm3nTextModel (#4)
NOTE: This implementation WILL CHANGE in the coming weeks, however, changes will be strictly additive and this will remain a suitable baseline for downstream implementations to reference.
* Adding KV Cache Sharing
* Adds Einsum layer to Gemma 3.5
* Updating EinsumLayer API
* Refactored kv cache sharing in attention
* Adding KVStore for cache sharing
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update src/transformers/cache_utils.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Updating KV Cache Sharing implementation
* Updating the q and k norm definitions in the attention module
* Fixing name error for q,k,v RMS norm to use the right Gemma 3n module
* Updating MLP with activation sparsity
* Updating DecoderBlock for Gemma 3.5
* Updating kv cache sharing implementation with the use of a cache buffer and refactoring some lines of code
* Isolating KV Cache logic to relevant components
* Fixing logic error in Gemma3nAttention.forward
* Refactoring caching contributions and fixing kv_store initialization
* Simplifying Configs
* Remove errant self from super init call
* Bug fix in the Attention module - changing self.head_dim to config.head_dim
* Bug fixes in the LaurelBlock and RMS Norm super init call
* removing redundant code from a merge
* Adding per_layer_inputs to TextModel
* Adding preprocess embeddings with altup
* Adds per-layer-to-single output and a host of TODOs
* Integrating altup predict with the model workflow and other minor bug fixes
* Using nn.Embedding temporarily for text model
* It goes forward
* Minor refactor of attention sparsity and RoPE initialization
* Fixing duplicate rope_scaling param bug when loading from pretrained
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Normalizing on altup_num_inputs config option
* Adding audio encoder config
* Adds high-level components for Audio Encoder
* Implement uniform reducer for Audio Encoder
* Adding placeholders for Conformer components in Audio Encoder
* Adding placeholders for SubSampleConvProjection components in Audio Encoder
* Adding SequenceLayer component placeholders
* Implementing Gemma3nAudioEncoder with nn.Sequential
* Implementing Gemma3nAudioSubSampleConvProjection with nn.Sequential
* Implementing Conformer model with SequenceLayers
* Use OrderedDict in nn.Sequential initializers
* Implements sl.Residual in Torch with nn.Sequential and OrderedDict
* Adopting a base SequenceLayer class with default forward() method
* Implementing sl.GatedLinearUnit in Torch
* Implementing sl.Swish in Torch
* Implementing sl.ReLU in Torch
* Implementing sl.Scale in Torch
* Removing sl.Dropout after tree-shaking
* Implementing sl.RMSNorm in Torch with fake shape
* Implementing sl.GroupNorm in Torch
* Implementing sl.Conv2d in Torch
* Implementing sl.Dense in Torch
* Removing sl.Delay layers, which act as pass-throughs
* Connecting shapes to configs in initializers
* Removing sl.Emit
* Implementing sl.ExpandDims in Torch
* Adding sl.GradientClipping to Torch
* Implementing sl.DenseShaped in Torch
* Implementing sl.LDPA in Torch
* Removing unused sl.CombinedQKVProj class
* Fixing erroneous type hint
* Implemnenting sl.DepthwiseConv1D in Torch
* Implementing sl.MaskInvalid in Torch
* Fixes for initialization
* Fixes for saving weights
* Removing einsums per feedback from HF staff
* Removing Sequence Layers idioms from audio encoder
* Fixes for reviewer comments
* CausalLM conversion script for 4B model
* inv_timescales to non-persistent buffer
* Addressing audio encoder Attention feedback
* Addressing Gemma3nAudioSSCPConvBlock feedback
* Addressing Gemma3nAudioConformerAttention feedback
* Addressing padding feedback
* Weights conversion loads audio state dict
* Always use vision_config so saving works
* Token id updates for configs
* Stubs for interleaving audio embs
* Addressing reviewer feedback
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
* Fixing cache access error
* Removing duplicate code from a bad merge
* Gemma 3n Text + Vision Part 1 (#17)
* testing utilities for numerics comparisons
* Corrected einsum to nn.Linear weights conversion
* Inherit scaled word embs from Gemma3 not Bart
* Fixing transposes for collapsed linears
* More transpose fixes
* numpy api fix
* RMSNorm: Explicit kwargs, scale_shift=0.0 when with_scale=True
* Force AltUp to float32
* Updating debugging script for AudioEncoder debugging
* Support divide_weight_by_sqrt_fan_in from JAX for per-layer inputs
* Correcting attention einsum conversions
* RMSNorm in type of x
* Fixing douplicate laurel norm/gating
* KV sharing using the right previous indices
* Refactor kv shared index computation. Correct frac_shared_layers
* Use num_shared_layers instead of inferring from a fraction
* fixing a bug for logging
* Fix shared data_ptrs in altup inits
* rope: adjust proj -> norm -> rope to preserve computation (#20)
* rope: adjust proj -> norm -> rope to preserve computation
* Removing some breaking language model fluff in ConditionalGeneration
* Consolidate query_states transforms
---------
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Vectorize the loops in AltUp (#19)
* Vectorize the loops in AltUp
* fix typo
* Expanding to support batched inputs
* remove extra debug script
* Fix AltUp.forward
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Add 'scale_shift=0.0, with_scale=True' to the final norm in TextModel
* Convert norm to 1/sqrt (#21)
* Convert norm to 1/sqrt
* Scale shift change per Phil's rec
* Adding default activation sparsity
* Fixing 2B config in weights conversion script
* Fixing RMSNorm parameters - adding scale_shift and with_scale
* Correcting query pre-attention scaling
* Adding query_rescale_scalar to text config
* Adding layer_idx to MLP
* Permafix for input_layernorm
* Use 1/sqrt instead of rsqrt in DecoderLayer
* Fix o_proj conversion
* Conversion script update for vision encoder
* Removing logging for debugging timm model
* Fixing bugs in Gemma3nForConditionalGeneration for text generation
* Generating the modeling_gemma3n.py file
* Removing the addition of an erroneous line in the modeling file
* Adding gemma3n text model to modeling_auto
* Bugfix: Updating the interleaving of inputs_embeds and vision_embeds
* Updating the modeling file with the latest bugfix changes
* Updating models/auto for Gemma 3n
* using AutoTokenizer in forward test
* Adding processing_gemma3n.py
* Gemma 3n configured for AutoModel. Conversion script updated.
* Removing errant merge artifacts
---------
Co-authored-by: Mayank Chaturvedi <imayank@google.com>
Co-authored-by: Douglas Reid <douglas-reid@users.noreply.github.com>
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
* Removing errant debugging statements from Gemma 3
* Gemma3n audio model (#18)
* testing utilities for numerics comparisons
* Implement CumulativeGroupNorm and add to SubSampleConvProjection and SSCPConvBlock
* Add audio version of forward script based on RyanMullins' implementation
* Updating to match encoder tests. WIP: config question needs resolving
* Updates to audio classes to enable end-to-end running
* Removing vestigial classes, cleaning up print statements
* Adding SiLU / Swish to audio conformer feed forward block
* Shifted Gemma3p5Audio naming prefix to Gemma3NanoAudio
* Adding outputs to audio test
* Fixes to padding in SSCP and 1D convolution, align RMS Norm with wider model
* Update forward test to load from local weights
* Update conversion to process / output audio layers
* Update __all__ to export audio encoder
* AutoModel registration for Gemma 3n Audio
* Use AutoModel for ConditionalGeneration.audio_tower
* Fixing input_proj_linear transpose
* Fixing Gemma3NanoAudioConformerAttention.post conversion
* Fixing Gemma3NanoAudioSSCPConvBlock.conv weights conversion
* Correcting indentation issue on Gemma3p5RMSNorm
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Text + Vision Part 2 (#23)
* Updates for ConditionalGeneration.get_image_features
* Adding a WIP draft of image_processing_gemma3p5.py
* Update src/transformers/models/gemma3p5/modular_gemma3p5.py
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Modular conversion after github suggested change
* Text + image gives good results
* Fixing image size preset
* Updating configs for the 2B variant in the conversion script
* Using final generation config in conversion script
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Audio Integration (#12)
* initial commit of Gemma 3n scaffold
* Fixing param pass through on Gemm3nRMSNorm
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Adds AltUp to Gemma 3n
* Adding Gemma 3n overall and text config with vision and audio config placeholders (#3)
* Adding Gemma 3n text configs
* Adding audio config placeholders
* Adding a placeholder for vision configs
* Updating MobileNetVisionConfig, inheriting TimmWrapperConfig
* Updating text configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Removing altup configs to accept the suggested configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating altup config
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Addressing review comments and updating text configs
* Adding a config for activation sparsity
* Updating configs to pass through options to super class init and adjust some name prefixes
* Updating laurel and altup with corrected config values
* Normalizing sub_config initializers
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating MLP with activation sparsity (#2)
* Updating DecoderBlock for Gemma 3n (#3)
* Initial Gemma3nTextModel (#4)
NOTE: This implementation WILL CHANGE in the coming weeks, however, changes will be strictly additive and this will remain a suitable baseline for downstream implementations to reference.
* Adding KV Cache Sharing
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Refactored kv cache sharing in attention
* Adding KVStore for cache sharing
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update src/transformers/cache_utils.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Updating KV Cache Sharing implementation
* Updating the q and k norm definitions in the attention module
* Fixing name error for q,k,v RMS norm to use the right 3n module
* Updating MLP with activation sparsity
* Updating DecoderBlock for Gemma 3n
* Updating kv cache sharing implementation with the use of a cache buffer and refactoring some lines of code
* Isolating KV Cache logic to relevant components
* Fixing logic error in Gemma3nAttention.forward
* Refactoring caching contributions and fixing kv_store initialization
* Simplifying Configs
* Remove errant self from super init call
* Bug fix in the Attention module - changing self.head_dim to config.head_dim
* Bug fixes in the LaurelBlock and RMS Norm super init call
* removing redundant code from a merge
* Adding per_layer_inputs to TextModel
* Adding preprocess embeddings with altup
* Adds per-layer-to-single output and a host of TODOs
* Integrating altup predict with the model workflow and other minor bug fixes
* Using nn.Embedding temporarily for text model
* It goes forward
* Minor refactor of attention sparsity and RoPE initialization
* Fixing duplicate rope_scaling param bug when loading from pretrained
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Normalizing on altup_num_inputs config option
* Adding audio encoder config
* Adds high-level components for Audio Encoder
* Implement uniform reducer for Audio Encoder
* Adding placeholders for Conformer components in Audio Encoder
* Adding placeholders for SubSampleConvProjection components in Audio Encoder
* Adding SequenceLayer component placeholders
* Implementing Gemma3nAudioEncoder with nn.Sequential
* Implementing Gemma3nAudioSubSampleConvProjection with nn.Sequential
* Implementing Conformer model with SequenceLayers
* Use OrderedDict in nn.Sequential initializers
* Implements sl.Residual in Torch with nn.Sequential and OrderedDict
* Adopting a base SequenceLayer class with default forward() method
* Implementing sl.GatedLinearUnit in Torch
* Implementing sl.Swish in Torch
* Implementing sl.ReLU in Torch
* Implementing sl.Scale in Torch
* Removing sl.Dropout after tree-shaking
* Implementing sl.RMSNorm in Torch with fake shape
* Implementing sl.GroupNorm in Torch
* Implementing sl.Conv2d in Torch
* Implementing sl.Dense in Torch
* Removing sl.Delay layers, which act as pass-throughs
* Connecting shapes to configs in initializers
* Removing sl.Emit
* Implementing sl.ExpandDims in Torch
* Adding sl.GradientClipping to Torch
* Implementing sl.DenseShaped in Torch
* Implementing sl.LDPA in Torch
* Removing unused sl.CombinedQKVProj class
* Fixing erroneous type hint
* Implemnenting sl.DepthwiseConv1D in Torch
* Implementing sl.MaskInvalid in Torch
* Fixes for initialization
* Fixes for saving weights
* Removing einsums per feedback from HF staff
* Removing Sequence Layers idioms from audio encoder
* Fixes for reviewer comments
* Converting sl.Frontend to FeatureExtractor
* Updates for ConditionalGeneration.get_image_features
* Adding a WIP draft of image_processing_gemma3n.py
* Update modular
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Modular conversion after github suggested change
* Text + image gives good results
* Fixing image size preset
* Draft of audio data in chat template
* Removing image processing. Using SigLIP instead.
* Audio input going end-to-end
* Fixing dtype issues in audio encoder
* x-lib formatting consistency
* Adding example data
* Save preprocessor_config.json from conversion script
* Instrumentaiton for debugging
* Additional instrumentation for preprocessing debugging
* Updates to preprocessor, padding; produces correct end-to-end results on sample
* Tackling configuraiton TODOs
* Start of feature extractor refatcor
* Adds Numpy version of USM extractor, removes Torch version and dependencies
* Fixing AltUp.correct coef permute
* Supporting batches of single audio segment inputs
* Docstrings updates for config
* In-lining audio feature extraction
* Adjustments to conversion script and smoke test script
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: pculliton <phillipculliton@gmail.com>
* Gemma 3n renaming
* Removing test data and utilities
* Renaming test files
* Gemma 3n refactor
* Fix tokenizer config in conversion script
* Address reviewer feedback
* FeatureExtractor returns float32 by default
* Adding basic tests for audio, and input name for audio encoder
* Audio integration test, updates to model_id for other integration tests
* Use scales for q and k norms (#26)
* Update audio integration test to use HF dataset
* Reviewer feedback
* Expand embedding table to full vocab size in weights conversion
* Mix-n-match MatFormers for Gemma 3n (#25)
* Remove in-place operations (#30)
* chore: removing inplace ops
* remove [tensor] * n pattern
* chore: reviewer feedback in AudioEncoder and AltUp
* More grad clipping
* Dynamo compatibility
* fix: cache slicing error
* chore: simplify shared kv cache slicing
* chore: vision encoder rename in timm
* fix: image processor do_normalize=False
* fixup: style
* chore: model_doc
* fix: docs for code quality
* chore: repo consistency
* fix: RMSNorm in float as in prior Gemmas
* fix: per_layer_inputs = None
* chore: Gemma3nForCausalLM from Gemma3nForConditionalGeneration checkpoint
* chore: repo consistency
* Add initial unit tests for Gemma3nAudioFeatureExtractor (#27)
* Add initial unit tests for Gemma3nAudioFeatureExtractor
* Add basic unit tests for Gemma3nProcessor (#28)
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
* parameterize tests
---------
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
* chore: code style
* fix: test cases
* style and consistency
* fix config in the test to be coherent with layer cache sharing
* fix hidden states in tests and code
* inits and mappings
* fix modality prefixes
* test order and prefixes
* fix test exception
* fix class order and reduce model size for faster tests
* restore _checkpoint_conversion_mapping to load Caual from Conditional
* fix config mapping!
* fix: reviewer feedback
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: Mayank Chaturvedi <imayank@google.com>
Co-authored-by: Douglas Reid <douglas-reid@users.noreply.github.com>
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: pculliton <phillipculliton@gmail.com>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* fix import test
* add model args
* auto_docstring
* replace test path
* consistency
* skip tests for now
* fix docstring for doc builder
* skip unused attr
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: Mayank Chaturvedi <imayank@google.com>
Co-authored-by: Douglas Reid <douglas-reid@users.noreply.github.com>
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: pculliton <phillipculliton@gmail.com>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>
* rm tf/flax tests
* more flax deletions
* revert fixture change
* reverted test that should not be deleted; rm tf/flax test
* revert
* fix a few add-model-like tests
* fix add-model-like checkpoint source
* a few more
* test_get_model_files_only_pt fix
* fix test_retrieve_info_for_model_with_xxx
* fix test_retrieve_model_classes
* relative paths are the devil
* add todo
* handle long form generation
* add warning
* correct incorrect in place token change
* update test to catch edge case
* make style
* update warning
* add doc
* Image processor compile fix (#38540)
* Added a compile-friendly versiom of resize to BaseImgProcessorFast
* Changed qwen2 processor to use its parent class .resize
* Style
* underlined issue only happens on AMD w/ comment and bool check
* Fixed some utils functions
* Fixed the same issue for bridgetower
* Fixed the same issue for llava_next
* Repo consistency for llava onevision
* Update src/transformers/image_processing_utils_fast.py
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
* Added an Expectation to an internvl test
* Made qwen2_vl use the resize method of its parent clas
* Changed to torch.where
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
* add dia model
* add tokenizer files
* cleanup some stuff
* brut copy paste code
* rough cleanup of the modeling code
* nuke some stuff
* more nuking
* more cleanups
* updates
* add mulitLayerEmbedding vectorization
* nits
* more modeling simplifications
* updates
* update rope
* update rope
* just fixup
* update configuration files
* more cleanup!
* default config values
* update
* forgotten comma
* another comma!
* update, more cleanups
* just more nits
* more config cleanups
* time for the encoder
* fix
* sa=mall nit
* nits
* n
* refacto a bit
* cleanup
* update cv scipt
* fix last issues
* fix last nits
* styling
* small fixes
* just run 1 generation
* fixes
* nits
* fix conversion
* fix
* more fixes
* full generate
* ouf!
* fixes!
* updates
* fix
* fix cvrt
* fixup
* nits
* delete wrong test
* update
* update
* test tokenization
* let's start changing things bit by bit - fix encoder step
* removing custom generation, moving to GenerationMixin
* add encoder decoder attention masks for generation
* mask changes, correctness checked against ad29837 in dia repo
* refactor a bit already --> next cache
* too important not to push :)
* minimal cleanup + more todos
* make main overwrite modeling utils
* add cfg filter & eos filter
* add eos countdown & delay pattern
* update eos countdown
* add max step eos countdown
* fix tests
* fix some things
* fix generation with testing
* move cfg & eos stuff to logits processor
* make RepetitionPenaltyLogitsProcessor flexible
- can accept 3D scores like (batch_size, channel, vocab)
* fix input_ids concatenation dimension in GenerationMixin for flexibility
* Add DiaHangoverLogitsProcessor and DiaExponentialDecayLengthPenalty classes; refactor logits processing in DiaForConditionalGeneration to utilize new configurations and improve flexibility.
* Add stopping criteria
* refactor
* move delay pattern from processor to modeling like musicgen.
- add docs
- change eos countdown to eos delay pattern
* fix processor & fix tests
* refactor types
* refactor imports
* format code
* fix docstring to pass ci
* add docstring to DiaConfig & add DiaModel to test
* fix docstring
* add docstring
* fix some bugs
* check
* porting / merging results from other branch - IMPORTANT: it very likely breaks generation, the goal is to have a proper forward path first
* experimental testing of left padding for first channel
* whoops
* Fix merge to make generation work
* fix cfg filter
* add position ids
* add todos, break things
* revert changes to generation --> we will force 2d but go 3d on custom stuff
* refactor a lot, change prepare decoder ids to work with left padding (needs testing), add todos
* some first fixes to get to 10. in generation
* some more generation fixes / adjustment
* style + rope fixes
* move cfg out, simplify a few things, more todos
* nit
* start working on custom logit processors
* nit
* quick fixes
* cfg top k
* more refactor of logits processing, needs a decision if gen config gets the new attributes or if we move it to config or similar
* lets keep changes to core code minimal, only eos scaling is questionable atm
* simpler eos delay logits processor
* that was for debugging :D
* proof of concept rope
* small fix on device mismatch
* cfg fixes + delay logits max len
* transformers rope
* modular dia
* more cleanup
* keep modeling consistently 3D, generate handles 2D internally
* decoder starts with bos if nothing
* post processing prototype
* style
* lol
* force sample / greedy + fixes on padding
* style
* fixup tokenization
* nits
* revert
* start working on dia tests
* fix a lot of tests
* more test fixes
* nit
* more test fixes + some features to simplify code more
* more cleanup
* forgot that one
* autodocs
* small consistency fixes
* fix regression
* small fixes
* dia feature extraction
* docs
* wip processor
* fix processor order
* processing goes brrr
* transpose before
* small fix
* fix major bug but needs now a closer look into the custom processors esp cfg
* small thing on logits
* nits
* simplify indices and shifts
* add simpler version of padding tests back (temporarily)
* add logit processor tests
* starting tests on processor
* fix mask application during generation
* some fixes on the weights conversion
* style + fixup logits order
* simplify conversion
* nit
* remove padding tests
* nits on modeling
* hmm
* fix tests
* trigger
* probably gonna be reverted, just a quick design around audio tokenizer
* fixup typing
* post merge + more typing
* initial design for audio tokenizer
* more design changes
* nit
* more processor tests and style related things
* add to init
* protect import
* not sure why tbh
* add another protect
* more fixes
* wow
* it aint stopping :D
* another missed type issue
* ...
* change design around audio tokenizer to prioritize init and go for auto - in regards to the review
* change to new causal mask function + docstrings
* change ternary
* docs
* remove todo, i dont think its essential tbh
* remove pipeline as current pipelines do not fit in the current scheme, same as csm
* closer to wrapping up the processor
* text to audio, just for demo purposes (will likely be reverted)
* check if it's this
* save audio function
* ensure no grad
* fixes on prefixed audio, hop length is used via preprocess dac, device fixes
* integration tests (tested locally on a100) + some processor utils / fixes
* style
* nits
* another round of smaller things
* docs + some fixes (generate one might be big)
* msytery solved
* small fix on conversion
* add abstract audio tokenizer, change init check to abstract class
* nits
* update docs + fix some processing :D
* change inheritance scheme for audio tokenizer
* delete dead / unnecessary code in copied generate loop
* last nits on new pipeline behavior (+ todo on tests) + style
* trigger
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Vasqu <antonprogamer@gmail.com>
* ensure the query is updated during training
avoid unused parameters that DDP does not like
* avoid a crash when `kwargs` contain `padding=True`
trainers often pass this argument automatically
* minor
* Remove mel_spec lazy init, and rename to mel_filters.
this ensures save_pretrained will not crash when saving the processor during training
d5d007a1a0/src/transformers/feature_extraction_utils.py (L595)
* minor - most feature extractors has a `sampling_rate` property
* speedup relative position embeddings
* fix several issues in model saving/loading:
- avoid modifying `self._hf_peft_config_loaded` when saving
- adapter_config automatically points to the original base model - a finetuned version should point to the model save dir.
- fixing model weights names, that are changed by adding an adapter.
* minor
* minor
* minor
* fixing a crash without peft active
* add todo to replace einsum
* remove trust_remote_code
* again
* Revert "Skip some tests for now (#38931)"
This reverts commit 31d30b72245aacfdf70249165964b53790d9c4d8.
* again
* style
* again
* again
* style
* fix integration test
* fix tests
* style
* fix
* fix
* fix the last ones
* style
* last one
* fix last
* fix
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* fix: astronomical loss with ModernBERT when using gradient checkpointing
* update the modling fix
---------
Co-authored-by: Arthur <arthur.zucker@gmail.com>
* Support `flash_attn_3`
Implements fwd and tests for Flash Attention 3 https://github.com/Dao-AILab/flash-attention/commits/main/hopper
- Includes checks for dropout>0 and ALiBi in `modeling_utils.PreTrainedModel._check_and_enable_flash_attn_3` (Dropout will likely be supported soon, so this will need to be updated and `modeling_flash_attention_utils._flash_attention_forward` at the `if _IS_FLASH_ATTN_3_AVAILABLE: ...`
An example Llama implementation is included in `modeling_llama.py` but other models would still need to be updated
Based on https://github.com/huggingface/transformers/pull/36190 which has model implementations and examples which could be merged
* Add tests for Flash Attention 2 and 3 parity
* ci fix
* FA2 compatibiity
- `_prepare_flash_attention_from_position_ids` ->`prepare_fa2_from_position_ids`
- Remove bettertransformer check in Flash Attention 3
- Merge tests
- Add licensing
* ci fix
* Test naming consistency
* ci fix
* Deprecation warning for `prepare_fa2_from_position_ids`
* ci fix
* Initial submit
* Fix bugs:
1. add __init__ file
2. tied word embedding
3. support flash/flex attention
4. model saving and loading
* Code refactor:
* Rename encdecgemma to t5gemma.
* Split attention into self- and cross-attention
* Split stack into encoder and decoder
* Add test cases
* Add auto configuration
* Update configurations.
* Fix bugs related to copy and attribute checks
* Fix type union
* Fix merge errors
* run ruff format
* Run make style and update tests.
* Add t5gemma model doc.
* ruff and style formatting.
* Add missed module config.
* Add dummy checkpoint link to pass tests (need updated when real checkpoints are uplioaded.).
* Update model doc.
* Minor updates following Arthur's comments:
* replace docstrings with auto_docstrings
* remove checkpoint layers
* remove deprecate_kwargs
* fix rebase errors
* Fix docstring issues.
* fix t5gemma doc issue.
* run ruff format
* Updates:
* split encoder-only model out
* make t5gemmamodel encoder-decoder only
* update token and sequence classification
* update tests
* don't move the whole video to GPU
* add torchcodec
* add tests
* make style
* instrucblip as well
* consistency
* Update src/transformers/utils/import_utils.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/utils/import_utils.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/video_utils.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Fix graph break in torch.compile when using FA2 with attention_mask=None and batch size > 1
* fix code format
* add test; replace position_ids with query_states becasue position_ids.shape[0] is always 1
* add assert loss is not nan
* Add zero dim tensor check when using flash_attention
Signed-off-by: ranzhejiang <zhejiang.ran@intel.com>
* Add zero dim tensor check when using flash_attention
Signed-off-by: ranzhejiang <zhejiang.ran@intel.com>
---------
Signed-off-by: ranzhejiang <zhejiang.ran@intel.com>
* Add Hugging Face authentication procedure for IDEs (PyCharm, VS Code, etc.)
* Update quicktour.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* ensure the query is updated during training
avoid unused parameters that DDP does not like
* avoid a crash when `kwargs` contain `padding=True`
trainers often pass this argument automatically
* minor
* Remove mel_spec lazy init, and rename to mel_filters.
this ensures save_pretrained will not crash when saving the processor during training
d5d007a1a0/src/transformers/feature_extraction_utils.py (L595)
* minor - most feature extractors has a `sampling_rate` property
* Add Arcee model support to transformers
- Add ArceeConfig and model mappings for all task types (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
- Add auto-loading support through AutoModel, AutoConfig, and AutoTokenizer
- Use LlamaTokenizer for tokenization
- Add FX graph support for Arcee models
- Create lazy loading module structure for Arcee
* feat: update YARN scaling and RoPE validation for Arcee model
* feat: add auto_docstring checkpoint config to Arcee model classes
* docs: add pre-trained model weights reference to Arcee configuration files
* refactor: move RoPE utilities to dedicated modeling_rope_utils module
* Add comprehensive test suite for Arcee model
- Add test_modeling_arcee.py following standard transformers test patterns
- Include tests for all model variants (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
- Add specific test for ReLU² activation in ArceeMLP
- Add RoPE scaling tests including YARN support
- Follow CausalLMModelTest pattern used by similar models
* Add documentation for Arcee model
- Add comprehensive model documentation with usage examples
- Include all model variants in autodoc
- Add to table of contents in proper alphabetical order
- Fixes documentation coverage for Arcee model classes
* Make style/fixup
* fix copyright year
* Sync modular conversion
* revert in legacy supported models in src/transformers/utils/fx
* cleaned redundant code in modular_arcee.py
* cleaned testing
* removed pretraining tp
* fix styles
* integration testing
---------
Co-authored-by: Pranav <veldurthipranav@gmail.com>
Co-authored-by: Pranav <56645758+pranav4501@users.noreply.github.com>
* some fixes
* some fixes
* now the pipeline can take list of tokens as input and is_split_into_words argument
* now the pipeline can take list of tokens as input and is_split_into_words argument
* now the pipeline can take list of tokens as input and is_split_into_words argument and we can handle batches of tokenized input
* now the pipeline can take list of tokens as input and is_split_into_words argument and we can handle batches of tokenized input
* solving test problems
* some fixes
* some fixes
* modify tests
* aligning start and end correctly
* adding tests
* some formatting
* some formatting
* some fixes
* some fixes
* some fixes
* resolve conflicts
* removing unimportant lines
* removing unimportant lines
* generalize to other languages
* generalize to other languages
* generalize to other languages
* generalize to other languages
* fix: add __bool__ operator to tokenizer to avoid bloated asserts
When a user does 'assert tokenizer' to ensure that the tokenizer is not None, they inadvertently set off a rather expensive process in the '__len__()' operator. This fix adds a trivial '__bool__()' that returns True, so that a None tokenizer asserts and an actual tokenizer returns True when asserted, without calling length op.
* typo
* add working idefics2 fast and improvements for fast nested images processing
* add fast image processors idefics 3 and smolvlm
* cleanup tests
* fic doc idefics2
* PR review and fix issues after merge
* Force providing disable_grouping to group_images_by_shape
* simplify group_images_by_shape
* fix modular
* Fix nits after review
* Fix(time_series): Correct scaler tensor shape in base model
The create_network_inputs function in TimeSeriesTransformerModel
handled the scaler's loc and scale tensors inconsistently.
When input_size=1, the tensors were not squeezed, leading to
downstream dimension errors for models like Informer.
This commit refactors the logic to unconditionally apply .squeeze(1),
which correctly handles all input_size cases and fixes the bug at its source.
Fixes#38745
* Fix(time_series): Correct scaler tensor shape in base model
The create_network_inputs function in TimeSeriesTransformerModel
handled the scaler's loc and scale tensors inconsistently.
When input_size=1, the tensors were not squeezed, leading to
downstream dimension errors for models like Informer.
This commit refactors the logic to unconditionally apply .squeeze(1),
which correctly handles all input_size cases and fixes the bug at its source.
Fixes#38745
---------
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* remove it everywhere
* Update trainer_pt_utils.py
* Update trainer_pt_utils.py
* style
* sort list in test
* CIs
* use recursion same way as before (for intermediate layer names)
* feat: add flexible Liger Kernel configuration to TrainingArguments
Add support for granular Liger Kernel configuration through a new
`liger_kernel_config` parameter in TrainingArguments. This allows users
to selectively enable/disable specific kernels (rope, swiglu, cross_entropy,
etc.) instead of the current approach that rely on default configuration.
Features:
- Add `liger_kernel_config` dict parameter to TrainingArguments
- Support selective kernel application for all supported models
- Maintain full backward compatibility with existing `use_liger_kernel` flag
Example usage:
```python
TrainingArguments(
use_liger_kernel=True,
liger_kernel_config={
"rope": True,
"swiglu": True,
"cross_entropy": False,
"fused_linear_cross_entropy": True
}
)
Closes#38905
* Address comments and update Liger section in Trainer docs
* we need to check against mapping to be safe
* need to check only when inferring from image type, otherwise messes custom code
---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* log: Add logging when user uses split_batches and per_device_train_batch_size
* refactor: remove whitespace from blank line
* Update src/transformers/training_args.py
Change logging level to info
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Fix HQQ model param device transfer issue
* modify a comment
* clear the code and add test for hqq device/dtype
* fix test hqq code quality of imports
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Correctly fix init
Co-authored-by: BUI Van Tuan <buivantuan07@gmail.com>
* add back the block, breaking BC but this is correct author's code
* override the test for params needing it
---------
Co-authored-by: BUI Van Tuan <buivantuan07@gmail.com>
* No more Tuple, List, Dict
* make fixup
* More style fixes
* Docstring fixes with regex replacement
* Trigger tests
* Redo fixes after rebase
* Fix copies
* [test all]
* update
* [test all]
* update
* [test all]
* make style after rebase
* Patch the hf_argparser test
* Patch the hf_argparser test
* style fixes
* style fixes
* style fixes
* Fix docstrings in Cohere test
* [test all]
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Moved the sources to the right
* small Changes
* Some Changes to moonshine
* Added the install to pipline
* updated the monshine model card
* Update docs/source/en/model_doc/moonshine.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/moonshine.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/moonshine.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/moonshine.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Updated Documentation According to changes
* Fixed the model with the commits
* Changes to the roc_bert
* Final Update to the branch
* Adds Quantizaiton to the model
* Finsihed Fixing the Roc_bert docs
* Fixed Moshi
* Fixed Problems
* Fixed Problems
* Fixed Problems
* Fixed Problems
* Fixed Problems
* Fixed Problems
* Added the install to pipline
* updated the monshine model card
* Update docs/source/en/model_doc/moonshine.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/moonshine.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/moonshine.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Updated Documentation According to changes
* Fixed the model with the commits
* Fixed the problems
* Final Fix
* Final Fix
* Final Fix
* Update roc_bert.md
---------
Co-authored-by: Your Name <sohamprabhu@Mac.fios-router.home>
Co-authored-by: Your Name <sohamprabhu@Sohams-MacBook-Air.local>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* init
* chore: various changes to LightGlue
* chore: various changes to LightGlue
* chore: various changes to LightGlue
* chore: various changes to LightGlue
* Fixed dynamo bug and image padding tests
* refactor: applied refactoring changes from SuperGlue's concat, batch and stack functions to LightGlue file
* tests: removed sdpa support and changed expected values
* chore: added some docs and refactoring
* chore: fixed copy to superpoint.image_processing_superpoint.convert_to_grayscale
* feat: adding batch implementation
* feat: added validation for preprocess and post process method to LightGlueImageProcessor
* chore: changed convert_lightglue_to_hf script to comply with new standard
* chore: changed lightglue test values to match new lightglue config pushed to hub
* chore: simplified convert_lightglue_to_hf conversion map
* feat: adding batching implementation
* chore: make style
* feat: added threshold to post_process_keypoint_matching method
* fix: added missing instructions that turns keypoints back to absolute coordinate before matching forward
* fix: added typehint and docs
* chore: make style
* [run-slow] lightglue
* fix: add matches different from -1 to compute valid matches in post_process_keypoint_matching
* tests: added CUDA proof tests similar to SuperGlue
* chore: various changes to modeling_lightglue.py
- Added "Copies from" statements for copied functions from modeling_superglue.py
- Added missing docstrings
- Removed unused functions or classes
- Removed unnecessary statements
- Added missing typehints
- Added comments to the main forward method
* chore: various changes to convert_lightglue_to_hf.py
- Added model saving
- Added model reloading
* chore: fixed imports in lightglue files
* [run-slow] lightglue
* chore: make style
* [run-slow] lightglue
* Apply suggestions from code review
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* [run-slow] lightglue
* chore: Applied some suggestions from review
- Added missing typehints
- Refactor "cuda" to device variable
- Variable renaming
- LightGlue output order changed
- Make style
* fix: added missing grayscale argument in image processor in case use of SuperPoint keypoint detector
* fix: changed lightglue HF repo to lightglue_superpoint with grayscale default to True
* refactor: make keypoints `(batch_size, num_keypoints, keypoint_dim)` through forward and unsqueeze only before attention layer
* refactor: refactor do_layer_keypoint_pruning
* tests: added tests with no early stop and keypoint pruning
* refactor: various refactoring to modeling_lightglue.py
- Removed unused functions
- Renamed variables for consistency
- Added comments for clarity
- Set methods to private in LightGlueForKeypointMatching
- Replaced tensor initialization to list then concatenation
- Used more pythonic list comprehension for repetitive instructions
* refactor: added comments and renamed filter_matches to get_matches_from_scores
* tests: added copied from statement with superglue tests
* docs: added comment to prepare_keypoint_matching_output function in tests
* [run-slow] lightglue
* refactor: reordered _concat_early_stopped_outputs in LightGlue class
* [run-slow] lightglue
* docs: added lightglue.md model doc
* docs: added Optional typehint to LightGlueKeypointMatchingOutput
* chore: removed pad_images function
* chore: set do_grayscale default value to True in LightGlueImageProcessor
* Apply suggestions from code review
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Apply suggestions from code review
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* docs: added missing LightGlueConfig typehint in nn.Module __init__ methods
* docs: removed unnecessary code in docs
* docs: import SuperPointConfig only from a TYPE_CHECKING context
* chore: use PretrainedConfig arguments `num_hidden_layers` and `num_attention_heads` instead of `num_layers` and `num_heads`
* chore: added organization as arg in convert_lightglue_to_hf.py script
* refactor: set device variable
* chore: added "gelu" in LightGlueConfig as hidden_act parameter
* docs: added comments to reshape.flip.reshape instruction to perform cross attention
* refactor: used batched inference for keypoint detector forward pass
* fix: added fix for SDPA tests
* docs: fixed docstring for LightGlueImageProcessor
* [run-slow] lightglue
* refactor: removed unused line
* refactor: added missing arguments in LightGlueConfig init method
* docs: added missing LightGlueConfig typehint in init methods
* refactor: added checkpoint url as default variable to verify models output only if it is the default url
* fix: moved print message inside if statement
* fix: added log assignment r removal in convert script
* fix: got rid of confidence_thresholds as registered buffers
* refactor: applied suggestions from SuperGlue PR
* docs: changed copyright to 2025
* refactor: modular LightGlue
* fix: removed unnecessary import
* feat: added plot_keypoint_matching method to LightGlueImageProcessor with matplotlib soft dependency
* fix: added missing import error for matplotlib
* Updated convert script to push on ETH org
* fix: added missing licence
* fix: make fix-copies
* refactor: use cohere apply_rotary_pos_emb function
* fix: update model references to use ETH-CVG/lightglue_superpoint
* refactor: add and use intermediate_size attribute in config to inherit CLIPMLP for LightGlueMLP
* refactor: explicit variables instead of slicing
* refactor: use can_return_tuple decorator in LightGlue model
* fix: make fix-copies
* docs: Update model references in `lightglue.md` to use the correct pretrained model from ETH-CVG
* Refactor LightGlue configuration and processing classes
- Updated type hints for `keypoint_detector_config` in `LightGlueConfig` to use `SuperPointConfig` directly.
- Changed `size` parameter in `LightGlueImageProcessor` to be optional.
- Modified `position_embeddings` in `LightGlueAttention` and `LightGlueAttentionBlock` to be optional tuples.
- Cleaned up import statements across multiple files for better readability and consistency.
* refactor: Update LightGlue configuration to enforce eager attention implementation
- Added `attn_implementation="eager"` to `keypoint_detector_config` in `LightGlueConfig` and `LightGlueAttention` classes.
- Removed unnecessary logging related to attention implementation fallback.
- Cleaned up import statements for better readability.
* refactor: renamed message into attention_output
* fix: ensure device compatibility in LightGlueMatchAssignmentLayer descriptor normalization
- Updated the normalization of `m_descriptors` to use the correct device for the tensor, ensuring compatibility across different hardware setups.
* refactor: removed Conv layers from init_weights since LightGlue doesn't have any
* refactor: replace add_start_docstrings with auto_docstring in LightGlue models
- Updated LightGlue model classes to utilize the new auto_docstring utility for automatic documentation generation.
- Removed legacy docstring handling to streamline the code and improve maintainability.
* refactor: simplify LightGlue image processing tests by inheriting from SuperGlue
- Refactored `LightGlueImageProcessingTester` and `LightGlueImageProcessingTest` to inherit from their SuperGlue counterparts, reducing code duplication.
- Removed redundant methods and properties, streamlining the test setup and improving maintainability.
* test: forced eager attention implementation to LightGlue model tests
- Updated `LightGlueModelTester` to include `attn_implementation="eager"` in the model configuration.
- This change aligns the test setup with the recent updates in LightGlue configuration for eager attention.
* refactor: update LightGlue model references
* fix: import error
* test: enhance LightGlue image processing tests with setup method
- Added a setup method in `LightGlueImageProcessingTest` to initialize `LightGlueImageProcessingTester`.
- Included a docstring for `LightGlueImageProcessingTester` to clarify its purpose.
* refactor: added LightGlue image processing implementation to modular file
* refactor: moved attention blocks into the transformer layer
* fix: added missing import
* fix: added missing import in __all__ variable
* doc: added comment about enforcing eager attention because of SuperPoint
* refactor: added SuperPoint eager attention comment and moved functions to the closest they are used
---------
Co-authored-by: Steven Bucaille <steven.bucaille@buawei.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Earlier PR put executorch specific sdpa and mask function in the export function. This prevent any customization that can be done to sdpa, prior to export. By moving this to __init__, we still keep the original behavior but allow users like optimum-executorch to override sdpa by setting model.config._attn_implementation.
* fixing the problem align_to_words=True leading to duplicate solutions
* adding tests
* some fixes
* some fixes
* changing the handle_duplicate_answers=False by default
* some fixese
* some fixes
* make the duplicate handling the default behaviour and merge duplicates
* make the duplicate handling the default behaviour
* adding model and conversion scripts
* add imports to test vjepa conversion
* fix imports and make conversion work
* fix computation for short side
* replace attention with library attention function
* cleanup more attention classes
* remove config overrides
* add test cases, fix some of the failing ones
* fix the model outputs
* fix outputs of the model per review
* fix too big model test case
* fix styling __init__.py
* fix initialization test
* remove all asserts per review
* update sorting unsorting logic as per feedback
* remove is_video per review
* remove another is_video segment
* remove unwanted stuff
* small fixes
* add docstrings for the model
* revert adding vjepa2 config here
* update styling
* add config docstrings (wip)
* fix dpr issue
* removed test failing issues
* update styles
* merge predictor configs into main config
* remove processing code, add video processor
* remove permute which is not necessary now
* fix styles
* updated vjepa2 to be in video_processing_auto
* update comment for preprocessing
* test integration test and fix the outputs
* update test values, change test to look at repeated frames for a given image
* add a simple video processing test
* refactoring pixel_values_videos and upload ckpts to original
* fix torch_fx test cases
* remove unused config
* add all config docstrings
* add more integration tests
* add basic doc
* revert unwanted styling changes
* working make fixup
* Fix model_type in config
* Add ForVideoClassification model
* update attention implementation to fit new hf standards
* fix the preprocessing logic, ensure it matches the original model
* remove use_rope logic, cleanup
* fix docstrings
* Further cleanup, update doc
* Fix model prefix
* fix get_vision_features
* VJEPA2Embeddings style refactor
* nit, style comment
* change modules default values
* Only `str` activation in config
* GradientCheckpointingLayer
* fixup
* fix conversion script
* Remove return_dict
* remove None return typehint
* Refactor VJEPA2Layer, remove use_SiLU
* Fix fx tests
* dpr -> drop_path_rates
* move *ModelOutput on top
* format docs bit
* update docs
* update docs
* update doc example
* remove prune_heads from model
* remove unused config params
* refactor embed signature
* Add vjepa to docs
* Fix config docstring
* attention head
* update defaults
* Update docs/source/en/model_doc/vjepa2.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/model_doc/vjepa2.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Fix import
* Min refactoring
* Update HUB_SOURCE and HUB_REPO in conversion script
* Add missing headers
* VJEPA -> V-JEPA in docs
* Add image to doc
* fix style
* fix init weights
* change checkpoint name in modeling tests
* Initial cls head setup
* remove rop attention from head (not needed)
* remove swigluffn - not needed
* Add siglip layer
* Replace with siglip layer
* Rename Siglip - VJEPA2
* remove unused modules
* remove siglip mlp
* nit
* remove MLP
* Refactor head cross attention
* refactor VJEPA2HeadCrossAttentionLayer
* nit renaming
* fixup
* remove commented code
* Add cls head params to config
* depth from config
* move pooler + classifier to the model
* Update for cls model signature
* move layers, rename a bit
* fix docs
* update weights init
* remove typehint for init
* add to auto-mapping
* enable tests
* Add conversion script
* fixup
* add to docs
* fix docs
* nit
* refactor for mapping
* clean
* Add integration test
* Fixing multi gpu test
* update not-split-modules
* update video cls test tolerance
* Increase test_inference_image tolerance
* Update no-split modules for multi gpu
* Apply suggestions from code review
* fixing multi-gpu
* fix docstring
* Add cls snippet to docs
* Update checkpoint
* Refactor DBRX tests to use CausalLMModelTest base classes
- Changed DbrxModelTester to inherit from CausalLMModelTester
- Changed DbrxModelTest to inherit from CausalLMModelTest
- Removed duplicate methods that are already in base classes
- Added required class attributes for model classes
- Updated pipeline_model_mapping to include feature-extraction
- Kept DBRX-specific configuration and test methods
- Disabled RoPE tests as DBRX's rotary embedding doesn't accept config parameter
This refactoring reduces code duplication and follows the pattern established
in other causal LM model tests like Gemma.
* Apply style fixes
* Trigger tests
* Refactor DBRX test
* Make sure the DBRX-specific settings are handled
* Use the attribute_map
* Fix attribute map
---------
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Unbreak optimum-executorch
* use static cache if has layer_types but no sliding_window
* revert view on kv_arange
---------
Co-authored-by: Guang Yang <guangyang@fb.com>
* remove it from all py files
* remove it from the doc
* remove it from examples
* style
* remove traces of _fast_init
* Update test_peft_integration.py
* CIs
* apply updates smolVLM (still needs workaround for chat template)
* add other models
* dump qwen omni for now, come back later
* port qwen omni from their impl
* wait, all qwens sample videos in same way!
* clean up
* make smolvlm backwards compatible and fix padding
* dix some tests
* fox smolvlm tests
* more clean up and test fixing
* delete unused arg
* fix
* address comments
* style
* fix test
* chore(pixtral): emit block attention mask when using flash attention
Since flash_attention_2 relies solely on position_ids, emitting the block attention mask avoids unnecessary memory usage and prevents OOM on large inputs.
* remove unnecessary attention_mask assignment
* Update Pegasus model card
* Fix transformers-cli command
* Update code examples to use bfloat16
* Reverted code examples to use float16
* Fix typo, update checkpoints link
* Update str formatting in code examples
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Fix typo
* Remove inaccurate badges
* Revert badge removal
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Include cache_implementation argument in quantization example
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* adding model and conversion scripts
* add imports to test vjepa conversion
* fix imports and make conversion work
* fix computation for short side
* replace attention with library attention function
* cleanup more attention classes
* remove config overrides
* add test cases, fix some of the failing ones
* fix the model outputs
* fix outputs of the model per review
* fix too big model test case
* fix styling __init__.py
* fix initialization test
* remove all asserts per review
* update sorting unsorting logic as per feedback
* remove is_video per review
* remove another is_video segment
* remove unwanted stuff
* small fixes
* add docstrings for the model
* revert adding vjepa2 config here
* update styling
* add config docstrings (wip)
* fix dpr issue
* removed test failing issues
* update styles
* merge predictor configs into main config
* remove processing code, add video processor
* remove permute which is not necessary now
* fix styles
* updated vjepa2 to be in video_processing_auto
* update comment for preprocessing
* test integration test and fix the outputs
* update test values, change test to look at repeated frames for a given image
* add a simple video processing test
* refactoring pixel_values_videos and upload ckpts to original
* fix torch_fx test cases
* remove unused config
* add all config docstrings
* add more integration tests
* add basic doc
* revert unwanted styling changes
* working make fixup
* Fix model_type in config
* update attention implementation to fit new hf standards
* fix the preprocessing logic, ensure it matches the original model
* remove use_rope logic, cleanup
* fix docstrings
* Further cleanup, update doc
* Fix model prefix
* fix get_vision_features
* VJEPA2Embeddings style refactor
* nit, style comment
* change modules default values
* Only `str` activation in config
* GradientCheckpointingLayer
* fixup
* fix conversion script
* Remove return_dict
* remove None return typehint
* Refactor VJEPA2Layer, remove use_SiLU
* Fix fx tests
* dpr -> drop_path_rates
* move *ModelOutput on top
* format docs bit
* update docs
* update docs
* update doc example
* remove prune_heads from model
* remove unused config params
* refactor embed signature
* Add vjepa to docs
* Fix config docstring
* update defaults
* Update docs/source/en/model_doc/vjepa2.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/model_doc/vjepa2.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Fix import
* Min refactoring
* Update HUB_SOURCE and HUB_REPO in conversion script
* Add missing headers
* VJEPA -> V-JEPA in docs
* Add image to doc
* fix style
* fix init weights
* change checkpoint name in modeling tests
---------
Co-authored-by: Koustuv Sinha <koustuv.sinha@mail.mcgill.ca>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: Koustuv Sinha <koustuvsinha@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* fix: Add method to retrieve image features in PaliGemmaForConditionalGeneration
* feat: Add get_image_features method to multiple models for image feature extraction
* fix: reformat the files with ruff.
* feat: Add methods for packing and retrieving image and video features across multiple models
modified:
- modeling_chameleon.py
- modeling_llava_next.py
- modular_llava_next_video.py
- modeling_qwen2_vl.py
and generate the:
- modeling_llava_next_video.py
- modeling_llava_onevision.py
- modeling_qwen2_5_vl.py
* feat: Implement get_image_features method in Aria, Mistral3, and VipLlava models with updated parameters
* fix: reformatted the code with fix-style
* Created model card for xlm-roberta-xl
* Update XLM-RoBERTa-XL model card with improved descriptions and usage examples
* Minor option labeling fix
* Added MaskedLM version of XLM RoBERTa XL to model card
* Added quantization example for XLM RoBERTa XL model card
* minor fixes to xlm roberta xl model card
* Minor fixes to mask format in xlm roberta xl model card
* Update XLM-RoBERTa model documentation with enhanced usage examples and improved layout
* Added CLI command example and quantization example for XLM RoBERTa model card.
* Minor change to transformers CLI and quantization example for XLM roberta model card
* Created model card for XLM model
* Revised model card structure and content of XLM model
* Update XLM model documentation with improved examples and code snippets for predicting <mask> tokens using Pipeline and AutoModel.
* Fix typo in LLaVa documentation
In exactly one section, LlavaImageProcessor was spelt wrongly as LLavaImageProcessor, which throws off copy-pasting the section.
* Fix LlavaImageProcessor url to make it valid (and copypaste-able)
Earlier, the URL contained the entire HF prefix. This commit removes that to ensure that the code block can be copied and run as is.
* mlm_probability in DataCollatorForLanguageModeling should be validated only when mlm is True (#38522)
* Change mlm_probability to Optional in DataCollatorForLanguageModeling (#38537)
---------
Co-authored-by: eak <eak@ivalua.com>
* added fast image processor for ZoeDepth and expanded tests accordingly
* added fast image processor for ZoeDepth and expanded tests accordingly, hopefully fixed repo consistency issue too now
* final edits for zoedept fast image processor
* final minor edit for zoedepth fast imate procesor
Fix "RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu" error running the
following roformer tests on GPUs (CUDA or XPU):
```
tests/models/roformer/test_modeling_roformer.py::RoFormerSinusoidalPositionalEmbeddingTest::test_basic
tests/models/roformer/test_modeling_roformer.py::RoFormerSelfAttentionRotaryPositionEmbeddingTest::test_apply_rotary_position_embeddings
```
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
* Fix: resolve import order and duplicate import (ruff I001, F811)
* Format: clean up Dinov2 test file with ruff formatter
* Add _no_split_modules = ['Dinov2Layer'] to enable device_map='auto'
* Revert dinov2_with_registers _no_split_modules to original state
* Remove redundant device_map test as suggested
* Remove unused import after deleting test
* removed import torch and the redundant test function
* Update tests/models/dinov2/test_modeling_dinov2.py
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Fix multiple devices error on Janus
* Fix AttributeError on Janus BOI token
* Initialize lm first in Janus to get correct device map
* Added expectations for Janus test_model_generate_images
* Fixed JanusVisionEncoderLayer being split across devices
* Code formatting
* Adding modeling file
* Reverted changes out of scope for this PR
* feat: add colqwen2 (wip)
* tests: fix test_attention_outputs
* tests: reduce hidden size to accelerate tests
* tests: fix `test_attention_outputs` 🥳
* fix: fix wrong parent class for `ColQwen2ForRetrievalOutput`
* fix: minor typing and style changes
* chore: run `make style`
* feat: remove redundant `max_num_visual_tokens` attribute in `ColQwen2Processor`
* tests: tweak comments
* style: apply ruff formatter
* feat: move default values for `visual_prompt_prefix` and `query_prefix`
* docs: update ColQwen2 model card
* docs: tweak model cards
* docs: add required example config checkpoint
* tests: update expected scores in integration test
* docs: tweak quickstart snippets
* fix: address PR comments
* tests: fix colqwen2 tests + tweak comment in colpali test
* tests: unskip useful tests
* fix: fix bug when `visual_prompt_prefix` or `query_prefix` is an empty string
* fix: fix ColPali outputs when `return_dict == False`
* fix: fix issue with PaliGemma output not being a dict
* docs: set default dtype to bfloat16 in quickstart snippets
* fix: fix error when `return_dict=False` in ColPali and ColQwen2
* tests: fix special tokens not being replaced in input_ids
* style: fix lint
* fix: `ColQwen2Processor`'s `padding_side` is now set from `processor_config.json`
* fix: remove unused `padding_side` in ColQwen2 model
* docs: update ColQwen2's model doc
* fix: fix harcoded vlm backbone class in ColQwen2Config
* fix: remove `padding_side` from ColQwen2Processor as should fed from kwargs
* docs: fix typo in model docstring
* docs: add illuin mention in model docs
* fix: let `padding_size` be handled by `tokenizer_config.json`
* docs: add colpali reference url in colqwen2's model doc
* docs: add Hf mention in model docs
* docs: add late interaction mention in model docs
* docs: tweak colqwen2 model doc
* docs: update reference checkpoint for ColPali to v1.3
* docs: simplify quickstart snippets
* docs: remove redundant `.eval()`
* refactor: use `can_return_tuple` decorator for ColPali and ColQwen2
* docs: fix copyright date
* docs: add missing copyright in tests
* fix: raise error when `initializer_range` is not in config
* docs: remove redundant `.eval()` in colpali doc
* fix: fix `get_text_config` now that Qwen2VL has a proper `text_config` attribute
See https://github.com/huggingface/transformers/pull/37268 for details about changes in Qwen2VL's config.
* fix: add missing `initializer_range` attribute in `ColQwen2Config`
* fix: use `get_text_config` in `resize_token_embeddings`
* update colwen2 with auto_docstring
* docs: fix wrong copyright year
* chore: remove `raise` as `initializer_range` has a default value in `ColQwen2Config`
* refactor: merge `inner_forward` into `forward`
* Refactor colqwen2 after refactoring of qwen2VL, use modular for modeling code
* protect torch import in modular to protect in processing
* protect torch import in modular to protect in processing
* tests: fix hf model path in ColQwen2 integration test
* docs: clarify `attn_implementation` and add comments
* docs: add fallback snippet for using offline PIL dummy images
* docs: temporarily revert attn_implementation to `None` while sdpa is not fixed
* docs: tweaks in colpali/colqwen2 quick start snippets
* fix: add missing flags to enable SDPA/Flex Attention in ColQwen2 model
* fix: add missing changes in modular file
* fix modeling tests
---------
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* Update Loss Functions to Accept Tensor num_items_in_batch
* Fix device mismatch by moving num_items_in_batch to loss device in fixed_cross_entropy
* fix the ruff check
* delete the unused if stat
* fix the type problem
transformers.enable_full_determinism enables deterministic
flash attention using `FLASH_ATTENTION_DETERMINISTIC`
800510c67b/src/transformers/trainer_utils.py (L79)
However, current checks use a global variable `deterministic_g`,
which will do the environment variable check as soon as importing,
this will cause issues as users can call
`transformers.enable_full_determinism` after
`transformers.modeling_flash_attention_utils` is imported. This
behavior is introduced in
https://github.com/huggingface/transformers/pull/33932/files#r1806668579
to fix the graph break.
As a result, this PR implement fixes by delaying the environment variable
check to the first time when `_flash_attention_forward` is executed, so
that we can fix this issue and we won't introduce a graph break.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* A shallow copy in groundingdino
Fixes#37333
* Supprimer une ligne vide dans la classe GroundingDinoForObjectDetection
* Translate comments in the GroundingDinoForObjectDetection class from French to English
* make it go brrrr
* date time
* update
* fix
* up
* uppp
* up
* no number i
* udpate
* fix
* [paligemma] fix processor with suffix (#38365)
fix pg processor
* [video utils] group and reorder by number of frames (#38374)
fix
* Fix convert to original state dict for VLMs (#38385)
* fix convert to original state dict
* fix
* lint
* Update modeling_utils.py
* update
* warn
* no verbose
* fginal
* ouft
* style
---------
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>
* Use dict comprehension to create dict
* Fix type annotation
Union[Any] doesn't really make any sense
* Remove methods that are already implemented in the `UserDict` parent
class
* updates
* fixup
* fix tests
* fix test
* fix
* let it be here for now, till monday
* two more fixes
* persimmon
* fixup
* fix
* fixup
* make sure fuyu runs now that LM has new attn API
* fixup + tests
* qwen vl uses new mask interface as well
* qwen image features format
* update
* remove image_sizes
* address comments
* i am dumb...
* feat: add cache retention for requests
* fix: propagate `manual_eviction` param & refactor `finish_request`
`finish_request` now only takes `request_id: str` as an input rather
than the full `RequestState`, which was not needed and simplifies
calling from `ContinuousBatchingManager::evict_request_from_cache`
* refactor: pop req from `active_requests`
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Support tensor-valued _extra_state values
TransformerEngine uses the pytorch get/set_extra_state API to store FP8
layer config information as bytes Tensor in the _extra_state entry in
the state dict. With recent changes to from_pretrained, this
functionality has broken and loading a model that uses this API doesn't
appear to work. This PR fixes the save/load pretrained functions for
extra state entries that use a pytorch tensor, and adds a (currently
x-failing) test for a dictionary extra state.
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* start refactoring whisper
* revert for now
* first step
* carry over attn fixes
* check if this works
* whisper has an off by one somewhere - cutting mask in any interface
* make it based on interface
* remove some tests that were skipped but now work
* some fixes for whisper tests
* interface changes
* change the order of fix
* some attention adjustments for eager + TP
* fix scaling
* mask changes
* why does whisper contain those extra seq lens?
* fix from config for fa2 as input_ids is invalid
* fix another test
* another fix
* disable flex attn due to compile issues
* copies and refactor for qwen audio since it somewhat relies on whisper
* fix scaling and smaller things
* retrigger
* new new interface version + more fixups
* adjust qwen
* add comment
* forgot this one
* change copies as whisper cuts on the mask
* add guard
* add flex attention
* switch to new mask function + add skips for torchscript
* remove old api with cache position
* last changes?
* trigger ci
* standardize
* fix tests
* batch update some processors, not final yet
* oke, now I tested that everything indeed runs. Still needs prettification
* emu3
* fixup
* gemma3 but it doesn't generate anything
* fuyu
* update
* why?
* Update src/transformers/models/aya_vision/processing_aya_vision.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* address comments
* bc
* why do we need to guard import this every time?
* i hate guarded imports
* i am blind
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Firstly: Better detection of when we're a custom class
* Trigger tests
* Let's break everything
* make fixup
* fix mistaken line doubling
* Let's try to get rid of it from config classes at least
* Let's try to get rid of it from config classes at least
* Fixup image processor
* no more circular import
* Let's go back to setting `_auto_class` again
* Let's go back to setting `_auto_class` again
* stash commit
* Revert the irrelevant changes until we figure out AutoConfig
* Change tests since we're breaking expectations
* make fixup
* do the same for all custom classes
* Cleanup for feature extractor tests
* Cleanup tokenization tests too
* typo
* Fix tokenizer tests
* make fixup
* fix image processor test
* make fixup
* Remove warning from register_for_auto_class
* Stop adding model info to auto map entirely
* Remove todo
* Remove the other todo
* Let's start slapping _auto_class on models why not
* Let's start slapping _auto_class on models why not
* Make sure the tests know what's up
* Make sure the tests know what's up
* Completely remove add_model_info_to_*
* Start adding _auto_class to models
* Start adding _auto_class to models
* Add a flaky decorator
* Add a flaky decorator and import
* stash commit
* More message cleanup
* make fixup
* fix indent
* Fix trust_remote_code prompts
* make fixup
* correct indentation
* Reincorporate changes into dynamic_module_utils
* Update call to trust_remote_code
* make fixup
* Fix video processors too
* Fix video processors too
* Remove is_flaky additions
* make fixup
* let's try a non-regex solution
* make fixup
* Slight adjustment
* Let's just use the original code with a check
* slight tweak to conditional
* slight tweak to conditional
* Update roformer model card
* fix example purpose description
* fix model description according to the comments
* revert changes for autodoc
* remove unneeded tags
* fix review issues
* fix hfoption
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(swinv2): Update SwinV2 model card to new standard format
* docs(swinv2): Apply review suggestions
Incorporates feedback from @stevhliu to:
- Enhance the introductory paragraph with more details about scaling and SimMIM.
- Generalize the tip from "image classification tasks" to "vision tasks".
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* stash commit
* Experiment 1: Try just Gemma
* Experiment 1: Just try Gemma
* make fixup
* Trigger tests
* stash commit
* Try adding Gemma3 as well
* make fixup
* Correct attrib names
* Correct pipeline model mapping
* Add in all_model_classes for Gemma1 again
* Move the pipeline model mapping around again
* make fixup
* Revert Gemma3 changes since it's a VLM
* Let's try Falcon
* Correct attributes
* Correct attributes
* Let's try just overriding get_config() for now
* Do Nemotron too
* And Llama!
* Do llama/persimmon
* Correctly skip tests
* Fix Persimmon
* Include Phimoe
* Fix Gemma2
* Set model_tester_class correctly
* Add GLM
* More models!
* models models models
* make fixup
* Add Qwen3 + Qwen3MoE
* Correct import
* make fixup
* Add the QuestionAnswering classes
* Add the QuestionAnswering classes
* Move pipeline mapping to the right place
* Jetmoe too
* Stop RoPE testing models with no RoPE
* Fix up JetMOE a bit
* Fix up JetMOE a bit
* Can we just force pad_token_id all the time?
* make fixup
* fix starcoder2
* Move pipeline mapping
* Fix RoPE skipping
* Fix RecurrentGemma tests
* Fix Falcon tests
* Add MoE attributes
* Fix values for RoPE testing
* Make sure we set bos_token_id and eos_token_id in an appropriate range
* make fixup
* Fix GLM4
* Add mamba attributes
* Revert bits of JetMOE
* Re-add the JetMOE skips
* Update tests/causal_lm_tester.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Add licence
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Get parallel loader working. Include tests.
* Update the tests for parallel loading
* Rename env variables.
* Add docs for parallel model weight loading.
* Touch up parallel model loading docs.
* Touch up parallel model loading docs again.
* Edit comment in test_modeling_utils_parallel_loading.py
* Make sure HF_PARALLEL_LOADING_WORKERS is spelled correctly in modeling_utils.py
* Correct times for parallelized loading, previous times were for a "hot" filesystem
* Update parallel model loading so the spawn method is encapsulated. DRY up the code by leveraging get_submodule.
* Update docs on model loading parallelism so that details on setting the multiprocessing start method are removed, now that the package handles this step internally.
* Fix style on model loading parallelism changes.
* Merge latest version of master's modeling_utils.
* Removed unused variable.
* Fix argument packing for the parallel loader.
* Fix state dict being undefined in the parallel model loader.
* Rename variables used in parallel model loading for clarity. Use get_module_from_name().
* Switch to the use of threads for parallel model loading.
* Update docs for parallel loading.
* Remove the use of json.loads when evaluating HF_ENABLE_PARALLEL_LOADING. Prefer simple casting.
* Move parallelized shard loading into its own function.
* Remove use of is_true(). Favor checking env var true values for HF_ENABLE_PARALLEL_LOADING.
* Update copyright to 2025 in readme for paralell model loading.
* Remove garbage collection line in load_shard_file, implicit garbage collection already occurs.
* Run formatter on modeling_utils.py
* Apply style fixes
* Delete tests/utils/test_modeling_utils_parallel_loading.py
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* refactor to rm property can_save_slow_tokenizer, it can be done within the if of save_vocab
* move property to fast
* revert if
* check if vocab_file is attr
* fix check for sp
* fix if condition
* fix if condition
* fix if condition
* stash for now
* initial commit
* small updated
* up
* up
* works!
* nits and fixes
* don't loop too much
* finish working example
* update
* fix the small freeblocks issue
* feat: stream inputs to continuous batch
* fix: update attn from `eager` to `sdpa`
* refactor: fmt
* refactor: cleanup unnecessary code
* feat: add `update` fn to `PagedAttentionCache`
* feat: broken optimal block size computation
* fix: debugging invalid cache logic
* fix: attention mask
* refactor: use custom prompts for example
* feat: add streaming output
* fix: prefill split
refactor: add doc strings and unsound/redundant logic
fix: compute optimal blocks logic
* fix: send decoded tokens when `prefilling_split` -> `decoding`
* refactor: move logic to appropriate parent class
* fix: remove truncation as we split prefilling anyways
refactor: early return when we have enough selected requests
* feat: add paged attention forward
* push Ggraoh>
* add paged sdpa
* update
* btter mps defaults
* feat: add progress bar for `generate_batch`
* feat: add opentelemetry metrics (ttft + batch fill %age)
* feat: add tracing
* Add cuda graphs (#38059)
* draft cudagraphs addition
* nits
* styling
* update
* fix
* kinda draft of what it should look like
* fixes
* lol
* not sure why inf everywhere
* can generate but output is shit
* some fixes
* we should have a single device synch
* broken outputs but it does run
* refactor
* updates
* updates with some fixes
* fix mask causality
* another commit that casts after
* add error
* simplify example
* update
* updates
* revert llama changes
* fix merge conflicts
* fix: tracing and metrics
* my updates
* update script default values
* fix block allocation issue
* fix prefill split attnetion mask
* no bugs
* add paged eager
* fix
* update
* style
* feat: add pytorch traces
* fix
* fix
* refactor: remove pytorch profiler data
* style
* nits
* cleanup
* draft test file
* fix
* fix
* fix paged and graphs
* small renamings
* cleanups and push
* refactor: move tracing and metrics logic to utils
* refactor: trace more blocks of code
* nits
* nits
* update
* to profile or not to profile
* refactor: create new output object
* causal by default
* cleanup but generations are still off for IDK what reason
* simplifications but not running still
* this does work.
* small quality of life updates
* nits
* updaet
* fix the scheduler
* fix warning
* ol
* fully fixed
* nits
* different generation parameters
* nice
* just style
* feat: add cache memory usage
* feat: add kv cache free memory
* feat: add active/waiting count & req latency
* do the sampling
* fix: synchronize CUDA only if available and improve error handling in ContinuousBatchingManager
* fix on mps
* feat: add dashboard & histogram buckets
* perf: improve waiting reqs data structures
* attempt to compile, but we should only do it on mps AFAIK
* feat: decouple scheduling logic
* just a draft
* c;eanup and fixup
* optional
* style
* update
* update
* remove the draft documentation
* fix import as well
* update
* fix the test
* style doomed
---------
Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
* starting attn refactor for encoder decoder models via bart (eager + sdpa)
* flash attention works, remove unnecessary code
* flex attention support for bart!, gotta check if the renaming is not too aggressive
* some comments
* skip flex grad test for standalone as done with the other test
* revert flex attn rename (for now), sdpa simplify, and todos
* more todos
* refactor mask creation for reuse
* modular attempt at biogpt
* first batch of other models
* fix attn dropout
* fix autoformer copies
* hubert
* another batch of models
* copies/style + last round of bart models --> whisper next?
* remove unnecessary _reshape function and remove copy to whisper
* add skip for decoder-only models out of enc-dec (same as in bart)
* bring back licences
* remove comment, added to pr read instead
* mostly docs
* disable sew flex attn as it's unclear attn mask for now
* oops
* test fixes for enc-dec
* torch fx fixes + try at flex attn
* skip on mbart
* some more fixes
* musicgen skip / delete old attn class logic + sdpa compose compile skip
* disable flex attn for musicgen, not worth the effort
* more fixes and style
* flex attention test for dropout and encoder decoder that dont have main input names
* informer fixes
* the weirdest thing I've encountered yet...
* style
* remove empty tensor attempt, found core root in previous commits
* disable time series due to tests being very text centric on inputs
* add speech to text to be ignoring the other attns, also due to tests
* update docs
* remaining issues resolved ?
* update docs for current state --> nllb moe and pegasus x sdpa is questionable :D
* some models have not set the is_causal flag...
* change dtype in softmax tol old behaviour + some modular fixes
* I hate it but it is what it is
* fixes from main for bart
* forgot this one
* some model fixes
* style
* current status
* marian works now
* fixing some copies
* some copy fixes + time series x informer
* last models possibly and fixes on style/copies
* some post merge fixes
* more fixes
* make attention interface callable and move warnings there
* style lol
* add comment to "unsupported"
* remove callable interface and change interface warnings + some copies
* fix
* ternary is ugly af, make it simpler
* how did that happen
* fix flex attn test
* failing the test
* no more fallback! fixing copies next
* style + attn fixed
* fixing copies and mask creation
* wrong copy
* fixup tests and disable flex attn for now
* fixup last tests?
* docs(swin): Update Swin model card to standard format
* docs(swin): Refine link to Microsoft organization for Swin models
Apply suggestion from @stevhliu in PR #37628.
This change updates the link pointing to the official Microsoft Swin Transformer checkpoints on the Hugging Face Hub.
The link now directs users specifically to the Microsoft organization page, filtered for Swin models, providing a clearer and more canonical reference compared to the previous general search link.
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(swin): Clarify padding description and link to backbone docs
Apply suggestion from @stevhliu in PR #37628.
This change introduces two improvements to the Swin model card:
1. Refines the wording describing how Swin handles input padding for better clarity.
2. Adds an internal documentation link to the general "backbones" page when discussing Swin's capability as a backbone model.
These updates enhance readability and improve navigation within the Transformers documentation.
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(swin): Change Swin paper link to huggingface.co/papers as suggested
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* _get_padding_size module
* do not patchify images when processing multi image
* modify llava onevision image processor fast
* tensor to list of tensors
* backward compat
* reuse pad_to_square in llave & some clarification
* add to doc
* fix: consider no image cases (text only or video)
* add integration test
* style & repo_consistency
* accept custom device_mesh
* fix device_map
* assert that num_heads % tp_size == 0
* todo.
* ReplicateParallel
* handle tied weights
* handle dtensor in save_pretrained with safe_serialization
* tp test works
* doesnt work
* fix shard_and_distribute_module's rank should be local_rank
* tp=4 is correct
* dp+tp is broken
* todo allreduce with dtensors on another dim is annoying
* workaround to sync dp grads when using dtensors
* loading a checkpoint works
* wandb and compare losses with different tp/dp
* cleaning
* cleaning
* .
* .
* logs
* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention
* DP=2 TP=2 now works even with tied embeddings
* model.parameters() and model.module.parameters() are empty..
* reformat sanity_check_tensor_sync
* set atol=1e-4 for CP to pass
* try populate _parameters from named_modules
* refactors
TP2 DP2 works
CP2 DP2 works
* is_causal=True and pack sequences, no attn mask, and preshuffle dataset
* fix packing
* CP=4 doesn't work
* fix labels and position_ids for CP
* DP CP works with transformers 🥳🥳🥳
* refactor
* add example cp
* fixup
* revert sdpa changes
* example cleared
* add CP, DP to the mesh init
* nit
* clean
* use `ALL_PARALLEL_STYLES`
* style
* FSDP works
* log on 1 rank
* .
* fix?
* FSDP1 also has .parameters() bug
* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay
* .
* style and fixup
* move stuff around
* fix tests
* style
* let's make it a check
* add missing licences
* warning should be an info
* tp plan should not be NONE
* test all
* god damn it
* test all
---------
Co-authored-by: nouamanetazi <nouamane98@gmail.com>
* add seq_idx and fa kwargs
* update tests
* docs and grad ckpt support
* fmt
* better names
* test_raise_missing_padding_free_kwarg_errs
* + seq_idx in doc strings
* padding free training docs
* add link to pr plots
* raise err on attn_mask with padding free
* rm raising missing padding free err test
* BambaFlashAttentionKwargs
* run modular util for modular_granitemoehybrid.py
* accept custom device_mesh
* fix device_map
* assert that num_heads % tp_size == 0
* todo.
* ReplicateParallel
* handle tied weights
* handle dtensor in save_pretrained with safe_serialization
* tp test works
* doesnt work
* fix shard_and_distribute_module's rank should be local_rank
* tp=4 is correct
* dp+tp is broken
* todo allreduce with dtensors on another dim is annoying
* workaround to sync dp grads when using dtensors
* loading a checkpoint works
* wandb and compare losses with different tp/dp
* cleaning
* cleaning
* .
* .
* logs
* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention
* DP=2 TP=2 now works even with tied embeddings
* model.parameters() and model.module.parameters() are empty..
* reformat sanity_check_tensor_sync
* set atol=1e-4 for CP to pass
* try populate _parameters from named_modules
* refactors
TP2 DP2 works
CP2 DP2 works
* is_causal=True and pack sequences, no attn mask, and preshuffle dataset
* fix packing
* CP=4 doesn't work
* fix labels and position_ids for CP
* DP CP works with transformers 🥳🥳🥳
* refactor
* add example cp
* fixup
* revert sdpa changes
* example cleared
* add CP, DP to the mesh init
* nit
* clean
* use `ALL_PARALLEL_STYLES`
* style
* FSDP works
* log on 1 rank
* .
* fix?
* FSDP1 also has .parameters() bug
* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay
* .
* style and fixup
* move stuff around
* fix tests
* style
* let's make it a check
* warning should be an info
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
When preparing the causal attention mask at this point the mask comes
in as a float tensor with min value as a masked value.
It is not correct to convert it to bool and treat it as a bool mask as
this inverts the mask.
`torch.nn.functional.scaled_dot_product_attention` expects that a masked value is `False`.
I suspect that the `sdpa` implementation variant may not have been
thoroughly tested and that is why this error was not caught earlier.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Add Llama4TextModel to AutoModel mapping
using Llama4TextConfig on AutoModel.from_config raises a ValueError when it is expected to instantiate a Llama4TextModel
bnb quant tests: remove obsolete trust_remote_code test
The MPT model is now natively integrated in Transformers and no longer requires trust_remote_code=True. This removes the failing test_get_keys_to_not_convert_trust_remote_code and related usage, which depended on remote code and caused CI issues due to missing dependencies (e.g., triton_pre_mlir).
* Update modular_qwen2_5_omni.py
fix the error when loading quantized model by AuotAWQ.
* Update modeling_qwen2_5_omni.py
sync code to modular_qwen2_5_omni.py
* pipeline generation defaults
* add max_new_tokens=20 in test pipelines
* pop all kwargs that are used to parameterize generation config
* add class attr that tell us whether a pipeline calls generate
* tmp commit
* pt text gen pipeline tests passing
* remove failing tf tests
* fix text gen pipeline mixin test corner case
* update text_to_audio pipeline tests
* trigger tests
* a few more tests
* skips
* some more audio tests
* not slow
* broken
* lower severity of generation mode errors
* fix all asr pipeline tests
* nit
* skip
* image to text pipeline tests
* text2test pipeline
* last pipelines
* fix flaky
* PR comments
* handle generate attrs more carefully in models that cant generate
* same as above
* tmp commit (imports broken)
* working version; update tests
* remove line break
* shorter msg
* dola checks need num_beams=1; other minor PR comments
* update early trainer failing on bad gen config
* make fixup
* test msg
* Fix ModuleNotFoundError torchao.prototype.low_bit_optim since torchao v 0.11.0
* Fix space on blank line
* update torchao's AdamW4bit and AdamW8bit import for v0.11.0
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* add args support to fast image processors
* add comment for clarity
* fix-copies
* Handle child class args passed as both args or kwargs in call and preprocess functions
* revert support args passed as kwargs in overwritten preprocess
* fix image processor errors
* Add flash-attention-2 backend for ESM-2
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* update extended_attention_mask for fa2
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* add test_flash_attn_2_equivalence test
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
---------
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* enable optional RMS in BitLinear
* Fix naming
* Import RMS from Llama using config.*
* make fix-copies
* ran CI loop
* remove default BitNetQuantConfig values
* Fix BitNetQuantConfig to be Optional
* Fix config docstrings to match Optoinal
* Edit docstrings to match standards
---------
Co-authored-by: steinmetzc <codysteinmetz7@gmail.com>
Co-authored-by: codys12 <steinmetzc@dh-mgmt4.hpc.msoe.edu>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Include output embedding as well with `include_embedding` flag
Summary:
att
Test Plan:
python tests/quantization/torchao_integration/test_torchao.py -k test_include_embedding
Reviewers:
Subscribers:
Tasks:
Tags:
* format
* rename include_embedding to include_input_output_embeddings
---------
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* disable deepspeed when setting up fake trainer
* Apply style fixes
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* mvp
* remove trust_remote_code
* generate_from_hub
* handle requirements; docs
* english
* doc PR suggestions
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* changed remote code path to generate/generate.py
* model repo has custom generate -> override base generate
* check for proper inheritance
* some doc updates (missing: tag-related docs)
* update docs to model repo
* nit
* nit
* nits
* Update src/transformers/dynamic_module_utils.py
* Apply suggestions from code review
* Update docs/source/en/generation_strategies.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* trust remote code is required
* use new import utils for requirements version parsing
* use org examples
* add tests
* Apply suggestions from code review
Co-authored-by: Manuel de Prada Corral <6536835+manueldeprada@users.noreply.github.com>
* ascii file structure; tag instructions on readme.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Manuel de Prada Corral <6536835+manueldeprada@users.noreply.github.com>
* init vilt image processor fast
* Refactor image processor tests to use loop for all processors
* Add ViltImageProcessorFast with PyTorch-based optimized image processing
* Change made automatically by make fixup command
* Change made automatically by make fix-copies command
* Fix type hints in ViltImageProcessorFast for Python compatibility
* Define constants for image resizing based on COCO dataset aspect ratio
* Add missing property initializations to ViltImageProcessorFast
* Extract resize logic into dedicated method in ViltImageProcessorFast
* Extract padding logic into dedicated method
* Implement shape-based image grouping for optimized processing in Vilt
* Update test suite to verify ViltImageProcessorFast attributes
* Move variable declarations to _preprocess method parameters
* Remove unused parameters
* Rename _resize method to resize to override existing function
* Remove whitespace
* Remove unnecessary type check and conversion for stacked_images
* Remove redundant loop and apply padding directly to stacked images
* Refactor pad function to return images and mask as tuple instead of dict
* Add tests comparing padding masks in slow and fast implementations
* Update ViltImageProcessor tests to ensure compatibility between slow and fast implementations
* Replace add_start_docstrings with auto_docstring in ViltImageProcessorFast
* Move docstrings of custom args to ViltFastImageProcessorKwargs
* Use reorder_images function for both masks and images
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* fix llava processor to calculate unpad size correctly
* repo consistency
* Revert "repo consistency" & "setUp in llava family"
This reverts commit 26a50af8db5b15bb6b700db3d53342fe69579d8e.
* add edge case test for padding & unpadding
* compute unpadding size from original size
* make test config explicit
* Revert "compute unpadding size from original size"
This reverts commit 752cd27ad9710ab056c17a9986760c4651975540.
* Revert "add edge case test for padding & unpadding"
This reverts commit ccbd094d69c3f8f6a259159164284f60ba835bce.
* revert unpad logic
* remove irrelevant tests
* model test
* remove processor from model test
---------
Co-authored-by: jaycha <jaycha@ncsoft.com>
* chore(qwen2): display warning log only when sliding window attention is enabled
* Align modeling_qwen2.py and modular_qwen2.py
---------
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* accept arbitrary kwargs
* move user commands to a separate fn
* work with generation config files
* rm cmmt
* docs
* base generate flag doc section
* nits
* nits
* nits
* no <br>
* better basic args description
* initial design
* update all video processors
* add tests
* need to add qwen2-vl (not tested yet)
* add qwen2-vl in auto map
* fix copies
* isort
* resolve confilicts kinda
* nit:
* qwen2-vl is happy now
* qwen2-5 happy
* other models are happy
* fix copies
* fix tests
* add docs
* CI green now?
* add more tests
* even more changes + tests
* doc builder fail
* nit
* Update src/transformers/models/auto/processing_auto.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* small update
* imports correctly
* dump, otherwise this is getting unmanagebale T-T
* dump
* update
* another update
* update
* tests
* move
* modular
* docs
* test
* another update
* init
* remove flakiness in tests
* fixup
* clean up and remove commented lines
* docs
* skip this one!
* last fix after rebasing
* run fixup
* delete slow files
* remove unnecessary tests + clean up a bit
* small fixes
* fix tests
* more updates
* docs
* fix tests
* update
* style
* fix qwen2-5-vl
* fixup
* fixup
* unflatten batch when preparing
* dump, come back soon
* add docs and fix some tests
* how to guard this with new dummies?
* chat templates in qwen
* address some comments
* remove `Fast` suffix
* fixup
* oops should be imported from transforms
* typo in requires dummies
* new model added with video support
* fixup once more
* last fixup I hope
* revert image processor name + comments
* oh, this is why fetch test is failing
* fix tests
* fix more tests
* fixup
* add new models: internvl, smolvlm
* update docs
* imprt once
* fix failing tests
* do we need to guard it here again, why?
* new model was added, update it
* remove testcase from tester
* fix tests
* make style
* not related CI fail, lets' just fix here
* mark flaky for now, filas 15 out of 100
* style
* maybe we can do this way?
* don't download images in setup class
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Do not erase a cache_position initialization passed explicitly to generate(), if there is one.
But: Let initialization replace cache_position if it's set to None. I assume that if the value is explicitly passed but None, we should initialize anyway.
* update models
* why rename
* return attn weights when sdpa
* fixes
* fix attn implementation composite
* fix moshi
* add message
* add typings
* use explicitly all flags for each attn type
* fix some tests
* import what is needed
* kosmos on main has ew attention already, yay
* new models in main, run fixup
* won't fix kosmos yet
* fix-copies
* clean up after rebasing
* fix tests
* style
* dont cast attns to fp32
* did we update ruff? oke, let's just do what it asks
* fix pixtral after rebase
* Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model
* Fix invalid operand type
* Allow image_sizes to be optional in forward pass to fit tests
Disallow using sdpa and output_attentions
* Disallow using sdpa with output_attentions
* Delete useless comments, use eager attention from smolvlm, use pattern from mistral
* add _supports_attention_backend
* use kwargs instead of position_ids
---------
Co-authored-by: aurelien.lac <aurelien.lac@lighton.ai>
* Add fast image processor support for Swin2SR
* Add Swin2SR tests of fast image processing
* Update docs and remove unnecessary test func
* Fix docstring formatting
* Skip fast vs slow processing test
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* i guessreverted all CdGen classes
* style
* llava onevision
* fix copies
* fix some tests
* some more tests
* dump
* skip these
* nevermind, i am dumb
* revert fix not needed
* fixup
* fixup
* another fixup
* more fixup to make ci finally happy
* fixup after rebasing
* fix qwen tests
* add internVL + typos here and there
* image token index -> id
* style
* fix init weights
* revert blip-2 not supported
* address comments
* fix copies
* revert blip2 test file as well
* as discussed internally, revert back CdGen models
* fix some tests
* fix more tests for compile
* CI red
* fix copies
* enumerate explicitly allowed models
* address comments
* fix tests
* fixup
* style again
* add tests for new model class
* another fixup ( x _ x )
* [fixup] unused attributes can be removed post-deprecation
* Enable granite speech 3.3 tests
* skip sdpa test for granite speech
* Explicitly move model to device
* Use granite speech 2b in tests
---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* args keep_torch_compile=False in _save and _wwrap_method
* Fix FSDP execution on evaluation for torch_compile mode
* add test trainer FSDP + Torch Compile
* fix quality code
* make style
* Revert " make style"
This reverts commit 77e797f8829c50992cc21496be3d9a3e480e1c97.
* make style
* [fix] one pixel should be added when length is odd
* [fix] add vision_aspect_ratio args & typo
* [fix] style
* [fix] do not fix fast file directly
* [fix] convert using modular
* remove duplicate codes
* match unpad logic with pad logic
* test odd-sized images for llava & aria
* test unpad odd-sized padding for llava family
* fix style
* add kwarg to onvision modular
* move vision_aspect_ratio from image_processor to processor
(llava_onevision)
* add num_tokens_to_discard to the forward of Dinov2ForImageClassification
* redefine forward in modular file, remove change to modeling_dinov2 file
* run make fixup
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Implements last migrations for generation from `config.vocab_size` to `config.get_text_config().vocab.size`
In doing so, we enable multimodal models to fully leverage all existing generation features.
* Let notification service succeed even when artifacts and reported jobs on github have mismatch
* Use default trace msg if no trace msg available
* Add pop_default helper fn
* style
Summary:
Currently when we try to quantize input_embedding for some models, the output embedding
(lm_head) will also be quantized the same way, since they are tied, and this may not be what
we want. To break the tie, we added the option to allow people to
1. load unquantized weight
2. tie weights
3. quantize
so that the tie will be broken
Test Plan:
```
from transformers import (
AutoModelForCausalLM,
AutoProcessor,
AutoTokenizer,
TorchAoConfig,
)
from torchao.quantization.quant_api import (
IntxWeightOnlyConfig,
Int8DynamicActivationIntxWeightConfig,
AOPerModuleConfig
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch
model_id = "microsoft/Phi-4-mini-instruct"
embedding_config = IntxWeightOnlyConfig(
weight_dtype=torch.int8,
granularity=PerAxis(0),
)
linear_config = Int8DynamicActivationIntxWeightConfig(
weight_dtype=torch.int4,
weight_granularity=PerGroup(32),
weight_scale_dtype=torch.bfloat16,
)
quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(quantized_model)
print("embed_tokens.weight:", quantized_model.model.embed_tokens.weight)
print("lm head weight:", quantized_model.lm_head.weight)
from transformers.modeling_utils import find_tied_parameters
print(find_tied_parameters(quantized_model))
```
Reviewers:
Subscribers:
Tasks:
Tags:
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* rm already deprecated padding max length
* truncate_strategy AS AN ARG is already deprecated for a few years
* fix
* rm test_padding_to_max_length
* rm pad_to_max_length=True in other tests
* rm from common
* missed fnet
* Support `AOPerModuleConfig` and include_embedding
Summary:
This PR adds support per module configuration for torchao
Also added per module quantization examples:
1. Quantizing different layers with different quantization configs
2. Skip quantization for certain layers
Test Plan:
python tests/quantization/torchao_integration/test_torchao.py -k test_include_embedding
python tests/quantization/torchao_integration/test_torchao.py -k test_per_module_config_skip
Reviewers:
Subscribers:
Tasks:
Tags:
* format
* format
* inlcude embedding remove input embedding from module not to convert
* more docs
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Update src/transformers/quantizers/quantizer_torchao.py
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Update src/transformers/quantizers/quantizer_torchao.py
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
---------
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Support FlaxPreTrainedModel to load model checkpoint from subfolder in local directory as safetensors format
Signed-off-by: Yan Zhao <zhao.y4@northeastern.edu>
* Unhardcode use_chunked_attention, fix no_rope_layers
* Go back to exhaustive list of bools
* Conversion and modeling updates
* Fix rope
* Unhardcode rope
* Fix context length
* style
* Minor updates to conversion
* Use StaticCache
* Minor simplification
* DynamicCache 🤦
* Style
* Style
* No more red flaky tests in the CI!
* Remove the CircleCI logic as well
* Revert most changes including is_flaky behaviour
* make fixup
* Move to a more sensible place
* Mark a flaky test that failed on this PR!
* correct import
* update
* update
* update
* update
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Fix check of unecessary packages (issue #37626)
* Reformat using ruff
* And a condition to avoind the risk of matching a random object in `import_utils`
* Reformat
* copy the last changes from broken PR
* small format
* some fixes and refactoring after review
* format
* add config attr for loss
* some fixes and refactoring
* fix copies
* fix style
* add test for d-fine resnet
* fix decoder layer prop
* fix dummies
* format init
* remove extra print
* refactor modeling, move resnet into separate folder
* fix resnet config
* change resnet on hgnet_v2, add clamp into decoder
* fix init
* fix config doc
* fix init
* fix dummies
* fix config docs
* fix hgnet_v2 config typo
* format modular
* add image classification for hgnet, some refactoring
* format tests
* fix dummies
* fix init
* fix style
* fix init for hgnet v2
* fix index.md, add init rnage for hgnet
* fix conversion
* add missing attr to encoder
* add loss for d-fine, add additional output for rt-detr decoder
* tests and docs fixes
* fix rt_detr v2 conversion
* some fixes for loos and decoder output
* some fixes for loss
* small fix for converted modeling
* add n model config, some todo comments for modular
* convert script adjustments and fixes, small refact
* remove extra output for rt_detr
* make some outputs optionsl, fix conversion
* some posr merge fixes
* small fix
* last field fix
* fix not split for hgnet_v2
* disable parallelism test for hgnet_v2 image classification
* skip multi gpu for d-fine
* adjust after merge init
* remove extra comment
* fix repo name references
* small fixes for tests
* Fix checkpoint path
* Fix consistency
* Fixing docs
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* added fast image processor for VitMatte including updated and new tests, fixed a bug in the slow image processor that processed images incorrectly for input format ChannelDimension.FIRST in which case the trimaps were not added in the correct dimension, this bug was also reflected in the tests through incorretly shaped trimaps being passed
* final edits for fast vitmatte image processor and tests
* final edits for fast vitmatte image processor and tests
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* added the configuartion for sam_hq
* added the modeelling for sam_hq
* added the sam hq mask decoder with hq features
* added the code for the samhq
* added the code for the samhq
* added the code for the samhq
* Delete src/transformers/models/sam_hq/modelling_sam_hq.py
* added the code for the samhq
* added the code for the samhq
* added the chnages for the modeelling
* added the code for sam hq for image processing
* added code for the sam hq model
* added the required changes
* added the changes
* added the key mappings for the sam hq
* adding the working code of samhq
* added the required files
* adding the pt object
* added the push to hub account
* added the args for the sam maks decoder
* added the args for the sam hq vision config
* aded the some more documentation
* removed the unecessary spaces
* all required chnages
* removed the image processor
* added the required file
* added the changes for the checkcopies
* added the code for modular file
* added the changes for the __init file
* added the code for the interm embeds
* added the code for sam hq
* added the changes for modular file
* added the test file
* added the changes required
* added the changes required
* added the code for the
* added the cl errors
* added the changes
* added the required changes
* added the some code
* added the code for the removing image processor
* added the test dimensins
* added the code for the removing extra used variables
* added the code for modeluar file hf_mlp for a better name
* removed abbrevaation in core functionality
* removed abbrevaation in core functionality
* .contiguous() method is often used to ensure that the tensor is stored in a contiguous block of memory
* added the code which is after make fixup
* added some test for the intermediate embeddings test
* added the code for the torch support in sam hq
* added the code for the updated modular file
* added the changes for documentations as mentioned
* removed the heading
* add the changes for the code
* first mentioned issue resolved
* added the changes code to processor
* added the easy loading to init file
* added the changes to code
* added the code to changes
* added the code to work
* added the code for sam hq
* added the code for sam hq
* added the code for the point pad value
* added the small test for the image embeddings and intermediate embedding
* added the code
* added the code
* added the code for the tests
* added the code
* added ythe code for the processor file
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code for tests and some checks
* added some code
* added the code
* added the code
* added some code
* added some code
* added the changes for required
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added some changes
* added some changes
* removed spaces and quality checks
* added some code
* added some code
* added some code
* added code quality checks
* added the checks for quality checks
* addded some code which fixes test_inference_mask_generation_no_point
* added code for the test_inference_mask_generation_one_point_one_bb
* added code for the test_inference_mask_generation_one_point_one_bb_zero
* added code for the test_inference_mask_generation_one_box
* added some code in modelling for testing
* added some code which sort maks with high score
* added some code
* added some code
* added some code for the move KEYS_TO_MODIFY_MAPPING
* added some code for the unsqueeze removal
* added some code for the unsqueeze removal
* added some code
* added some code
* add some code
* added some code
* added some code
* added some testign values changed
* added changes to code in sam hq for readbility purpose
* added pre commit checks
* added the fix samvisionmodel for compatibilty
* added the changes made on sam by cyyever
* fixed the tests for samhq
* added some the code
* added some code related to init file issue during merge conflicts
* remobved the merge conflicts
* added changes mentioned by aruther and mobap
* added changes mentioned by aruther and mobap
* solving quality checks
* added the changes for input clearly
* added the changes
* added changes in mask generation file rgearding model inputs and sam hq quargs in processor file
* added changes in processor file
* added the Setup -> setupclass conversion
* added the code mentioned for processor
* added changes for the code
* added some code
* added some code
* added some code
---------
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Two PEFT tests are actually failing:
tests/peft_integration/test_peft_integration.py::PeftIntegrationTester::test_delete_adapter
tests/peft_integration/test_peft_integration.py::PeftIntegrationTester::test_peft_pipeline_no_warning
This must have been going on for some time but was apparently never
noticed. The cause is that the tests themselves are faulty, the PEFT
integration is correct in these cases.
test_delete_adapter
The first faulty test was introduced by #34650. AFAICT, it should never
have passed in the first place, the PEFT integration logic was not
changed in the meantime. At this point, the logs for the PR CI are gone,
so I'm not sure if the test passed back then or not.
test_peft_pipeline_no_warning
This test was introduced in #36783 and should also never have passed, as
the self.assertNoLogs context manager only returns None, thus the assert
should never have worked (mea culpa for suggesting this code snippet).
Here too, the CI logs are deleted by now, so I can't check if the test
already failed back then.
- run:if [[ "$CIRCLE_PULL_REQUEST" == "" && "$CIRCLE_BRANCH" != "main" && "$CIRCLE_BRANCH" != *-release ]]; then echo "Not a PR, not the main branch and not a release branch, skip test!"; circleci-agent step halt; fi
gh pr comment $PR_NUMBER --repo $REPO --body "Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the \`Ready for review\` button (at the bottom of the PR page). This will assign reviewers and trigger CI."
name:Self-hosted runner scale set (AMD mi300 scheduled CI caller)
# Note: For every job in this workflow, the name of the runner scale set is finalized in the runner yaml i.e. huggingface/hf-workflows/.github/workflows/transformers_amd_ci_scheduled_arc_scale_set.yaml
# For example, 1gpu scale set: amd-mi300-ci-1gpu
# 2gpu scale set: amd-mi300-ci-2gpu
on:
workflow_run:
workflows:["Self-hosted runner (AMD scheduled CI caller)"]
name:Self-hosted runner scale set (AMD mi325 scheduled CI caller)
# Note: For every job in this workflow, the name of the runner scale set is finalized in the runner yaml i.e. huggingface/hf-workflows/.github/workflows/transformers_amd_ci_scheduled_arc_scale_set.yaml
# For example, 1gpu scale set: amd-mi325-ci-1gpu
# 2gpu scale set: amd-mi325-ci-2gpu
on:
workflow_run:
workflows:["Self-hosted runner (AMD scheduled CI caller)"]
This AGENTS.md file provides guidance for code agents working with this codebase.
## Core Project Structure
-`/src/transformers`: This contains the core source code for the library
-`/models`: Code for individual models. Models inherit from base classes in the root `/src/transformers` directory.
-`/tests`: This contains the core test classes for the library. These are usually inherited rather than directly run.
-`/models`: Tests for individual models. Model tests inherit from common tests in the root `/tests` directory.
-`/docs`: This contains the documentation for the library, including guides, tutorials, and API references.
## Coding Conventions for Hugging Face Transformers
- PRs should be as brief as possible. Bugfix PRs in particular can often be only one or two lines long, and do not need large comments, docstrings or new functions in this case. Aim to minimize the size of the diff.
- When writing tests, they should be added to an existing file. The only exception is for PRs to add a new model, when a new test directory should be created for that model.
- Code style is enforced in the CI. You can install the style tools with `pip install -e .[quality]`. You can then run `make fixup` to apply style and consistency fixes to your code.
## Copying and inheritance
Many models in the codebase have similar code, but it is not shared by inheritance because we want each model file to be self-contained.
We use two mechanisms to keep this code in sync:
- "Copied from" syntax. Functions or entire classes can have a comment at the top like this: `# Copied from transformers.models.llama.modeling_llama.rotate_half` or `# Copied from transformers.models.t5.modeling_t5.T5LayerNorm with T5->MT5`
These comments are actively checked by the style tools, and copies will automatically be updated when the base code is updated. If you need to update a copied function, you should
either update the base function and use `make fixup` to propagate the change to all copies, or simply remove the `# Copied from` comment if that is inappropriate.
- "Modular" files. These files briefly define models by composing them using inheritance from other models. They are not meant to be used directly. Instead, the style tools
automatically generate a complete modeling file, like `modeling_bert.py`, from the modular file like `modular_bert.py`. If a model has a modular file, the modeling file
should never be edited directly! Instead, changes should be made in the modular file, and then you should run `make fixup` to update the modeling file automatically.
When adding new models, you should prefer `modular` style.
## Testing
After making changes, you should usually run `make fixup` to ensure any copies and modular files are updated, and then test all affected models. This includes both
the model you made the changes in and any other models that were updated by `make fixup`. Tests can be run with `pytest tests/models/[name]/test_modeling_[name].py`
If your changes affect code in other classes like tokenizers or processors, you should run those tests instead, like `test_processing_[name].py` or `test_tokenization_[name].py`.
In order to run tests, you may need to install dependencies. You can do this with `pip install -e .[testing]`. You will probably also need to `pip install torch accelerate` if your environment does not already have them.
Transformers is a library of pretrained text, computer vision, audio, video, and multimodal models for inference and training. Use Transformers to fine-tune models on your data, build inference applications, and for generative AI use cases across multiple modalities.
There are over 500K+ Transformers [model checkpoints](https://huggingface.co/models?library=transformers&sort=trending) on the [Hugging Face Hub](https://huggingface.com/models) you can use.
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer
vision, audio, video, and multimodal model, for both inference and training.
It centralizes the model definition so that this definition is agreed upon across the ecosystem. `transformers` is the
pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training
and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from `transformers`.
We pledge to help support new state-of-the-art models and democratize their usage by having their model definition be
simple, customizable, and efficient.
There are over 1M+ Transformers [model checkpoints](https://huggingface.co/models?library=transformers&sort=trending) on the [Hugging Face Hub](https://huggingface.com/models) you can use.
Explore the [Hub](https://huggingface.com/) today to find a model and use Transformers to help you get started right away.
@ -78,7 +88,6 @@ Create and activate a virtual environment with [venv](https://docs.python.org/3/
# venv
python-mvenv.my-env
source.my-env/bin/activate
# uv
uvvenv.my-env
source.my-env/bin/activate
@ -88,10 +97,10 @@ Install Transformers in your virtual environment.
```py
# pip
pipinstalltransformers
pipinstall"transformers[torch]"
# uv
uvpipinstalltransformers
uvpipinstall"transformers[torch]"
```
Install Transformers from source if you want the latest changes in the library or are interested in contributing. However, the *latest* version may not be stable. Feel free to open an [issue](https://github.com/huggingface/transformers/issues) if you encounter an error.
@ -99,7 +108,12 @@ Install Transformers from source if you want the latest changes in the library o
DALL·E Flow is an interactive workflow for generating high-definition images from a text prompt. Itt leverages DALL·E-Mega, GLID-3 XL, and Stable Diffusion to generate image candidates, and then calls CLIP-as-service to rank the candidates w.r.t. the prompt.
DALL·E Flow is an interactive workflow for generating high-definition images from a text prompt. It leverages DALL·E-Mega, GLID-3 XL, and Stable Diffusion to generate image candidates, and then calls CLIP-as-service to rank the candidates w.r.t. the prompt.
The preferred candidate is fed to GLID-3 XL for diffusion, which often enriches the texture and background. Finally, the candidate is upscaled to 1024x1024 via SwinIR.
[underthesea](https://github.com/undertheseanlp/underthesea) is a Vietnamese NLP toolkit. Underthesea is a suite of open source Python modules data sets and tutorials supporting research and development in Vietnamese Natural Language Processing. We provides extremely easy API to quickly apply pretrained NLP models to your Vietnamese text, such as word segmentation, part-of-speech tagging (PoS), named entity recognition (NER), text classification and dependency parsing.
[underthesea](https://github.com/undertheseanlp/underthesea) is a Vietnamese NLP toolkit. Underthesea is a suite of open source Python modules data sets and tutorials supporting research and development in Vietnamese Natural Language Processing. We provide extremely easy API to quickly apply pretrained NLP models to your Vietnamese text, such as word segmentation, part-of-speech tagging (PoS), named entity recognition (NER), text classification and dependency parsing.
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch]
# `kernels` may give different outputs (within 1e-5 range) even with the same model (weights) and the same inputs
RUN python3 -m pip uninstall -y kernels
# Uninstall flash-attn installed by autoawq, it causes issues here : https://github.com/huggingface/transformers/actions/runs/15915442841/job/44892146131
RUN python3 -m pip uninstall -y flash-attn
# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
يُشهد في الآونة الأخيرة نمو مجال دراسي يُعنى باستكشاف آلية عمل نماذج المحولات الضخمة مثل BERT (والذي يُطلق عليها البعض اسم "BERTology"). ومن الأمثلة البارزة على هذا المجال ما يلي:
- BERT Rediscovers the Classical NLP Pipeline بواسطة Ian Tenney و Dipanjan Das و Ellie Pavlick:
https://arxiv.org/abs/1905.05950
- Are Sixteen Heads Really Better than One? بواسطة Paul Michel و Omer Levy و Graham Neubig: https://arxiv.org/abs/1905.10650
https://huggingface.co/papers/1905.05950
- Are Sixteen Heads Really Better than One? بواسطة Paul Michel و Omer Levy و Graham Neubig: https://huggingface.co/papers/1905.10650
- What Does BERT Look At? An Analysis of BERT's Attention بواسطة Kevin Clark و Urvashi Khandelwal و Omer Levy و Christopher D.
Manning: https://arxiv.org/abs/1906.04341
- CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure: https://arxiv.org/abs/2210.04633
Manning: https://huggingface.co/papers/1906.04341
- CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure: https://huggingface.co/papers/2210.04633
لإثراء هذا المجال الناشئ، قمنا بتضمين بعض الميزات الإضافية في نماذج BERT/GPT/GPT-2 للسماح للناس بالوصول إلى التمثيلات الداخلية، والتي تم تكييفها بشكل أساسي من العمل الرائد لـ Paul Michel (https://arxiv.org/abs/1905.10650):
لإثراء هذا المجال الناشئ، قمنا بتضمين بعض الميزات الإضافية في نماذج BERT/GPT/GPT-2 للسماح للناس بالوصول إلى التمثيلات الداخلية، والتي تم تكييفها بشكل أساسي من العمل الرائد لـ Paul Michel (https://huggingface.co/papers/1905.10650):
- الوصول إلى جميع الحالات المخفية في BERT/GPT/GPT-2،
- الوصول إلى جميع أوزان الانتباه لكل رأس في BERT/GPT/GPT-2،
- استرجاع قيم ومشتقات مخرجات الرأس لحساب درجة أهمية الرأس وحذفه كما هو موضح في https://arxiv.org/abs/1905.10650.
- استرجاع قيم ومشتقات مخرجات الرأس لحساب درجة أهمية الرأس وحذفه كما هو موضح في https://huggingface.co/papers/1905.10650.
ولمساعدتك على فهم واستخدام هذه الميزات بسهولة، أضفنا مثالًا برمجيًا محددًا: [bertology.py](https://github.com/huggingface/transformers-research-projects/tree/main/bertology/run_bertology.py) أثناء استخراج المعلومات وتقليص من نموذج تم تدريبه مسبقًا على GLUE.
في كل وحدة الانتباه الباقية في المحولات، تلي طبقة الاهتمام الانتباه عادة طبقتان للتغذية الأمامية.
حجم تضمين الطبقة الأمامية الوسيطة أكبر عادة من حجم المخفي للنموذج (على سبيل المثال، لـ
`google-bert/bert-base-uncased`).
بالنسبة لإدخال بحجم `[batch_size, sequence_length]`، يمكن أن تمثل الذاكرة المطلوبة لتخزين التضمينات الأمامية الوسيطة `[batch_size، sequence_length, config.intermediate_size]` جزءًا كبيرًا من استخدام الذاكرة. لاحظ مؤلفو (https://arxiv.org/abs/2001.04451)[Reformer: The Efficient Transformer] أنه نظرًا لأن الحساب مستقل عن بعد `sequence_length`، فإنه من المكافئ رياضيًا حساب تضمينات الإخراج الأمامية `[batch_size، config.hidden_size]_0, ..., [batch_size، `config_size]_n
بالنسبة لإدخال بحجم `[batch_size, sequence_length]`، يمكن أن تمثل الذاكرة المطلوبة لتخزين التضمينات الأمامية الوسيطة `[batch_size، sequence_length, config.intermediate_size]` جزءًا كبيرًا من استخدام الذاكرة. لاحظ مؤلفو (https://huggingface.co/papers/2001.04451)[Reformer: The Efficient Transformer] أنه نظرًا لأن الحساب مستقل عن بعد `sequence_length`، فإنه من المكافئ رياضيًا حساب تضمينات الإخراج الأمامية `[batch_size، config.hidden_size]_0, ..., [batch_size، `config_size]_n
فردياً والتوصيل بها لاحقًا إلى `[batch_size, sequence_length, config.hidden_size]` مع `n = sequence_length`، والذي يتداول زيادة وقت الحساب مقابل تقليل استخدام الذاكرة، ولكنه ينتج عنه نتيجة مكافئة رياضيا.
بالنسبة للنماذج التي تستخدم الدالة `[apply_chunking_to_forward]`، يحدد `chunk_size` عدد التضمينات يتم حساب الإخراج بالتوازي وبالتالي يحدد المقايضة بين حجم الذاكرة والتعقيد الوقت. إذا تم تعيين `chunk_size` إلى `0`، فلن يتم إجراء تجزئة التغذية الأمامية.
@ -173,7 +173,7 @@
<Youtubeid="VFp38yj8h3A"/>
يعمل كل محلل لغوي بشكل مختلف ولكن الآلية الأساسية تبقى كما هي. إليك مثال باستخدام محلل BERT اللغوي، والذي يعد محلل لغوي [WordPiece](https://arxiv.org/pdf/1609.08144.pdf):
يعمل كل محلل لغوي بشكل مختلف ولكن الآلية الأساسية تبقى كما هي. إليك مثال باستخدام محلل BERT اللغوي، والذي يعد محلل لغوي [WordPiece](https://huggingface.co/papers/1609.08144):
تحقق نماذج اللغة الكبيرة (LLMs) مثل GPT3/4، [Falcon](https://huggingface.co/tiiuae/falcon-40b)، و [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf) تقدمًا سريعًا في قدرتها على معالجة المهام التي تركز على الإنسان، مما يجعلها أدوات أساسية في الصناعات القائمة على المعرفة الحديثة.
لا يزال نشر هذه النماذج في المهام الواقعية يمثل تحديًا، ومع ذلك:
- لكي تظهر نماذج اللغة الكبيرة قدرات فهم وتوليد النصوص قريبة من قدرات الإنسان، فإنها تتطلب حاليًا إلى تكوينها من مليارات المعلمات (انظر [كابلان وآخرون](https://arxiv.org/abs/2001.08361)، [وي وآخرون](https://arxiv.org/abs/2206.07682)). وهذا بدوره يزيد من متطلبات الذاكرة للاستدلال.
- لكي تظهر نماذج اللغة الكبيرة قدرات فهم وتوليد النصوص قريبة من قدرات الإنسان، فإنها تتطلب حاليًا إلى تكوينها من مليارات المعلمات (انظر [كابلان وآخرون](https://huggingface.co/papers/2001.08361)، [وي وآخرون](https://huggingface.co/papers/2206.07682)). وهذا بدوره يزيد من متطلبات الذاكرة للاستدلال.
- في العديد من المهام الواقعية، تحتاج نماذج اللغة الكبيرة إلى معلومات سياقية شاملة. يتطلب ذلك قدرة النموذج على إدارة تسلسلات إدخال طويلة للغاية أثناء الاستدلال.
يكمن جوهر صعوبة هذه التحديات في تعزيز القدرات الحسابية والذاكرة لنماذج اللغة الكبيرة، خاصة عند التعامل مع تسلسلات الإدخال الضخمة.
@ -17,7 +17,7 @@
2.**اFlash Attention:** إن Flash Attention وهي نسخة مُعدَّلة من خوارزمية الانتباه التي لا توفر فقط نهجًا أكثر كفاءة في استخدام الذاكرة، ولكنها تحقق أيضًا كفاءة متزايدة بسبب الاستخدام الأمثل لذاكرة GPU.
3.**الابتكارات المعمارية:** حيث تم اقتراح هياكل متخصصة تسمح باستدلال أكثر فعالية نظرًا لأن نماذج اللغة الكبيرة يتم نشرها دائمًا بنفس الطريقة أثناء عملية الاستدلال، أي توليد النص التنبؤي التلقائي مع سياق الإدخال الطويل، فقد تم اقتراح بنيات نموذج متخصصة تسمح بالاستدلال الأكثر كفاءة. أهم تقدم في بنيات النماذج هنا هو [عذر](https://arxiv.org/abs/2108.12409)، [الترميز الدوار](https://arxiv.org/abs/2104.09864)، [الاهتمام متعدد الاستعلامات (MQA)](https://arxiv.org/abs/1911.02150) و [مجموعة الانتباه بالاستعلام (GQA)]((https://arxiv.org/abs/2305.13245)).
3.**الابتكارات المعمارية:** حيث تم اقتراح هياكل متخصصة تسمح باستدلال أكثر فعالية نظرًا لأن نماذج اللغة الكبيرة يتم نشرها دائمًا بنفس الطريقة أثناء عملية الاستدلال، أي توليد النص التنبؤي التلقائي مع سياق الإدخال الطويل، فقد تم اقتراح بنيات نموذج متخصصة تسمح بالاستدلال الأكثر كفاءة. أهم تقدم في بنيات النماذج هنا هو [عذر](https://huggingface.co/papers/2108.12409)، [الترميز الدوار](https://huggingface.co/papers/2104.09864)، [الاهتمام متعدد الاستعلامات (MQA)](https://huggingface.co/papers/1911.02150) و [مجموعة الانتباه بالاستعلام (GQA)]((https://huggingface.co/papers/2305.13245)).
على مدار هذا الدليل، سنقدم تحليلًا للتوليد التنبؤي التلقائي من منظور المُوتِّرات. نتعمق في مزايا وعيوب استخدام دقة أقل، ونقدم استكشافًا شاملاً لخوارزميات الانتباه الأحدث، ونناقش بنيات نماذج نماذج اللغة الكبيرة المحسنة. سندعم الشرح بأمثلة عملية تُبرِز كل تحسين على حدة.
@ -152,8 +152,8 @@ from accelerate.utils import release_memory
release_memory(model)
```
والآن ماذا لو لم يكن لدى وحدة معالجة الرسومات (GPU) لديك 32 جيجا بايت من ذاكرة الفيديو العشوائية (VRAM)؟ لقد وجد أن أوزان النماذج يمكن تحويلها إلى 8 بتات أو 4 بتات دون خسارة كبيرة في الأداء (انظر [Dettmers et al.](https://arxiv.org/abs/2208.07339)).
يمكن تحويل النموذج إلى 3 بتات أو 2 بتات مع فقدان مقبول في الأداء كما هو موضح في ورقة [GPTQ](https://arxiv.org/abs/2210.17323) 🤯.
والآن ماذا لو لم يكن لدى وحدة معالجة الرسومات (GPU) لديك 32 جيجا بايت من ذاكرة الفيديو العشوائية (VRAM)؟ لقد وجد أن أوزان النماذج يمكن تحويلها إلى 8 بتات أو 4 بتات دون خسارة كبيرة في الأداء (انظر [Dettmers et al.](https://huggingface.co/papers/2208.07339)).
يمكن تحويل النموذج إلى 3 بتات أو 2 بتات مع فقدان مقبول في الأداء كما هو موضح في ورقة [GPTQ](https://huggingface.co/papers/2210.17323) 🤯.
دون الدخول في الكثير من التفاصيل، تهدف مخططات التكميم إلى تخفيض دقة الأوزان مع محاولة الحفاظ على دقة نتائج النموذج كما هي (*أي* أقرب ما يمكن إلى bfloat16).
لاحظ أن التكميم يعمل بشكل خاص جيدًا لتوليد النص حيث كل ما نهتم به هو اختيار *مجموعة الرموز الأكثر احتمالًا التالية* ولا نهتم حقًا بالقيم الدقيقة لتوزيع الرمز التالي *logit*.
@ -231,7 +231,7 @@ flush()
دعنا نرى ما هو استهلاك ذاكرة GPU الذروة الذي يوفره تكميم 4 بت. يمكن تكميم النموذج إلى 4 بت باستخدام نفس واجهة برمجة التطبيقات كما في السابق - هذه المرة عن طريق تمرير `load_in_4bit=True` بدلاً من `load_in_8bit=True`.
```python
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_4bit=True, low_cpu_mem_usage=True, pad_token_id=0)
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_4bit=True, pad_token_id=0)
مع تحسن LLMs في فهم النص وتوليد النص، يتم تطبيقها على مهام متزايدة التعقيد. في حين أن النماذج كانت تتعامل سابقًا مع ترجمة أو تلخيص بضع جمل، فإنها الآن تدير صفحات كاملة، مما يتطلب القدرة على معالجة أطوال إدخال واسعة.
كيف يمكننا التخلص من متطلبات الذاكرة الباهظة للتطويلات المدخلة الكبيرة؟ نحن بحاجة إلى طريقة جديدة لحساب آلية الاهتمام الذاتي التي تتخلص من مصفوفة \\( QK^T \\). [طريقه داو وآخرون.](Https://arxiv.org/abs/2205.14135) طوروا بالضبط مثل هذا الخوارزمية الجديدة وأطلقوا عليها اسم **Flash Attention**.
كيف يمكننا التخلص من متطلبات الذاكرة الباهظة للتطويلات المدخلة الكبيرة؟ نحن بحاجة إلى طريقة جديدة لحساب آلية الاهتمام الذاتي التي تتخلص من مصفوفة \\( QK^T \\). [طريقه داو وآخرون.](https://huggingface.co/papers/2205.14135) طوروا بالضبط مثل هذا الخوارزمية الجديدة وأطلقوا عليها اسم **Flash Attention**.
باختصار، يكسر الاهتمام الفلاشي حساب \\( \mathbf{V} \times \operatorname{Softmax}(\mathbf{QK}^T\\)) ويحسب بدلاً من ذلك قطعًا أصغر من الإخراج عن طريق التكرار عبر العديد من خطوات حساب Softmax:
> من خلال تتبع إحصائيات التطبيع softmax واستخدام بعض الرياضيات الذكية، يعطي Flash Attention **مخرجات متطابقة رقميًا** مقارنة بطبقة الاهتمام الذاتي الافتراضية بتكلفة ذاكرة لا تزيد خطيًا مع \\( N \\).
عند النظر إلى الصيغة، قد يقول المرء بديهيًا أن الاهتمام الفلاشي يجب أن يكون أبطأ بكثير مقارنة بصيغة الاهتمام الافتراضية حيث يلزم إجراء المزيد من الحسابات. في الواقع، يتطلب Flash Attention المزيد من عمليات الفاصلة العائمة مقارنة بالاهتمام العادي حيث يجب إعادة حساب إحصائيات التطبيع softmax باستمرار (راجع [الورقة](https://arxiv.org/abs/2205.14135) لمزيد من التفاصيل إذا كنت مهتمًا)
عند النظر إلى الصيغة، قد يقول المرء بديهيًا أن الاهتمام الفلاشي يجب أن يكون أبطأ بكثير مقارنة بصيغة الاهتمام الافتراضية حيث يلزم إجراء المزيد من الحسابات. في الواقع، يتطلب Flash Attention المزيد من عمليات الفاصلة العائمة مقارنة بالاهتمام العادي حيث يجب إعادة حساب إحصائيات التطبيع softmax باستمرار (راجع [الورقة](https://huggingface.co/papers/2205.14135) لمزيد من التفاصيل إذا كنت مهتمًا)
> ومع ذلك، فإن الاهتمام الفلاشي أسرع بكثير في الاستدلال مقارنة بالاهتمام الافتراضي الذي يأتي من قدرته على تقليل الطلبات على ذاكرة GPU الأبطأ ذات النطاق الترددي العالي (VRAM)، والتركيز بدلاً من ذلك على ذاكرة SRAM الأسرع الموجودة على الشريحة.
@ -535,20 +535,20 @@ flush()
لكي يفهم LLM ترتيب الجملة، يلزم وجود *إشارة* إضافية ويتم تطبيقها عادةً في شكل *الترميزات الموضعية* (أو ما يُطلق عليه أيضًا *الترميزات الموضعية*).
لم يتم ترجمة النص الخاص والروابط وأكواد HTML وCSS بناءً على طلبك.
قدم مؤلفو الورقة البحثية [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) تضمينات موضعية جيبية مثلثية \\( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \\) حيث يتم حساب كل متجه \\( \mathbf{p}_i \\) كدالة جيبية لموضعه \\( i \\) .
قدم مؤلفو الورقة البحثية [*Attention Is All You Need*](https://huggingface.co/papers/1706.03762) تضمينات موضعية جيبية مثلثية \\( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \\) حيث يتم حساب كل متجه \\( \mathbf{p}_i \\) كدالة جيبية لموضعه \\( i \\) .
بعد ذلك يتم ببساطة إضافة التضمينات الموضعية إلى متجهات تسلسل الإدخال \\( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N \\) = \\( \mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N \\) وبالتالي توجيه النموذج لتعلم ترتيب الجملة بشكل أفضل.
بدلاً من استخدام التضمينات الموضعية الثابتة، استخدم آخرون (مثل [Devlin et al.](https://arxiv.org/abs/1810.04805)) تضمينات موضعية مكتسبة يتم من خلالها تعلم التضمينات الموضعية \\( \mathbf{P} \\) أثناء التدريب.
بدلاً من استخدام التضمينات الموضعية الثابتة، استخدم آخرون (مثل [Devlin et al.](https://huggingface.co/papers/1810.04805)) تضمينات موضعية مكتسبة يتم من خلالها تعلم التضمينات الموضعية \\( \mathbf{P} \\) أثناء التدريب.
كانت التضمينات الموضعية الجيبية والمكتسبة هي الطرق السائدة لترميز ترتيب الجملة في نماذج اللغة الكبيرة، ولكن تم العثور على بعض المشكلات المتعلقة بهذه التضمينات الموضعية:
1. التضمينات الموضعية الجيبية والمكتسبة هي تضمينات موضعية مطلقة، أي ترميز تضمين فريد لكل معرف موضعي: \\( 0, \ldots, N \\) . كما أظهر [Huang et al.](https://arxiv.org/abs/2009.13658) و [Su et al.](https://arxiv.org/abs/2104.09864)، تؤدي التضمينات الموضعية المطلقة إلى أداء ضعيف لنماذج اللغة الكبيرة للمدخلات النصية الطويلة. بالنسبة للمدخلات النصية الطويلة، يكون من المفيد إذا تعلم النموذج المسافة الموضعية النسبية التي تمتلكها رموز المدخلات إلى بعضها البعض بدلاً من موضعها المطلق.
1. التضمينات الموضعية الجيبية والمكتسبة هي تضمينات موضعية مطلقة، أي ترميز تضمين فريد لكل معرف موضعي: \\( 0, \ldots, N \\) . كما أظهر [Huang et al.](https://huggingface.co/papers/2009.13658) و [Su et al.](https://huggingface.co/papers/2104.09864)، تؤدي التضمينات الموضعية المطلقة إلى أداء ضعيف لنماذج اللغة الكبيرة للمدخلات النصية الطويلة. بالنسبة للمدخلات النصية الطويلة، يكون من المفيد إذا تعلم النموذج المسافة الموضعية النسبية التي تمتلكها رموز المدخلات إلى بعضها البعض بدلاً من موضعها المطلق.
2. عند استخدام التضمينات الموضعية المكتسبة، يجب تدريب نموذج اللغة الكبيرة على طول إدخال ثابت \\( N \\)، مما يجعل من الصعب الاستقراء إلى طول إدخال أطول مما تم تدريبه عليه.
في الآونة الأخيرة، أصبحت التضمينات الموضعية النسبية التي يمكنها معالجة المشكلات المذكورة أعلاه أكثر شعبية، وأبرزها:
يؤكد كل من *RoPE* و *ALiBi* أنه من الأفضل توجيه نموذج اللغة الكبيرة حول ترتيب الجملة مباشرة في خوارزمية الانتباه الذاتي حيث يتم وضع رموز الكلمات في علاقة مع بعضها البعض. على وجه التحديد، يجب توجيه ترتيب الجملة عن طريق تعديل عملية \\( \mathbf{QK}^T \\) .
كبديل، يقترح *ALiBi* مخطط ترميز موضعي نسبي أبسط بكثير. يتم إضافة المسافة النسبية التي تمتلكها رموز المدخلات إلى بعضها البعض كعدد صحيح سلبي مقياس بقيمة محددة مسبقًا `m` إلى كل إدخال استعلام-مفتاح لمصفوفة \\( \mathbf{QK}^T \\) مباشرة قبل حساب softmax.

كما هو موضح في ورقة [ALiBi](https://arxiv.org/abs/2108.12409)، يسمح هذا الترميز الموضعي النسبي البسيط للنموذج بالحفاظ على أداء عالٍ حتى في تسلسلات المدخلات النصية الطويلة جدًا.
كما هو موضح في ورقة [ALiBi](https://huggingface.co/papers/2108.12409)، يسمح هذا الترميز الموضعي النسبي البسيط للنموذج بالحفاظ على أداء عالٍ حتى في تسلسلات المدخلات النصية الطويلة جدًا.
يُستخدم *ALiBi* في العديد من أهم نماذج اللغة الكبيرة المستخدمة اليوم، مثل:
يمكن لكل من ترميزات الموضع *RoPE* و *ALiBi* الاستقراء إلى أطوال إدخال لم يتم ملاحظتها أثناء التدريب، في حين ثبت أن الاستقراء يعمل بشكل أفضل بكثير خارج الصندوق لـ *ALiBi* مقارنة بـ *RoPE*.
بالنسبة لـ ALiBi، ما عليك سوى زيادة قيم مصفوفة الموضع المثلث السفلي لمطابقة طول تسلسل الإدخال.
بالنسبة لـ *RoPE*، يؤدي الحفاظ على نفس \\( \theta \\) الذي تم استخدامه أثناء التدريب إلى نتائج سيئة عند تمرير إدخالات نصية أطول بكثير من تلك التي شوهدت أثناء التدريب، راجع [Press et al.](https://arxiv.org/abs/2108.12409). ومع ذلك، وجد المجتمع بعض الحيل الفعالة التي تقوم بتعديل \\( \theta \\)، مما يسمح لترميزات الموضع *RoPE* بالعمل بشكل جيد لتسلسلات إدخال النص المستقرئة (راجع [هنا](https://github.com/huggingface/transformers/pull/24653)).
بالنسبة لـ *RoPE*، يؤدي الحفاظ على نفس \\( \theta \\) الذي تم استخدامه أثناء التدريب إلى نتائج سيئة عند تمرير إدخالات نصية أطول بكثير من تلك التي شوهدت أثناء التدريب، راجع [Press et al.](https://huggingface.co/papers/2108.12409). ومع ذلك، وجد المجتمع بعض الحيل الفعالة التي تقوم بتعديل \\( \theta \\)، مما يسمح لترميزات الموضع *RoPE* بالعمل بشكل جيد لتسلسلات إدخال النص المستقرئة (راجع [هنا](https://github.com/huggingface/transformers/pull/24653)).
> كل من RoPE و ALiBi عبارة عن ترميزات موضع نسبي *لا* يتم تعلمها أثناء التدريب، ولكن بدلاً من ذلك تستند إلى الحدس التالي:
- يجب إعطاء الإشارات الموضعية حول إدخالات النص مباشرة إلى مصفوفة \\( QK^T \\) لطبقة الاهتمام الذاتي
@ -755,21 +755,21 @@ Roughly 8 مليار قيمة عائمة! يتطلب تخزين 8 مليارات
#### 3.2.2 Multi-Query-Attention (MQA)
[Multi-Query-Attention](https://arxiv.org/abs/1911.02150) اقترحها Noam Shazeer في ورقته *Fast Transformer Decoding: One Write-Head is All You Need*. كما يقول العنوان، اكتشف Noam أنه بدلاً من استخدام `n_head` من أوزان إسقاط القيمة الرئيسية، يمكن استخدام زوج واحد من أوزان إسقاط رأس القيمة التي يتم مشاركتها عبر جميع رؤوس الاهتمام دون أن يتدهور أداء النموذج بشكل كبير.
[Multi-Query-Attention](https://huggingface.co/papers/1911.02150) اقترحها Noam Shazeer في ورقته *Fast Transformer Decoding: One Write-Head is All You Need*. كما يقول العنوان، اكتشف Noam أنه بدلاً من استخدام `n_head` من أوزان إسقاط القيمة الرئيسية، يمكن استخدام زوج واحد من أوزان إسقاط رأس القيمة التي يتم مشاركتها عبر جميع رؤوس الاهتمام دون أن يتدهور أداء النموذج بشكل كبير.
> باستخدام زوج واحد من أوزان إسقاط رأس القيمة، يجب أن تكون متجهات القيمة الرئيسية \\( \mathbf{k}_i، \mathbf{v}_i \\) متطابقة عبر جميع رؤوس الاهتمام والتي بدورها تعني أننا بحاجة فقط إلى تخزين زوج إسقاط قيمة رئيسي واحد في ذاكرة التخزين المؤقت بدلاً من `n_head` منها.
نظرًا لأن معظم LLMs تستخدم ما بين 20 و100 رأس اهتمام، فإن MQA يقلل بشكل كبير من استهلاك الذاكرة لذاكرة التخزين المؤقت key-value. بالنسبة إلى LLM المستخدم في هذا الدفتر، يمكننا تقليل استهلاك الذاكرة المطلوبة من 15 جيجابايت إلى أقل من 400 ميجابايت عند طول تسلسل الإدخال 16000.
بالإضافة إلى توفير الذاكرة، يؤدي MQA أيضًا إلى تحسين الكفاءة الحسابية كما هو موضح في ما يلي.
في فك التشفير التلقائي، يجب إعادة تحميل متجهات القيمة الرئيسية الكبيرة، ودمجها مع زوج متجه القيمة الحالي، ثم إدخالها في \\( \mathbf{q}_c\mathbf{K}^T \\) الحساب في كل خطوة. بالنسبة لفك التشفير التلقائي، يمكن أن تصبح عرض النطاق الترددي للذاكرة المطلوبة لإعادة التحميل المستمر عنق زجاجة زمنيًا خطيرًا. من خلال تقليل حجم متجهات القيمة الرئيسية، يجب الوصول إلى ذاكرة أقل، وبالتالي تقليل عنق الزجاجة في عرض النطاق الترددي للذاكرة. لمزيد من التفاصيل، يرجى إلقاء نظرة على [ورقة Noam](https://arxiv.org/abs/1911.02150).
في فك التشفير التلقائي، يجب إعادة تحميل متجهات القيمة الرئيسية الكبيرة، ودمجها مع زوج متجه القيمة الحالي، ثم إدخالها في \\( \mathbf{q}_c\mathbf{K}^T \\) الحساب في كل خطوة. بالنسبة لفك التشفير التلقائي، يمكن أن تصبح عرض النطاق الترددي للذاكرة المطلوبة لإعادة التحميل المستمر عنق زجاجة زمنيًا خطيرًا. من خلال تقليل حجم متجهات القيمة الرئيسية، يجب الوصول إلى ذاكرة أقل، وبالتالي تقليل عنق الزجاجة في عرض النطاق الترددي للذاكرة. لمزيد من التفاصيل، يرجى إلقاء نظرة على [ورقة Noam](https://huggingface.co/papers/1911.02150).
الجزء المهم الذي يجب فهمه هنا هو أن تقليل عدد رؤوس الاهتمام بالقيمة الرئيسية إلى 1 لا معنى له إلا إذا تم استخدام ذاكرة التخزين المؤقت للقيمة الرئيسية. يظل الاستهلاك الذروي لذاكرة النموذج لمرور واحد للأمام بدون ذاكرة التخزين المؤقت للقيمة الرئيسية دون تغيير لأن كل رأس اهتمام لا يزال لديه متجه استعلام فريد بحيث يكون لكل رأس اهتمام مصفوفة \\( \mathbf{QK}^T \\) مختلفة.
شهدت MQA اعتمادًا واسع النطاق من قبل المجتمع ويتم استخدامها الآن بواسطة العديد من LLMs الأكثر شهرة:
@ -777,7 +777,7 @@ Roughly 8 مليار قيمة عائمة! يتطلب تخزين 8 مليارات
#### 3.2.3 مجموعة الاستعلام الاهتمام (GQA)
[مجموعة الاستعلام الاهتمام](https://arxiv.org/abs/2305.13245)، كما اقترح Ainslie et al. من Google، وجد أن استخدام MQA يمكن أن يؤدي غالبًا إلى تدهور الجودة مقارنة باستخدام إسقاطات رأس القيمة الرئيسية المتعددة. تجادل الورقة بأنه يمكن الحفاظ على أداء النموذج بشكل أكبر عن طريق تقليل عدد أوزان إسقاط رأس الاستعلام بشكل أقل حدة. بدلاً من استخدام وزن إسقاط قيمة رئيسية واحدة فقط، يجب استخدام `n <n_head` أوزان إسقاط قيمة رئيسية. من خلال اختيار `n` إلى قيمة أقل بكثير من `n_head`،مثل2أو4أو8،يمكنالاحتفاظبمعظممكاسبالذاكرةوالسرعةمنMQAمعالتضحيةبقدرأقلمنسعةالنموذجوبالتالي،منالمفترض،أقلأداء.
[مجموعة الاستعلام الاهتمام](https://huggingface.co/papers/2305.13245)، كما اقترح Ainslie et al. من Google، وجد أن استخدام MQA يمكن أن يؤدي غالبًا إلى تدهور الجودة مقارنة باستخدام إسقاطات رأس القيمة الرئيسية المتعددة. تجادل الورقة بأنه يمكن الحفاظ على أداء النموذج بشكل أكبر عن طريق تقليل عدد أوزان إسقاط رأس الاستعلام بشكل أقل حدة. بدلاً من استخدام وزن إسقاط قيمة رئيسية واحدة فقط، يجب استخدام `n <n_head` أوزان إسقاط قيمة رئيسية. من خلال اختيار `n` إلى قيمة أقل بكثير من `n_head`،مثل2أو4أو8،يمكنالاحتفاظبمعظممكاسبالذاكرةوالسرعةمنMQAمعالتضحيةبقدرأقلمنسعةالنموذجوبالتالي،منالمفترض،أقلأداء.
السببفيأنLLMsالضخمةمثلGPT3/4،وLlama-2-70b،وClaude،وPaLMيمكنأنتعملبسرعةكبيرةفيواجهاتالدردشةمثل [Hugging Face Chat](https://huggingface.co/chat/) أوChatGPTيرجعإلىحدكبيرإلىالتحسيناتالمذكورةأعلاهفيالدقةوالخوارزمياتوالهندسةالمعمارية.
قبل مشاركة نموذج على Hub، ستحتاج إلى بيانات اعتماد حساب Hugging Face الخاصة بك. إذا كنت تستخدم منصة الأوامر، فقم بتشغيل الأمر التالي في بيئة افتراضية حيث تم تثبيت 🤗 Transformers. سيقوم هذا الأمر بتخزين رمز الدخول الخاص بك في مجلد تخزين المؤقت لـ Hugging Face (`~/.cache/` بشكل افتراضي):
```bash
huggingface-cli login
hf auth login
```
إذا كنت تستخدم دفتر ملاحظات مثل Jupyter أو Colaboratory، فتأكد من تثبيت مكتبة [`huggingface_hub`](https://huggingface.co/docs/hub/adding-a-library). تسمح لك هذه المكتبة بالتفاعل برمجيًا مع Hub.
منذ إطلاقه في عام 2017، ألهم نموذج [المحول الأصلي](https://arxiv.org/abs/1706.03762) (راجع مدونة [المحول المشروح](http://nlp.seas.harvard.edu/2018/04/03/attention.html) لمقدمة تقنية مبسطة)، ألهم العديد من النماذج الجديدة والمبتكرة التي تتجاوز مهام معالجة اللغات الطبيعية (NLP). هناك نماذج للتنبؤ [بالبنية البروتينات المطوية](https://huggingface.co/blog/deep-learning-with-proteins)، و[تدريب على اتخاذ القرار](https://huggingface.co/blog/train-decision-transformers)، و[التنبؤ بالسلاسل الزمنية](https://huggingface.co/blog/time-series-transformers). مع وجود العديد من متغيرات المحول المتاحة، قد يكون من السهل أن تفوتك الصورة الأكبر. ما تشترك فيه جميع هذه النماذج هو أنها تستند إلى بنية المحول الأصلية. تستخدم بعض النماذج فقط الترميز أو فك الترميز، بينما تستخدم نماذج أخرى كليهما. يوفر هذا تصنيفًا مفيدًا لتصنيف واستعراض الفروقات الرئيسية بين نماذج عائلة المحولات، وسيساعدك على فهم النماذج التي لم تصادفها من قبل.
منذ إطلاقه في عام 2017، ألهم نموذج [المحول الأصلي](https://huggingface.co/papers/1706.03762) (راجع مدونة [المحول المشروح](http://nlp.seas.harvard.edu/2018/04/03/attention.html) لمقدمة تقنية مبسطة)، ألهم العديد من النماذج الجديدة والمبتكرة التي تتجاوز مهام معالجة اللغات الطبيعية (NLP). هناك نماذج للتنبؤ [بالبنية البروتينات المطوية](https://huggingface.co/blog/deep-learning-with-proteins)، و[تدريب على اتخاذ القرار](https://huggingface.co/blog/train-decision-transformers)، و[التنبؤ بالسلاسل الزمنية](https://huggingface.co/blog/time-series-transformers). مع وجود العديد من متغيرات المحول المتاحة، قد يكون من السهل أن تفوتك الصورة الأكبر. ما تشترك فيه جميع هذه النماذج هو أنها تستند إلى بنية المحول الأصلية. تستخدم بعض النماذج فقط الترميز أو فك الترميز، بينما تستخدم نماذج أخرى كليهما. يوفر هذا تصنيفًا مفيدًا لتصنيف واستعراض الفروقات الرئيسية بين نماذج عائلة المحولات، وسيساعدك على فهم النماذج التي لم تصادفها من قبل.
إذا لم تكن على دراية بنموذج المحول الأصلي أو تحتاج إلى تذكير، فراجع الفصل الخاص بـ [كيف تعمل المحولات](https://huggingface.co/course/chapter1/4؟fw=pt) من دورة Hugging Face.
@ -14,7 +14,7 @@
### الشبكة التلافيفية (Convolutional network)
لطالما كانت الشبكات التلافيفية (CNNs) الطريقة السائدة لمهام رؤية الحاسب حتى برز [محول الرؤية](https://arxiv.org/abs/2010.11929) قابليته للتطوير وكفاءته العالية. وحتى بعد ذلك، لا تزال بعض أفضل صفات CNN، مثل ثبات الإزاحة، قوية جدًا (خاصة بالنسبة لمهام معينة) لدرجة أن بعض المحولات تدمج التلافيف في بنيتها. قلب [ConvNeXt](model_doc/convnext) هذا التبادل رأسًا على عقب وأدرج خيارات التصميم من المحولات لتحديث CNN. على سبيل المثال، يستخدم ConvNeXt نوافذ منزلقة غير متداخلة لتقسيم الصورة إلى رقع وزيادة حقل مجال العام الخاص بها. كما يقوم ConvNeXt بعدة خيارات مثل تصميم الطبقة لتكون أكثر كفاءة في الذاكرة وتحسين الأداء، مما يجعله منافسًا قويًا للمحولات!
لطالما كانت الشبكات التلافيفية (CNNs) الطريقة السائدة لمهام رؤية الحاسب حتى برز [محول الرؤية](https://huggingface.co/papers/2010.11929) قابليته للتطوير وكفاءته العالية. وحتى بعد ذلك، لا تزال بعض أفضل صفات CNN، مثل ثبات الإزاحة، قوية جدًا (خاصة بالنسبة لمهام معينة) لدرجة أن بعض المحولات تدمج التلافيف في بنيتها. قلب [ConvNeXt](model_doc/convnext) هذا التبادل رأسًا على عقب وأدرج خيارات التصميم من المحولات لتحديث CNN. على سبيل المثال، يستخدم ConvNeXt نوافذ منزلقة غير متداخلة لتقسيم الصورة إلى رقع وزيادة حقل مجال العام الخاص بها. كما يقوم ConvNeXt بعدة خيارات مثل تصميم الطبقة لتكون أكثر كفاءة في الذاكرة وتحسين الأداء، مما يجعله منافسًا قويًا للمحولات!
### الترميز[[cv-encoder]] (Encoder)
@ -40,7 +40,7 @@
نموذج [BERT](model_doc/bert) هو محوّل (Transformer) يعتمد على الترميز فقط يقوم بشكل عشوائي بإخفاء رموز معينة في المدخلات لتجنب رؤية باقى الرموز الأخرى، مما يسمح له "بالغش". يتمثل هدف التدريب المسبق في التنبؤ بالرمز المخفي بناءً على السياق. يسمح هذا لـ BERT باستخدام السياقات اليمنى واليسرى بالكامل لمساعدته في تعلم تمثيل أعمق وأغنى للبيانات المدخلة. ومع ذلك، كان هناك مجال للتحسين في استراتيجية التدريب المسبق لـ BERT. نموذج [RoBERTa](model_doc/roberta) اضاف تحسين من خلال تقديم وصفة تدريب مسبق جديدة تشمل التدريب لفترة أطول وعلى دفعات أكبر، وإخفاء الرموز عشوائيًا في كل حقبة بدلاً من مرة واحدة فقط أثناء المعالجة المسبقة، وإزالة هدف التنبؤ بالجملة التالية.
تتمثل الاستراتيجية السائدة لتحسين الأداء في زيادة حجم النموذج. ولكن تدريب النماذج الكبيرة مكلف من الناحية الحسابية. إحدى طرق تقليل التكاليف الحسابية هي استخدام نموذج أصغر مثل [DistilBERT](model_doc/distilbert). يستخدم DistilBERT [ تقنية تقطير المعرفة](https://arxiv.org/abs/1503.02531) - وهي تقنية ضغط - لإنشاء نموذج أصغر من BERT مع الحفاظ على معظم قدراته على فهم اللغةا.
تتمثل الاستراتيجية السائدة لتحسين الأداء في زيادة حجم النموذج. ولكن تدريب النماذج الكبيرة مكلف من الناحية الحسابية. إحدى طرق تقليل التكاليف الحسابية هي استخدام نموذج أصغر مثل [DistilBERT](model_doc/distilbert). يستخدم DistilBERT [ تقنية تقطير المعرفة](https://huggingface.co/papers/1503.02531) - وهي تقنية ضغط - لإنشاء نموذج أصغر من BERT مع الحفاظ على معظم قدراته على فهم اللغةا.
مرت معظم نماذج المحول في الاتجاه نحو المزيد من المعلمات، مما أدى إلى ظهور نماذج جديدة تركز على تحسين كفاءة التدريب. يقلّل [ALBERT](model_doc/albert) من استهلاك الذاكرة عن طريق تقليل عدد المعلمات بطريقتين: فصل تضمين المفردات الأكبر إلى مصفوفتين أصغر والسماح للمستويات بمشاركة المعلمات. أضاف [DeBERTa](model_doc/deberta) آلية انتباه منفصلة حيث يتم ترميز الكلمة وموضعها بشكل منفصل في متجهين. يتم حساب الانتباه من هذه المتجهات المنفصلة بدلاً من متجه واحد يحتوي على تضمين الكلمة والموقع. ركز [Longformer](model_doc/longformer) أيضًا على جعل الانتباه أكثر كفاءة، خاصة لمعالجة المستندات ذات تسلسلات أطولل. فهو يستخدم مزيجًا من انتباه النوافذ المحلية (يتم حساب الانتباه فقط ن نافذة ذات حجم ثابت حول كل رمز) والانتباه العام (فقط لرموز مهمة محددة مثل `[CLS]` للتصنيف) لإنشاء مصفوفة انتباه متفرقة بدلاً من مصفوفة انتباه كاملة.
إذا كنت تريد استخدام طرق PEFT الأخرى، مثل تعلم المحث أو ضبط المحث، أو حول مكتبة 🤗 PEFT بشكل عام، يرجى الرجوع إلى [الوثائق](https://huggingface.co/docs/peft/index).
<small>عملية التفاف أساسية بدون حشو أو خطو خطوة واسعة، مأخوذة من <ahref="https://arxiv.org/abs/1603.07285">دليل لحساب الالتفاف للتعلم العميق.</a></small>
<small>عملية التفاف أساسية بدون حشو أو خطو خطوة واسعة، مأخوذة من <ahref="https://huggingface.co/papers/1603.07285">دليل لحساب الالتفاف للتعلم العميق.</a></small>
يمكنك تغذية هذا الناتج إلى طبقة التفاف أخرى، ومع كل طبقة متتالية، تتعلم الشبكة أشياء أكثر تعقيدًا وتجريدية مثل النقانق أو الصواريخ. بين طبقات الالتفاف، من الشائع إضافة طبقة تجميع لتقليل الأبعاد وجعل النموذج أكثر قوة للتغيرات في موضع الميزة.
تم تقديم رميز أزواج البايت (BPE) في ورقة بحثية بعنوان [الترجمة الآلية العصبية للكلمات النادرة باستخدام وحدات subword (Sennrich et al.، 2015)](https://arxiv.org/abs/1508.07909). يعتمد BPE على مُجزّئ أولي يقسم بيانات التدريب إلى
تم تقديم رميز أزواج البايت (BPE) في ورقة بحثية بعنوان [الترجمة الآلية العصبية للكلمات النادرة باستخدام وحدات subword (Sennrich et al.، 2015)](https://huggingface.co/papers/1508.07909). يعتمد BPE على مُجزّئ أولي يقسم بيانات التدريب إلى
كلمات. يمكن أن يكون التحليل المسبق بسيطًا مثل التقسيم المكاني، على سبيل المثال [GPT-2](model_doc/gpt2)، [RoBERTa](model_doc/roberta). تشمل التقسيم الأكثر تقدمًا معتمد على التحليل القائم على القواعد، على سبيل المثال [XLM](model_doc/xlm)، [FlauBERT](model_doc/flaubert) الذي يستخدم Moses لمعظم اللغات، أو [GPT](model_doc/openai-gpt) الذي يستخدم spaCy و ftfy، لحساب تكرار كل كلمة في مجموعة بيانات التدريب.
بعد التحليل المسبق، يتم إنشاء مجموعة من الكلمات الفريدة وقد تم تحديد تكرار كل كلمة في تم تحديد بيانات التدريب. بعد ذلك، يقوم BPE بإنشاء مفردات أساسية تتكون من جميع الرموز التي تحدث في مجموعة الكلمات الفريدة ويتعلم قواعد الدمج لتشكيل رمز جديد من رمزين من المفردات الأساسية. إنه يفعل ذلك حتى تصل المفردات إلى حجم المفردات المطلوب. لاحظ أن حجم المفردات هو فرط معلمة لتحديد قبل تدريب مُجزّئ النصوص.
Unigram هو خوارزمية توكنيز subword التي تم تقديمها في [تنظيم subword: تحسين نماذج الترجمة الشبكة العصبية
نماذج مع مرشحين subword متعددة (Kudo، 2018)](https://arxiv.org/pdf/1804.10959.pdf). على عكس BPE أو
نماذج مع مرشحين subword متعددة (Kudo، 2018)](https://huggingface.co/papers/1804.10959). على عكس BPE أو
WordPiece، يقوم Unigram بتكوين مفرداته الأساسية إلى عدد كبير من الرموز ويقللها تدريجياً للحصول على مفردات أصغر. يمكن أن تتوافق المفردات الأساسية على سبيل المثال مع جميع الكلمات المسبقة التوكنز والسلاسل الفرعية الأكثر شيوعًا. لا يتم استخدام Unigram مباشرة لأي من النماذج في المحولات، ولكنه يستخدم بالاقتران مع [SentencePiece](#sentencepiece).
في كل خطوة تدريب، يحدد خوارزمية Unigram خسارة (غالبًا ما يتم تعريفها على أنها اللوغاريتم) عبر بيانات التدريب بالنظر إلى المفردات الحالية ونموذج اللغة unigram. بعد ذلك، بالنسبة لكل رمز في المفردات، يحسب الخوارزمية مقدار زيادة الخسارة الإجمالية إذا تم إزالة الرمز من المفردات. ثم يقوم Unigram بإزالة p (مع p عادة ما تكون 10% أو 20%) في المائة من الرموز التي تكون زيادة الخسارة فيها هي الأدنى، *أي* تلك
تحتوي جميع خوارزميات توكنز الموصوفة حتى الآن على نفس المشكلة: من المفترض أن النص المدخل يستخدم المسافات لفصل الكلمات. ومع ذلك، لا تستخدم جميع اللغات المسافات لفصل الكلمات. أحد الحلول الممكنة هو استخداممعالج مسبق للغة محدد، *مثال* [XLM](model_doc/xlm) يلذي يستخدم معالجات مسبقة محددة للصينية واليابانية والتايلاندية.
لحل هذه المشكلة بشكل أعم، [SentencePiece: A simple and language independent subword tokenizer and
detokenizer for Neural Text Processing (Kudo et al.، 2018)](https://arxiv.org/pdf/1808.06226.pdf) يتعامل مع المدخلات
detokenizer for Neural Text Processing (Kudo et al.، 2018)](https://huggingface.co/papers/1808.06226) يتعامل مع المدخلات
كتدفق بيانات خام، وبالتالي يشمل المسافة في مجموعة الأحرف التي سيتم استخدامها. ثم يستخدم خوارزمية BPE أو unigram
ثم أضف ببساطة أحد `["galore_adamw"، "galore_adafactor"، "galore_adamw_8bit"]` في `optim` جنبًا إلى جنب مع `optim_target_modules`، والتي يمكن أن تكون قائمة من السلاسل أو التعبيرات النمطية regex أو المسار الكامل المطابق لأسماء الوحدات المستهدفة التي تريد تكييفها. فيما يلي مثال على النص البرمجي كامل(تأكد من `pip install trl datasets`):
يمكنك قراءة المزيد حول الطريقة في [المستودع الأصلي](https://github.com/jiaweizzhao/GaLore) أو [الورقة البحثية](https://arxiv.org/abs/2403.03507).
يمكنك قراءة المزيد حول الطريقة في [المستودع الأصلي](https://github.com/jiaweizzhao/GaLore) أو [الورقة البحثية](https://huggingface.co/papers/2403.03507).
حاليًا، يمكنك فقط تدريب الطبقات الخطية التي تعتبر طبقات GaLore وستستخدم التحلل ذو الرتبة المنخفضة للتدريب بينما سيتم تحسين الطبقات المتبقية بالطريقة التقليدية.
@ -386,37 +356,22 @@ trainer.train()
يمكنك أيضًا إجراء تحسين طبقة تلو الأخرى عن طريق إضافة `layerwise` إلى اسم المُحسِّن كما هو موضح أدناه:
@ -55,148 +55,148 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen,
<!--This list is updated automatically from the README with_make fix-copies_. Do not update manually! -->
1.**[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
1.**[ALIGN](model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig.
1.**[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1.**[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1.**[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
1.**[BEiT](model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei.
1.**[BERT](model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
1.**[BERT For Sequence Generation](model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1.**[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://huggingface.co/papers/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
1.**[ALIGN](model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://huggingface.co/papers/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig.
1.**[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://huggingface.co/papers/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1.**[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://huggingface.co/papers/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1.**[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://huggingface.co/papers/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
1.**[BEiT](model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://huggingface.co/papers/2106.08254) by Hangbo Bao, Li Dong, Furu Wei.
1.**[BERT](model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://huggingface.co/papers/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
1.**[BERT For Sequence Generation](model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://huggingface.co/papers/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1.**[BERTweet](model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen.
1.**[BigBird-Pegasus](model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
1.**[BigBird-RoBERTa](model_doc/big_bird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
1.**[Blenderbot](model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1.**[BlenderbotSmall](model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1.**[BigBird-Pegasus](model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://huggingface.co/papers/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
1.**[BigBird-RoBERTa](model_doc/big_bird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://huggingface.co/papers/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
1.**[Blenderbot](model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://huggingface.co/papers/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1.**[BlenderbotSmall](model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://huggingface.co/papers/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1.**[BLOOM](model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/).
1.**[BORT](model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
1.**[ByT5](model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
1.**[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1.**[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1.**[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1.**[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1.**[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
1.**[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
1.**[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie.
1.**[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
1.**[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
1.**[CvT](model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang.
1.**[Data2Vec](model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
1.**[DeBERTa](model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1.**[DeBERTa-v2](model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1.**[Decision Transformer](model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
1.**[DeiT](model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
1.**[DETR](model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
1.**[DialoGPT](model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
1.**[DistilBERT](model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers-research-projects/tree/main/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers-research-projects/tree/main/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers-research-projects/tree/main/distillation) and a German version of DistilBERT.
1.**[DiT](model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
1.**[DPR](model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1.**[DPT](master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
1.**[EfficientNet](model_doc/efficientnet)** (from Google Research) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan and Quoc V. Le.
1.**[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1.**[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1.**[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1.**[FLAVA](model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
1.**[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1.**[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1.**[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1.**[BORT](model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://huggingface.co/papers/2010.10499) by Adrian de Wynter and Daniel J. Perry.
1.**[ByT5](model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://huggingface.co/papers/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
1.**[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://huggingface.co/papers/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1.**[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://huggingface.co/papers/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1.**[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://huggingface.co/papers/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1.**[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://huggingface.co/papers/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
1.**[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://huggingface.co/papers/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
1.**[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://huggingface.co/papers/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.
1.**[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://huggingface.co/papers/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie.
1.**[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface.co/papers/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
1.**[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://huggingface.co/papers/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
1.**[CvT](model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://huggingface.co/papers/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang.
1.**[Data2Vec](model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://huggingface.co/papers/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
1.**[DeBERTa](model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://huggingface.co/papers/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1.**[DeBERTa-v2](model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://huggingface.co/papers/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1.**[Decision Transformer](model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://huggingface.co/papers/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
1.**[DeiT](model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://huggingface.co/papers/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
1.**[DETR](model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://huggingface.co/papers/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
1.**[DialoGPT](model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://huggingface.co/papers/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
1.**[DistilBERT](model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://huggingface.co/papers/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers-research-projects/tree/main/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers-research-projects/tree/main/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers-research-projects/tree/main/distillation) and a German version of DistilBERT.
1.**[DiT](model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://huggingface.co/papers/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
1.**[DPR](model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://huggingface.co/papers/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1.**[DPT](master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://huggingface.co/papers/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
1.**[EfficientNet](model_doc/efficientnet)** (from Google Research) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://huggingface.co/papers/1905.11946) by Mingxing Tan and Quoc V. Le.
1.**[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://huggingface.co/papers/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1.**[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://huggingface.co/papers/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1.**[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://huggingface.co/papers/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
1.**[FLAVA](model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://huggingface.co/papers/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
1.**[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://huggingface.co/papers/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1.**[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://huggingface.co/papers/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1.**[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://huggingface.co/papers/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1.**[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
1.**[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
1.**[GPT NeoX](model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach
1.**[GPT NeoX](model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://huggingface.co/papers/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach
1.**[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.
1.**[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1.**[GPTSAN-japanese](model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama).
1.**[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1.**[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1.**[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1.**[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://huggingface.co/papers/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1.**[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://huggingface.co/papers/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1.**[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://huggingface.co/papers/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1.**[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1.**[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1.**[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1.**[LayoutLMv3](model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
1.**[LayoutXLM](model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
1.**[LED](model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1.**[LeViT](model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze.
1.**[Longformer](model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1.**[LongT5](model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
1.**[LUKE](model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
1.**[LXMERT](model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1.**[M-CTC-T](model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
1.**[M2M100](model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1.**[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://huggingface.co/papers/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
1.**[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://huggingface.co/papers/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.
1.**[LayoutLMv3](model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://huggingface.co/papers/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
1.**[LayoutXLM](model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://huggingface.co/papers/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
1.**[LED](model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://huggingface.co/papers/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1.**[LeViT](model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://huggingface.co/papers/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze.
1.**[Longformer](model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://huggingface.co/papers/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1.**[LongT5](model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://huggingface.co/papers/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang.
1.**[LUKE](model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://huggingface.co/papers/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
1.**[LXMERT](model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://huggingface.co/papers/1908.07490) by Hao Tan and Mohit Bansal.
1.**[M-CTC-T](model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://huggingface.co/papers/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert.
1.**[M2M100](model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://huggingface.co/papers/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1.**[MarianMT](model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1.**[Mask2Former](model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
1.**[MaskFormer](model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
1.**[mBART](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1.**[mBART-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1.**[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1.**[Megatron-GPT2](model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1.**[mLUKE](model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
1.**[MobileBERT](model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1.**[MobileViT](model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
1.**[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1.**[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1.**[MVP](model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
1.**[Nezha](model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1.**[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1.**[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1.**[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1.**[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1.**[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1.**[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1.**[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1.**[Mask2Former](model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://huggingface.co/papers/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.
1.**[MaskFormer](model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://huggingface.co/papers/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
1.**[mBART](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://huggingface.co/papers/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1.**[mBART-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://huggingface.co/papers/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1.**[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://huggingface.co/papers/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1.**[Megatron-GPT2](model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://huggingface.co/papers/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
1.**[mLUKE](model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://huggingface.co/papers/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
1.**[MobileBERT](model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://huggingface.co/papers/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1.**[MobileViT](model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://huggingface.co/papers/2110.02178) by Sachin Mehta and Mohammad Rastegari.
1.**[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://huggingface.co/papers/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1.**[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://huggingface.co/papers/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1.**[MVP](model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://huggingface.co/papers/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
1.**[Nezha](model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://huggingface.co/papers/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1.**[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://huggingface.co/papers/2207.04672) by the NLLB team.
1.**[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://huggingface.co/papers/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1.**[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://huggingface.co/papers/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1.**[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://huggingface.co/papers/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1.**[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://huggingface.co/papers/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1.**[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://huggingface.co/papers/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1.**[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://huggingface.co/papers/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1.**[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1.**[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1.**[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1.**[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1.**[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
1.**[RAG](model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
1.**[REALM](model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.
1.**[Reformer](model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
1.**[RegNet](model_doc/regnet)** (from META Platforms) released with the paper [Designing Network Design Space](https://arxiv.org/abs/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár.
1.**[RemBERT](model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
1.**[ResNet](model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
1.**[RoBERTa](model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
1.**[RoFormer](model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1.**[SegFormer](model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1.**[SEW](model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1.**[SEW-D](model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1.**[SpeechToTextTransformer](model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
1.**[SpeechToTextTransformer2](model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
1.**[Splinter](model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1.**[SqueezeBERT](model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1.**[Swin Transformer](model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1.**[Swin Transformer V2](model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
1.**[T5](model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1.**[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://huggingface.co/papers/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1.**[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://huggingface.co/papers/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1.**[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://huggingface.co/papers/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1.**[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://huggingface.co/papers/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
1.**[RAG](model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://huggingface.co/papers/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
1.**[REALM](model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://huggingface.co/papers/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.
1.**[Reformer](model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://huggingface.co/papers/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
1.**[RegNet](model_doc/regnet)** (from META Platforms) released with the paper [Designing Network Design Space](https://huggingface.co/papers/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár.
1.**[RemBERT](model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://huggingface.co/papers/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
1.**[ResNet](model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://huggingface.co/papers/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
1.**[RoBERTa](model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://huggingface.co/papers/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
1.**[RoFormer](model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://huggingface.co/papers/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1.**[SegFormer](model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://huggingface.co/papers/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1.**[SEW](model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://huggingface.co/papers/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1.**[SEW-D](model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://huggingface.co/papers/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1.**[SpeechToTextTransformer](model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://huggingface.co/papers/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
1.**[SpeechToTextTransformer2](model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://huggingface.co/papers/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
1.**[Splinter](model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://huggingface.co/papers/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
1.**[SqueezeBERT](model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://huggingface.co/papers/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1.**[Swin Transformer](model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://huggingface.co/papers/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1.**[Swin Transformer V2](model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://huggingface.co/papers/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
1.**[T5](model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://huggingface.co/papers/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1.**[T5v1.1](model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1.**[TAPAS](model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1.**[TAPEX](model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
1.**[Trajectory Transformer](model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine
1.**[Transformer-XL](model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1.**[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1.**[UL2](model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1.**[TAPAS](model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://huggingface.co/papers/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
1.**[TAPEX](model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://huggingface.co/papers/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
1.**[Trajectory Transformer](model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://huggingface.co/papers/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine
1.**[Transformer-XL](model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://huggingface.co/papers/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1.**[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://huggingface.co/papers/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1.**[UL2](model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://huggingface.co/papers/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
1.**[UMT5](model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
1.**[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1.**[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1.**[VAN](model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1.**[VideoMAE](model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
1.**[ViLT](model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1.**[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1.**[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1.**[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1.**[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1.**[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1.**[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1.**[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1.**[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1.**[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1.**[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1.**[XLM-RoBERTa](model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
1.**[XLM-RoBERTa-XL](model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
1.**[XLM-V](model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa.
1.**[XLNet](model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
1.**[XLS-R](model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
1.**[XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
1.**[YOLOS](model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.
1.**[YOSO](model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.
1.**[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://huggingface.co/papers/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1.**[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://huggingface.co/papers/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1.**[VAN](model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://huggingface.co/papers/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1.**[VideoMAE](model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://huggingface.co/papers/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
1.**[ViLT](model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://huggingface.co/papers/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1.**[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://huggingface.co/papers/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1.**[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://huggingface.co/papers/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1.**[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://huggingface.co/papers/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1.**[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://huggingface.co/papers/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1.**[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://huggingface.co/papers/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1.**[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://huggingface.co/papers/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1.**[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://huggingface.co/papers/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1.**[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://huggingface.co/papers/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1.**[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://huggingface.co/papers/1901.07291) by Guillaume Lample and Alexis Conneau.
1.**[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://huggingface.co/papers/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1.**[XLM-RoBERTa](model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://huggingface.co/papers/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
1.**[XLM-RoBERTa-XL](model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://huggingface.co/papers/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
1.**[XLM-V](model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://huggingface.co/papers/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa.
1.**[XLNet](model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://huggingface.co/papers/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
1.**[XLS-R](model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://huggingface.co/papers/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
1.**[XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://huggingface.co/papers/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
1.**[YOLOS](model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://huggingface.co/papers/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu.
1.**[YOSO](model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://huggingface.co/papers/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh.
@ -56,7 +56,7 @@ Dateien lassen sich auch in einem Repository leicht bearbeiten, und Sie können
Bevor Sie ein Modell für den Hub freigeben, benötigen Sie Ihre Hugging Face-Anmeldedaten. Wenn Sie Zugang zu einem Terminal haben, führen Sie den folgenden Befehl in der virtuellen Umgebung aus, in der 🤗 Transformers installiert ist. Dadurch werden Ihre Zugangsdaten in Ihrem Hugging Face-Cache-Ordner (standardmäßig `~/.cache/`) gespeichert:
```bash
huggingface-cli login
hf auth login
```
Wenn Sie ein Notebook wie Jupyter oder Colaboratory verwenden, stellen Sie sicher, dass Sie die [`huggingface_hub`](https://huggingface.co/docs/hub/adding-a-library) Bibliothek installiert haben. Diese Bibliothek ermöglicht Ihnen die programmatische Interaktion mit dem Hub.
Wenn Sie andere PEFT-Methoden, wie z.B. Prompt Learning oder Prompt Tuning, verwenden möchten, oder über die 🤗 PEFT-Bibliothek im Allgemeinen, lesen Sie bitte die [Dokumentation](https://huggingface.co/docs/peft/index).
Alle Skripte können Ihr endgültiges Modell in den [Model Hub](https://huggingface.co/models) hochladen. Stellen Sie sicher, dass Sie bei Hugging Face angemeldet sind, bevor Sie beginnen:
```bash
huggingface-cli login
hf auth login
```
Dann fügen Sie dem Skript das Argument `push_to_hub` hinzu. Mit diesem Argument wird ein Repository mit Ihrem Hugging Face-Benutzernamen und dem in `output_dir` angegebenen Ordnernamen erstellt.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Accelerator selection
During distributed training, you can specify the number and order of accelerators (CUDA, XPU, MPS, HPU, etc.) to use. This can be useful when you have accelerators with different computing power and you want to use the faster accelerator first. Or you could only use a subset of the available accelerators. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
This guide will show you how to select the number of accelerators to use and the order to use them in.
## Number of accelerators
For example, if there are 4 accelerators and you only want to use the first 2, run the command below.
<hfoptionsid="select-accelerator">
<hfoptionid="torchrun">
Use the `--nproc_per_node` to select how many accelerators to use.
To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
You can also control the order of Intel XPUs with:
```bash
export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
```
For more information about device enumeration and sorting on Intel XPU, please refer to the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
</hfoption>
</hfoptions>
> [!WARNING]
> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
@ -13,7 +13,7 @@ rendered properly in your Markdown viewer.
-->
# Adding a new model to Transformers
# Legacy model contribution
> [!TIP]
> Try adding new models with a more [modular](./modular_transformers) approach first. This makes it significantly easier to contribute a model to Transformers!
@ -161,7 +161,7 @@ The downside is that if you aren't used to them, it may take some time to get us
Run the command below to start and complete the questionnaire with some basic information about the new model. This command jumpstarts the process by automatically generating some model code that you'll need to adapt.
```bash
transformers-cli add-new-model-like
transformers add-new-model-like
```
## Create a pull request
@ -292,7 +292,7 @@ Once you're able to run the original checkpoint, you're ready to start adapting
## Adapt the model code
The `transformers-cli add-new-model-like` command should have generated a model and configuration file.
The `transformers add-new-model-like` command should have generated a model and configuration file.
@ -551,10 +551,10 @@ While this example doesn't include an image processor, you may need to implement
If you do need to implement a new image processor, refer to an existing image processor to understand the expected structure. Slow image processors ([`BaseImageProcessor`]) and fast image processors ([`BaseImageProcessorFast`]) are designed differently, so make sure you follow the correct structure based on the processor type you're implementing.
Run the following command (only if you haven't already created the fast image processor with the `transformers-cli add-new-model-like` command) to generate the necessary imports and to create a prefilled template for the fast image processor. Modify the template to fit your model.
Run the following command (only if you haven't already created the fast image processor with the `transformers add-new-model-like` command) to generate the necessary imports and to create a prefilled template for the fast image processor. Modify the template to fit your model.
This command will generate the necessary imports and provide a pre-filled template for the fast image processor. You can then modify it to fit your model's needs.
@ -571,7 +571,7 @@ The processor should call the appropriate modality-specific processors within it
@ -14,5 +14,9 @@ rendered properly in your Markdown viewer.
-->
# Agents
(deprecated)
> [!WARNING]
> Agents and tools were spun out into the standalone [smolagents](https://huggingface.co/docs/smolagents/index) library. They were removed from `transformers` in v4.52.
and it will stop printing the statements, as it now uses the `sdpa` attention.
This allows to quickly change an attention function, without needing to reload the model!
## Different attention per backbone in multimodal models
For multimodal models different attention functions may work better for each backbone module. For example, some vision backbones perform better in fp32, but are incompatible with FlashAttention. To continue using FlashAttention while keeping the vision encoder in fp32, create a dict and map each config to an attention implementation as shown below.
The reason you have to register it is because we need to automatically correct your mask format based on the attention implementation (for example, flex attention uses a BlockMask format, while sdpa uses a 4D tensor).
By default, if you do not register an attention mask function along with your attention function, mask creation will be skipped
and `attention_mask=None` will be passed along to the Attention layers.
The default signature of the attention mask functions is the following:
**kwargs,# a few additional args may be passed as kwargs, especially the model's config is always passed
)->Optional[torch.Tensor]:
```
It mostly works thanks to the `mask_function`, which is a `Callable` in the form of [torch's mask_mod functions](https://pytorch.org/blog/flexattention/), taking 4 indices as input and returning a boolean to indicate if this position should take part in the attention computation.
If you cannot use the `mask_function` to create your mask for some reason, you can try to work around it by doing something similar to our [torch export workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py).
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Documenting a model
The `@auto_docstring` decorator in Transformers generates consistent docstrings for model classes and their methods. It reduces boilerplate by automatically including standard argument descriptions while also allowing overrides to add new or custom arguments. [Contributing a new model](./modular_transformers) is easier because you don't need to manually add the standard docstrings, and only focus on documenting new arguments.
This guide describes how to use the `@auto_docstring` decorator and how it works.
## @auto_docstring
Start by importing the decorator in the modeling file (`modular_model.py` or `modeling_model.py`).
```python
from...utilsimportauto_docstring
```
Select whether you'd like to apply `@auto_docstring` to a class or function below to see how to use it.
<hfoptionsid="type">
<hfoptionid="classes">
Place `@auto_docstring` directly above the class definition. The decorator derives parameter descriptions from the `__init__` method's signature and docstring.
custom_parameter (`int`, *optional*, defaults to 10):
Description of the custom_parameter for MyAwesomeModel.
another_custom_arg (`str`, *optional*, defaults to "default"):
Documentation for another unique argument.
"""
super().__init__(config)
self.custom_parameter=custom_parameter
self.another_custom_arg=another_custom_arg
# ... rest of your init
# ... other methods
```
Arguments can also be passed directly to `@auto_docstring` for more control. Use the `custom_intro` parameter to describe the argument and the `custom_args` parameter to describe the arguments.
```python
@auto_docstring(
custom_intro="""This model performs specific synergistic operations.
It builds upon the standard Transformer architecture with unique modifications.""",
custom_args="""
custom_parameter (`type`, *optional*, defaults to `default_value`):
A concise description for custom_parameter if not defined or overriding the description in `auto_docstring.py`.
internal_helper_arg (`type`, *optional*, defaults to `default_value`):
A concise description for internal_helper_arg if not defined or overriding the description in `auto_docstring.py`.
custom_parameter (`type`, *optional*, defaults to `default_value`):
A concise description for custom_parameter if not defined or overriding the description in `auto_docstring.py`.
internal_helper_arg (`type`, *optional*, defaults to `default_value`):
A concise description for internal_helper_arg if not defined or overriding the description in `auto_docstring.py`.
"""
# ...
```
You should also use the `@auto_docstring` decorator for classes that inherit from [`~utils.ModelOutput`].
```python
@dataclass
@auto_docstring(
custom_intro="""
Custom model outputs with additional fields.
"""
)
classMyModelOutput(ImageClassifierOutput):
r"""
loss (`torch.FloatTensor`, *optional*):
The loss of the model.
custom_field (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*):
A custom output field specific to this model.
"""
# Standard fields like hidden_states, logits, attentions etc. can be automatically documented if the description is the same as the standard arguments.
# However, given that the loss docstring is often different per model, you should document it in the docstring above.
)->Union[Tuple,ModelOutput]:# The description of the return value will automatically be generated from the ModelOutput class docstring.
r"""
new_custom_argument (`torch.Tensor`, *optional*):
Description of this new custom argument and its expected shape or type.
"""
# ...
```
Arguments can also be passed directly to `@auto_docstring` for more control. Use the `custom_intro` parameter to describe the argument and the `custom_args` parameter to describe the arguments.
The `Returns` and `Examples` parts of the docstring can also be manually specified.
```python
MODEL_COMMON_CUSTOM_ARGS=r"""
common_arg_1 (`torch.Tensor`, *optional*, defaults to `default_value`):
Description of common_arg_1
common_arg_2 (`torch.Tensor`, *optional*, defaults to `default_value`):
Description of an argument specific to this function
Returns:
`torch.Tensor`: For a function returning a generic type, a custom "Returns" section can be specified.
Example:
(To override the default example with a custom one or to add an example for a model class that does not have a pipeline)
```python
...
```
"""
# ...
```
</hfoption>
</hfoptions>
## Documenting arguments
There are some rules for documenting different types of arguments and they're listed below.
- Standard arguments (`input_ids`, `attention_mask`, `pixel_values`, etc.) are defined and retrieved from `auto_docstring.py`. It is the single source of truth for standard arguments and should not be redefined locally if an argument's description and shape is the same as an argument in `auto_docstring.py`.
If a standard argument behaves differently in your model, then you can override it locally in a `r""" """` block. This local definition has a higher priority. For example, the `labels` argument is often customized per model and typically requires overriding.
- New or custom arguments should be documented within an `r""" """` block after the signature if it is a function or in the `__init__` method's docstring if it is a class.
```py
argument_name (`type`, *optional*, defaults to `X`):
Description of the argument.
Explain its purpose, expected shape/type if complex, and default behavior.
This can span multiple lines.
```
* Include `type` in backticks.
* Add *optional* if the argument is not required or has a default value.
* Add "defaults to X" if it has a default value. You don't need to add "defaults to `None`" if the default value is `None`.
These arguments can also be passed to `@auto_docstring` as a `custom_args` argument. It is used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.
```py
class MyModel(PreTrainedModel):
# ...
@auto_docstring(
custom_intro="""
This is a custom introduction for the function.
"""
custom_args=r"""
common_arg_1 (`torch.Tensor`, *optional*, defaults to `default_value`):
Description of common_arg_1
"""
)
```
## Checking the docstrings
Transformers includes a utility script to validate the docstrings when you open a Pull Request which triggers CI (continuous integration) checks. The script checks for the following criteria.
* Ensures `@auto_docstring` is applied to relevant mode classes and public methods.
* Ensures arguments are complete and consistent. It checks that documented arguments exist in the signature and verifies whether the types and default values in the docstring match the signature. Arguments that aren't known standard arguments or if they lack a local description are flagged.
* Reminds you to complete placeholders like `<fill_type>` and `<fill_docstring>`.
* Ensures docstrings are formatted according to the expected docstring style.
You can run this check locally - before committing - by running the following command.
```bash
make fix-copies
```
`make fix-copies` runs several other checks as well. If you don't need those checks, run the command below to only perform docstring and auto-docstring checks.
```bash
python utils/check_docstrings.py # to only check files included in the diff without fixing them
# python utils/check_docstrings.py --fix_and_overwrite # to fix and overwrite the files in the diff
# python utils/check_docstrings.py --fix_and_overwrite --check_all # to fix and overwrite all files
```
## modular_model.py files
When working with modular files (`modular_model.py`), follow the guidelines below for applying `@auto_docstring`.
- For standalone models in modular files, apply `@auto_docstring` like you would in a `modeling_model.py` file.
- For models that inherit from other library models, `@auto_docstring` is automatically carried over to the generated modeling file. You don't need to add `@auto_docstring` in your modular file.
If you need to modify the `@auto_docstring` behavior, apply the customized decorator in your modular file. Make sure to **include all other decorators** that are present in the original function or class.
> [!WARNING]
> When overriding any decorator in a modular file, you must include **all** decorators that were applied to that function or class in the parent model. If you only override some decorators, the others won't be included in the generated modeling file.
## How it works
The `@auto_docstring` decorator automatically generates docstrings by:
1. Inspecting the signature (arguments, types, defaults) of the decorated class' `__init__` method or the decorated function.
2. Retrieving the predefined docstrings for common arguments (`input_ids`, `attention_mask`, etc.) from internal library sources like [`ModelArgs`], [`ImageProcessorArgs`], and the `auto_docstring.py` file.
3. Adding argument descriptions in one of two ways as shown below.
| method | description | usage |
|---|---|---|
| `r""" """` | add custom docstring content directly to a method signature or within the `__init__` docstring | document new arguments or override standard descriptions |
| `custom_args` | add custom docstrings for specific arguments directly in `@auto_docstring` | define docstring for new arguments once if they're repeated in multiple places in the modeling file |
4. Adding class and function descriptions. For model classes with standard naming patterns, like `ModelForCausalLM`, or if it belongs to a pipeline, `@auto_docstring` automatically generates the appropriate descriptions with `ClassDocstring` from `auto_docstring.py`.
`@auto_docstring` also accepts the `custom_intro` argument to describe a class or function.
5. Using a templating system to allow predefined docstrings to include dynamic information from Transformers' [auto_modules](https://github.com/huggingface/transformers/tree/main/src/transformers/models/auto) such as `{{processor_class}}` and `{{config_class}}`.
6. Finding appropriate usage examples based on the model's task or pipeline compatibility. It extracts checkpoint information form the model's configuration class to provide concrete examples with real model identifiers.
7. Adding return values to the docstring. For methods like `forward`, the decorator automatically generates the `Returns` field in the docstring based on the method's return type annotation.
For example, if a method returns a [`~transformers.utils.ModelOutput`] subclass, `@auto_docstring` extracts the field descriptions from the class' docstring to create a comprehensive return value description. You can also manually specifiy a custom `Returns` field in a functions docstring.
8. Unrolling kwargs typed with the unpack operator. For specific methods (defined in `UNROLL_KWARGS_METHODS`) or classes (defined in `UNROLL_KWARGS_CLASSES`), the decorator processes `**kwargs` parameters that are typed with `Unpack[KwargsTypedDict]`. It extracts the documentations from the `TypedDict` and adds each parameter to the function's docstring.
Currently only supported for [`FastImageProcessorKwargs`].
## Best practices
Follow the best practices below to help maintain consistent and informative documentation for Transformers!
* Use `@auto_docstring` for new PyTorch model classes ([`PreTrainedModel`] subclasses) and their primary methods like `forward` or `get_text_features`.
* For classes, `@auto_docstring` retrieves parameter descriptions from the `__init__` method's docstring.
* Rely on standard docstrings and do not redefine common arguments unless their behavior is different in your model.
@ -15,8 +15,7 @@ rendered properly in your Markdown viewer.
-->
# Caching
Imagine you’re having a conversation with someone, and instead of remembering what they previously said, they have to start from scratch every time you respond. This would be slow and inefficient, right?
Imagine you're having a conversation with someone, and instead of remembering what they previously said, they have to start from scratch every time you respond. This would be slow and inefficient, right?
You can extend this analogy to transformer models. Autoregressive model generation can be slow because it makes a prediction one token at a time. Each new prediction is dependent on all the previous context.
@ -29,8 +28,50 @@ A key-value (KV) cache eliminates this inefficiency by storing kv pairs derived
> [!WARNING]
> Caching should only be used for **inference**. It may cause unexpected errors if it's enabled during training.
To better understand how and why caching works, let's take a closer look at the structure of the attention matrices.
## Attention matrices
The **scaled dot-product attention** is calculated as shown below for a batch of size `b`, number of attention heads `h`, sequence length so far `T`, and dimension per attention head `d_head`.
The query (`Q`), key (`K`), and value (`V`) matrices are projections from the input embeddings of shape `(b, h, T, d_head)`.
For causal attention, the mask prevents the model from attending to future tokens. Once a token is processed, its representation never changes with respect to future tokens, which means \\( K_{\text{past}} \\) and \\( V_{\text{past}} \\) can be cached and reused to compute the last token's representation.
At inference time, you only need the last token's query to compute the representation \\( x_t \\) that predicts the next token \\( t+1 \\). At each step, the new key and value vectors are **stored** in the cache and **appended** to the past keys and values.
Attention is calculated independently in each layer of the model, and caching is done on a per-layer basis.
Refer to the table below to compare how caching improves efficiency.
| without caching | with caching |
|---|---|
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V`
| attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |
## Cache class
A basic KV cache interface takes a key and value tensor for the current token and returns the updated `K` and `V` tensors. This is internally managed by a model's `forward` method.
```py
new_K,new_V=cache.update(k_t,v_t,layer_idx)
attn_output=attn_layer_idx_fn(q_t,new_K,new_V)
```
When you use Transformers' [`Cache`] class, the self-attention module performs several critical steps to integrate past and present information.
1. The attention module concatenates current kv pairs with past kv pairs stored in the cache. This creates attentions weights with the shape `(new_tokens_length, past_kv_length + new_tokens_length)`. The current and past kv pairs are essentially combined to compute the attention scores, ensuring a model is aware of previous context and the current input.
@ -39,6 +80,21 @@ When you use Transformers' [`Cache`] class, the self-attention module performs s
3. It is also important to be aware of the `cache_position`. This is important if you want to reuse a prefilled [`Cache`] with the `forward` method because you have to pass a valid `cache_position` value. This indicates the input positions in a sequence. `cache_position` is unaffected by padding, and it always adds one more position for each token. For example, if a kv cache contains 10 tokens - regardless of pad tokens - the cache position for the next token should be `torch.tensor([10])`.
## Cache storage implementation
Caches are structured as a list of layers, where each layer contains a key and value cache. The key and value caches are tensors with the shape `[batch_size, num_heads, seq_len, head_dim]`.
Layers can be of different types (e.g. `DynamicLayer`, `StaticLayer`, `SlidingWindowLayer`), which mostly changes how sequence length is handled and how the cache is updated.
The simplest is a `DynamicLayer` that grows as more tokens are processed. The sequence length dimension (`seq_len`) increases with each new token:
Other layer types like `StaticLayer` and `SlidingWindowLayer` have a fixed sequence length that is set when the cache is created. This makes them compatible with `torch.compile`. In the case of `SlidingWindowLayer`, existing tokens are shifted out of the cache when a new token is added.
The example below demonstrates how to create a generation loop with [`DynamicCache`]. As discussed, the attention mask is a concatenation of past and current token values and `1` is added to the cache position for the next token.
"[INST] Hello, what's your name. [/INST] Hello! My name is LLaMA,"
```
## Cache position
The cache position tracks where to insert new tokens in the attention cache. It represents the *absolute* position of each token in the context, independent of padding or batch structure. Suppose you already cached `N` tokens and are now processing `K` new tokens. The cache position for the new tokens will range from `N` to `N + K - 1`. In other words, you're processing tokens at positions - `[N, N + 1, N + 2, ..., N + K - 1]`.
Cache position is used internally for two purposes:
1. Selecting new tokens to process in the input sequence and ensuring only tokens that haven’t been cached yet are passed to the model's `forward`.
2. Storing key/value pairs at the correct positions in the cache. This is especially important for fixed-size caches, like [`StaticCache`], that pre-allocates a specific cache length.
The generation loop usually takes care of the cache position, but if you're writing a custom generation method, it is important that cache positions are accurate since they are used to write and read key/value states into fixed slots.
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format has is dynamic because it grows as text is generated, similar to [`DynamicCache`].
Before the [`Cache`] class, the cache used to be stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to [`DynamicCache`].
If your project depends on this legacy format, you can convert between [`DynamicCache`] and a tuple of tuples as shown below with the [`~DynamicCache.from_legacy_cache`] and [`DynamicCache.to_legacy_cache`] functions. This is helpful if you have custom logic for manipulating a cache in a specific format.
The legacy format is essentially the same data structure but organized differently.
- It's a tuple of tuples, where each inner tuple contains the key and value tensors for a layer.
- The tensors have the same shape `[batch_size, num_heads, seq_len, head_dim]`.
- The format is less flexible and doesn't support features like quantization or offloading.
If your project depends on this legacy format, we recommend to convert to [`DynamicCache`] with [`~DynamicCache.from_legacy_cache`]. Note that legacy cache format is deprecated and not used anymore in `Transformers`. You can convert back to tuple format with [`DynamicCache.to_legacy_cache`] functions, which is helpful if you have custom logic for manipulating a cache in a specific format.
@ -25,25 +25,32 @@ Check model leaderboards like [OpenLLM](https://hf.co/spaces/HuggingFaceH4/open_
This guide shows you how to quickly start chatting with Transformers from the command line, how build and format a conversation, and how to chat using the [`TextGenerationPipeline`].
## transformers-cli
## chat CLI
Chat with a model directly from the command line as shown below. It launches an interactive session with a model. Enter `clear` to reset the conversation, `exit` to terminate the session, and `help` to display all the command options.
After you've [installed Transformers](./installation.md), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session.
For a full list of options, run the command below.
```bash
transformers-cli chat -h
transformers chat -h
```
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating).
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving.md#serve-cli)).
@ -20,11 +20,15 @@ A decoding strategy informs how a model should select the next generated token.
This guide will help you understand the different decoding strategies available in Transformers and how and when to use them.
## Greedy search
## Basic decoding methods
Greedy search is the default decoding strategy. It selects the next most likely token at each step. Unless specified in [`GenerationConfig`], this strategy generates a maximum of 20 tokens.
These are well established decoding methods, and should be your starting point for text generation tasks.
Greedy search works well for tasks with relatively short outputs. However, it breaks down when generating longer sequences because it begins to repeat itself.
### Greedy search
Greedy search is the default decoding strategy. It selects the next most likely token at each step. Unless specified in [`GenerationConfig`], this strategy generates a maximum of 20 new tokens.
Greedy search works well for tasks with relatively short outputs where creativity is not a priority. However, it breaks down when generating longer sequences because it begins to repeat itself.
'Hugging Face is an open-source company that provides a suite of tools and services for building, deploying, and maintaining natural language processing'
```
## Contrastive search
### Sampling
[Contrastive search](https://huggingface.co/papers/2202.06417) is a decoding strategy that aims to reduce repetition even while generating longer sequences. This strategy compares how similar a generated token is against previous tokens, and if they're more similar, a penalty is applied.
Sampling, or multinomial sampling, randomly selects a token based on the probability distribution over the entire model's vocabulary (as opposed to the most likely token, as in greedy search). This means every token with a non-zero probability has a chance to be selected. Sampling strategies reduce repetition and can generate more creative and diverse outputs.
Enable contrastive search with the `penalty_alpha` and `top_k` parameters. The `penalty_alpha` manages the penalty applied and `top_k` is the number of most likely tokens to return.
Enable multinomial sampling with `do_sample=True` and `num_beams=1`.
```py
importtorch
@ -55,14 +59,14 @@ inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt"
'Hugging Face is an open-source company that provides a platform for building and deploying AI models.\nHugging Face is an open-source company that provides a platform for building and deploying AI models. The platform allows developers to build and deploy AI models, as well as collaborate with other developers.\nHugging Face was founded in 2019 by Thibault Wittemberg and Clément Delangue. The company is based in Paris, France.\nHugging Face has'
'Hugging Face is an open-source company 🤗\nWe are open-source and believe that open-source is the best way to build technology. Our mission is to make AI accessible to everyone, and we believe that open-source is the best way to achieve that.'
```
## Beam search
### Beam search
Beam search keeps track of several generated sequences (beams) at each time step. After a certain number of steps, it selects the sequence with the highest *overall* probability. Unlike greedy search, this strategy can "look ahead" and pick a sequence with a higher probability overall even if the initial tokens have a lower probability.
Beam search keeps track of several generated sequences (beams) at each time step. After a certain number of steps, it selects the sequence with the highest *overall* probability. Unlike greedy search, this strategy can "look ahead" and pick a sequence with a higher probability overall even if the initial tokens have a lower probability. It is best suited for input-grounded tasks, like describing an image or speech recognition. You can also use `do_sample=True` with beam search to sample at each step, but beam search will still greedily prune out low probability sequences between steps.
> [!TIP]
> Check out the [beam search visualizer](https://huggingface.co/spaces/m-ric/beam_search_visualizer) to see how beam search works.
"['Hugging Face is an open-source company that develops and maintains the Hugging Face platform, which is a collection of tools and libraries for building and deploying natural language processing (NLP) models. Hugging Face was founded in 2018 by Thomas Wolf']"
```
## Diverse beam search
## Advanced decoding methods
[Diverse beam search](https://hf.co/papers/1610.02424) is a variant of beam search that produces more diverse output candidates to choose from. This strategy measures the dissimilarity of sequences and a penalty is applied if sequences are too similar. To avoid high computation costs, the number of beams is divided into groups.
Advanced decoding methods aim at either tackling specific generation quality issues (e.g. repetition) or at improving the generation throughput in certain situations. These techniques are more complex, and may not work correctly with all models.
Enable diverse beam search with the `num_beams`, `num_beam_groups` and `diversity_penalty` parameters (the `num_beams` parameter should be divisible by `num_beam_groups`).
'Hugging Face is an open-source company 🤗\nWe are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a'
```
## Multinomial sampling
Search methods selects the most likely tokens. Sampling, or multinomial sampling, randomly selects a token based on the probability distribution over the entire models vocabulary. This means every token with a non-zero probability has a chance to be selected. Sampling strategies reduce repetition and can generate more creative and diverse outputs.
Enable multinomial sampling with `do_sample=True` and `num_beams=1`.
'Hugging Face is an open-source company 🤗\nWe are open-source and believe that open-source is the best way to build technology. Our mission is to make AI accessible to everyone, and we believe that open-source is the best way to achieve that.'
```
## Beam search multinomial sampling
This decoding strategy is a combination of beam search and multinomial sampling. It generates multiple beams and uses a sampling strategy for each beam.
Enable beam search multinomial sampling by setting `num_beams` to a value greater than 1 and `do_sample=True`.
'Hugging Face is an open-source company 100% dedicated to making AI more accessible. We believe that AI should be available to everyone, and we’re working hard to make that a reality.\nWe’re a team of passionate engineers, designers,'
```
## Speculative decoding
### Speculative decoding
[Speculative](https://hf.co/papers/2211.17192) or assistive decoding isn't a search or sampling strategy. Instead, speculative decoding adds a second smaller model to generate candidate tokens. The main model verifies the candidate tokens in a single `forward` pass, which speeds up the decoding process overall. This method is especially useful for LLMs where it can be more costly and slower to generate tokens. Refer to the [speculative decoding](./llm_optims#speculative-decoding) guide to learn more.
[Prompt lookup decoding](./llm_optims#prompt-lookup-decoding) is a variant of speculative decoding that uses overlapping n-grams as the candidate tokens. It works well for input-grounded tasks such as summarization. Refer to the [prompt lookup decoding](./llm_optims#prompt-lookup-decoding) guide to learn more.
Universal assisted decoding (UAD) enables the main and assistant models to use different tokenizers. The main models input tokens are re-encoded into assistant model tokens. Candidate tokens are generated in the assistant encoding which are re-encoded into the main model candidate tokens. The candidate tokens are verified as explained in [speculative decoding](#speculative-decoding).
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```
## DoLa
### Contrastive search
[Contrastive search](https://huggingface.co/papers/2202.06417) is a decoding strategy that aims to reduce repetition even while generating longer sequences. This strategy compares how similar a generated token is against previous tokens, and if they're more similar, a penalty is applied.
Enable contrastive search with the `penalty_alpha` and `top_k` parameters. The `penalty_alpha` manages the penalty applied and `top_k` is the number of most likely tokens to return.
'Hugging Face is an open-source company that provides a platform for building and deploying AI models.\nHugging Face is an open-source company that provides a platform for building and deploying AI models. The platform allows developers to build and deploy AI models, as well as collaborate with other developers.\nHugging Face was founded in 2019 by Thibault Wittemberg and Clément Delangue. The company is based in Paris, France.\nHugging Face has'
```
### DoLa
[Decoding by Contrasting Layers (DoLa)](https://hf.co/papers/2309.03883) is a contrastive decoding strategy for improving factuality and reducing hallucination. This strategy works by contrasting the logit differences between the final and early layers. As a result, factual knowledge localized to particular layers are amplified. DoLa is not recommended for smaller models like GPT-2.
[Diverse beam search](https://hf.co/papers/1610.02424) is a variant of beam search that produces more diverse output candidates to choose from. This strategy measures the dissimilarity of sequences and a penalty is applied if sequences are too similar. To avoid high computation costs, the number of beams is divided into groups.
Enable diverse beam search with the `num_beams`, `num_beam_groups` and `diversity_penalty` parameters (the `num_beams` parameter should be divisible by `num_beam_groups`).
'Hugging Face is an open-source company 🤗\nWe are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a'
```
## Custom decoding methods
Custom decoding methods enable specialized generation behavior such as the following:
- have the model continue thinking if it is uncertain;
- roll back generation if the model gets stuck;
- handle special tokens with custom logic;
- enhanced input preparation for advanced models;
We enable custom decoding methods through model repositories, assuming a specific model tag and file structure (see subsection below). This feature is an extension of [custom modeling code](./models.md#custom-models) and, like such, requires setting `trust_remote_code=True`.
If a model repository holds a custom decoding method, the easiest way to try it out is to load the model and generate with it:
'The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is'
```
Model repositories with custom decoding methods have a special property: their decoding method can be loaded from **any** model through [`~GenerationMixin.generate`]'s `custom_generate` argument. This means anyone can create and share their custom generation method to potentially work with any Transformers model, without requiring users to install additional Python packages.
'The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is'
```
You should read the `README.md` file of the repository containing the custom generation strategy to see what the new arguments and output type differences are, if they exist. Otherwise, you can assume it works like the base [`~GenerationMixin.generate`] method.
> [!TIP]
> You can find all custom decoding methods by [searching for their custom tag.](https://huggingface.co/models?other=custom_generate), `custom_generate`
Consider the Hub repository [transformers-community/custom_generate_example](https://huggingface.co/transformers-community/custom_generate_example) as an example. The `README.md` states that it has an additional input argument, `left_padding`, which adds a number of padding tokens before the prompt.
'<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>The quick brown fox jumps over the lazy dog.\n\nThe sentence "The quick'
```
If the custom method has pinned Python requirements that your environment doesn't meet, you'll get an exception about missing requirements. For instance, [transformers-community/custom_generate_bad_requirements](https://huggingface.co/transformers-community/custom_generate_bad_requirements) has an impossible set of requirements defined in its `custom_generate/requirements.txt` file, and you'll see the error message below if you try to run it.
```
ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
foo (installed: None)
bar==0.0.0 (installed: None)
torch>=99.0 (installed: 2.6.0)
```
Updating your Python requirements accordingly will remove this error message.
### Creating a custom decoding method
To create a new decoding method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
1. The model you've designed your decoding method with.
2.`custom_generate/generate.py`, which contains all the logic for your custom decoding method.
3.`custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
4.`README.md`, where you should add the `custom_generate` tag and document any new arguments or output type differences of your custom method here.
After you've added all required files, your repository should look like this
```
your_repo/
├── README.md # include the 'custom_generate' tag
├── config.json
├── ...
└── custom_generate/
├── generate.py
└── requirements.txt
```
#### Adding the base model
The starting point for your custom decoding method is a model repository just like any other. The model to add to this repository should be the model you've designed your method with, and it is meant to be part of a working self-contained model-generate pair. When the model in this repository is loaded, your custom decoding method will override `generate`. Don't worry -- your decoding method can still be loaded with any other Transformers model, as explained in the section above.
If you simply want to copy an existing model, you can do
This is the core of your decoding method. It *must* contain a method named `generate`, and this method *must* contain a `model` argument as its first argument. `model` is the model instance, which means you have access to all attributes and methods in the model, including the ones defined in [`GenerationMixin`] (like the base `generate` method).
> [!WARNING]
> `generate.py` must be placed in a folder named `custom_generate`, and not at the root level of the repository. The file paths for this feature are hardcoded.
Under the hood, when the base [`~GenerationMixin.generate`] method is called with a `custom_generate` argument, it first checks its Python requirements (if any), then locates the custom `generate` method in `generate.py`, and finally calls the custom `generate`. All received arguments and `model` are forwarded to your custom `generate` method, with the exception of the arguments used to trigger the custom generation (`trust_remote_code` and `custom_generate`).
This means your `generate` can have a mix of original and custom arguments (as well as a different output type) as shown below.
Follow the recommended practices below to ensure your custom decoding method works as expected.
- Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
- Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
- Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
Your custom `generate` method can relative import code from the `custom_generate` folder. For example, if you have a `utils.py` file, you can import it like this:
```py
from.utilsimportsome_function
```
Only relative imports from the same-level `custom_generate` folder are supported. Parent/sibling folder imports are not valid. The `custom_generate` argument also works locally with any directory that contains a `custom_generate` structure. This is the recommended workflow for developing your custom decoding method.
#### requirements.txt
You can optionally specify additional Python requirements in a `requirements.txt` file inside the `custom_generate` folder. These are checked at runtime and an exception will be thrown if they're missing, nudging users to update their environment accordingly.
#### README.md
The root level `README.md` in the model repository usually describes the model therein. However, since the focus of the repository is the custom decoding method, we highly recommend to shift its focus towards describing the custom decoding method. In addition to a description of the method, we recommend documenting any input and/or output differences to the original [`~GenerationMixin.generate`]. This way, users can focus on what's new, and rely on Transformers docs for generic implementation details.
For discoverability, we highly recommend you to add the `custom_generate` tag to your repository. To do so, the top of your `README.md` file should look like the example below. After you push the file, you should see the tag in your repository!
```
---
library_name: transformers
tags:
- custom_generate
---
(your markdown content here)
```
Recommended practices:
- Document input and output differences in [`~GenerationMixin.generate`].
- Add self-contained examples to enable quick experimentation.
- Describe soft-requirements such as if the method only works well with a certain family of models.
## Resources
Read the [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate) blog post for an explanation of how common decoding strategies work.
@ -163,7 +163,7 @@ The intermediate embedding size of the feed forward layers is often bigger than
For an input of size `[batch_size, sequence_length]`, the memory required to store the intermediate feed forward
embeddings `[batch_size, sequence_length, config.intermediate_size]` can account for a large fraction of the memory
use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the
use. The authors of [Reformer: The Efficient Transformer](https://huggingface.co/papers/2001.04451) noticed that since the
computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`
individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n = sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
@ -207,7 +207,7 @@ numerical representations of tokens building the sequences that will be used as
<Youtubeid="VFp38yj8h3A"/>
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer:
tokenizer, which is a [WordPiece](https://huggingface.co/papers/1609.08144) tokenizer:
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# GPU selection
During distributed training, you can specify the number of GPUs to use and in what order. This can be useful when you have GPUs with different computing power and you want to use the faster GPU first. Or you could only use a subset of the available GPUs. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
This guide will show you how to select the number of GPUs to use and the order to use them in.
## Number of GPUs
For example, if there are 4 GPUs and you only want to use the first 2, run the command below.
<hfoptionsid="select-gpu">
<hfoptionid="torchrun">
Use the `--nproc_per_node` to select how many GPUs to use.
To select specific GPUs to use and their order, configure the `CUDA_VISIBLE_DEVICES` environment variable. It is easiest to set the environment variable in `~/bashrc` or another startup config file. `CUDA_VISIBLE_DEVICES` is used to map which GPUs are used. For example, if there are 4 GPUs (0, 1, 2, 3) and you only want to run GPUs 0 and 2:
Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to `cuda:0` and `cuda:1` respectively. You can also reverse the order of the GPUs to use 2 first. The mapping becomes `cuda:1` for GPU 0 and `cuda:0` for GPU 2.
> As with any environment variable, they can be exported instead of being added to the command line. However, this is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong GPUs. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
`CUDA_DEVICE_ORDER` is an alternative environment variable you can use to control how the GPUs are ordered. You can order according to the following.
1. PCIe bus IDs that matches the order of [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) and [`rocm-smi`](https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/.doxygen/docBin/html/index.html) for NVIDIA and AMD GPUs respectively.
```bash
exportCUDA_DEVICE_ORDER=PCI_BUS_ID
```
2. GPU compute ability.
```bash
exportCUDA_DEVICE_ORDER=FASTEST_FIRST
```
The `CUDA_DEVICE_ORDER` is especially useful if your training setup consists of an older and newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. In this case, set `CUDA_DEVICE_ORDER=FASTEST_FIRST` to always use the newer and faster GPU first (`nvidia-smi` or `rocm-smi` still reports the GPUs in their PCIe order). Or you could also set `export CUDA_VISIBLE_DEVICES=1,0`.
@ -19,6 +19,9 @@ Hyperparameter search discovers an optimal set of hyperparameters that produces
This guide will go over how to set up a hyperparameter search for each of the backends.
> [!WARNING]
> [SigOpt](https://github.com/sigopt/sigopt-server) is in public archive mode and is no longer actively maintained. Try using Optuna, Weights & Biases or Ray Tune instead.
@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Image processors
Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision or video model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
- [`~BaseImageProcessor.center_crop`] to resize an image
- [`~BaseImageProcessor.normalize`] or [`~BaseImageProcessor.rescale`] pixel values
@ -15,9 +15,25 @@ rendered properly in your Markdown viewer.
# Transformers
Transformers is a library of pretrained natural language processing, computer vision, audio, and multimodal models for inference and training. Use Transformers to train models on your data, build inference applications, and generate text with large language models.
and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from `transformers`.
We pledge to help support new state-of-the-art models and democratize their usage by having their model definition be
simple, customizable, and efficient.
There are over 1M+ Transformers [model checkpoints](https://huggingface.co/models?library=transformers&sort=trending) on the [Hugging Face Hub](https://huggingface.com/models) you can use.
Explore the [Hub](https://huggingface.com/) today to find a model and use Transformers to help you get started right away.
## Features
@ -43,3 +59,6 @@ Transformers is designed for developers and machine learning engineers and resea
</a>
</div>
## Learn
If you're new to Transformers or want to learn more about transformer models, we recommend starting with the [LLM course](https://huggingface.co/learn/llm-course/chapter1/1?fw=pt). This comprehensive course covers everything from the fundamentals of how transformer models work to practical applications across various tasks. You'll learn the complete workflow, from curating high-quality datasets to fine-tuning large language models and implementing reasoning capabilities. The course contains both theoretical and hands-on exercises to build a solid foundational knowledge of transformer models as you learn.
@ -16,7 +16,8 @@ rendered properly in your Markdown viewer.
# Model debugging toolboxes
This page lists all the debugging and model adding tools used by the library, as well as the utility functions it provides for it.
This page lists all the debugging and model adding tools used by the library, as well as the utility functions it
provides for it.
Most of those are only useful if you are adding new models in the library.
@ -26,13 +27,14 @@ Most of those are only useful if you are adding new models in the library.
### Model addition debugger - context manager for model adders
This context manager is a power user tool intended for model adders.
It tracks all forward calls within a model forward and logs a slice of each input and output on a nested Json.
To note, this context manager enforces `torch.no_grad()`.
This context manager is a power user tool intended for model adders. It tracks all forward calls within a model forward
and logs a slice of each input and output on a nested JSON. To note, this context manager enforces `torch.no_grad()`.
### Rationale
Because when porting models to transformers, even from python to python, model adders often have to do a lot of manual operations, involving saving and loading tensors, comparing dtypes, etc. This small tool can hopefully shave off some time.
When porting models to transformers, even from python to python, model adders often have to do a lot of manual
operations, involving saving and loading tensors, comparing dtypes, etc. This small tool can hopefully shave off some
do_prune_layers=False# This will output ALL the layers of a model.
):
model,
debug_path="optional_path_to_your_directory",
do_prune_layers=False# This will output ALL the layers of a model.
):
output=model.forward(**inputs)
```
@ -73,8 +75,8 @@ with model_addition_debugger_context(
### Reading results
The debugger generates two files from the forward call, both with the same base name,
but ending either with `_SUMMARY.json` or with `_FULL_TENSORS.json`.
The debugger generates two files from the forward call, both with the same base name, but ending either with
`_SUMMARY.json` or with `_FULL_TENSORS.json`.
The first one will contain a summary of each module's _input_ and _output_ tensor values and shapes.
@ -142,8 +144,8 @@ The first one will contain a summary of each module's _input_ and _output_ tenso
{...andsoon
```
The `_FULL_TENSORS.json` file will display a full view of all tensors, which is useful
for comparing two files.
The `_FULL_TENSORS.json` file will display a full view of all tensors, which is useful for comparing two files.
```json
"pixel_values":{
"shape":"torch.Size([1, 5, 576, 588])",
@ -196,9 +198,38 @@ for comparing two files.
},
```
#### Saving tensors to disk
Some model adders may benefit from logging full tensor values to disk to support, for example, numerical analysis
across implementations.
Set `use_repr=False` to write tensors to disk using [SafeTensors](https://huggingface.co/docs/safetensors/en/index).
```python
withmodel_addition_debugger_context(
model,
debug_path="optional_path_to_your_directory",
do_prune_layers=False,
use_repr=False,# Defaults to True
):
output=model.forward(**inputs)
```
When using `use_repr=False`, tensors are written to the same disk location as the `_SUMMARY.json` and
`_FULL_TENSORS.json` files. The `value` property of entries in the `_FULL_TENSORS.json` file will contain a relative
path reference to the associated `.safetensors` file. Each tensor is written to its own file as the `data` property of
the state dictionary. File names are constructed using the `module_path` as a prefix with a few possible postfixes that
are built recursively.
* Module inputs are denoted with the `_inputs` and outputs by `_outputs`.
*`list` and `tuple` instances, such as `args` or function return values, will be postfixed with `_{index}`.
*`dict` instances will be postfixed with `_{key}`.
### Comparing between implementations
Once the forward passes of two models have been traced by the debugger, one can compare the `json` output files. See below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong.
Once the forward passes of two models have been traced by the debugger, one can compare the `json` output files. See
below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly
identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong.
@ -206,8 +237,124 @@ Once the forward passes of two models have been traced by the debugger, one can
### Limitations and scope
This feature will only work for torch-based models, and would require more work and case-by-case approach for say`jax`-based models that are usually compiled. Models relying heavily on external kernel calls may work, but trace will probably miss some things. Regardless, any python implementation that aims at mimicking another implementation can be traced once instead of reran N times with breakpoints.
This feature will only work for torch-based models, and would require more work and case-by-case approach for say
`jax`-based models that are usually compiled. Models relying heavily on external kernel calls may work, but trace will
probably miss some things. Regardless, any python implementation that aims at mimicking another implementation can be
traced once instead of reran N times with breakpoints.
If you pass `do_prune_layers=False` to your model debugger, ALL the layers will be outputted to `json`. Else, only the first and last layer will be shown. This is useful when some layers (typically cross-attention) appear only after N layers.
If you pass `do_prune_layers=False` to your model debugger, ALL the layers will be outputted to `json`. Else, only the
first and last layer will be shown. This is useful when some layers (typically cross-attention) appear only after N
layers.
[[autodoc]] model_addition_debugger_context
## Analyzer of skipped tests
### Scan skipped tests - for model adders and maintainers
This small util is a power user tool intended for model adders and maintainers. It lists all test methods
existing in `test_modeling_common.py`, inherited by all model tester classes, and scans the repository to measure
how many tests are being skipped and for which models.
### Rationale
When porting models to transformers, tests fail as they should, and sometimes `test_modeling_common` feels irreconcilable with the peculiarities of our brand new model. But how can we be sure we're not breaking everything by adding a seemingly innocent skip?
This utility:
- scans all test_modeling_common methods
- looks for times where a method is skipped
- returns a summary json you can load as a DataFrame/inspect
**For instance test_inputs_embeds is skipped in a whooping 39% proportion at the time of writing this util.**
📄 JSON saved to /home/pablo/git/transformers/all_tests_scan_result.json
```
And it will generate `all_tests_scan_result.json` file that you can inspect. The JSON is indexed by method name, and each entry follows this schema, indicating the origin as well (from `common`or `GenerationMixin`.)
```json
{
"<method_name>":{
"origin":"<test suite>"
"models_ran":["<model_name>",...],
"models_skipped":["<model_name>",...],
"skipped_proportion":<float>,
"reasons_skipped":["<model_name>: <reason>",
...
]
},
...
}
```
Which you can visualise as above with e.g. `pandas`
- align: Inputs_embeds is tested in individual model tests
- altclip: Inputs_embeds is tested in individual model tests
- audio_spectrogram_transformer: AST does not use inputs_embeds
- beit: BEiT does not use inputs_embeds
- bit: Bit does not use inputs_embeds
- blip: Blip does not use inputs_embeds
- blip_2: Inputs_embeds is tested in individual model tests
- bridgetower:
- canine: CANINE does not have a get_input_embeddings() method.
- ...
📄 JSON saved to /home/pablo/git/transformers/scan_test_inputs_embeds.json
```
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.