* accept custom device_mesh
* fix device_map
* assert that num_heads % tp_size == 0
* todo.
* ReplicateParallel
* handle tied weights
* handle dtensor in save_pretrained with safe_serialization
* tp test works
* doesnt work
* fix shard_and_distribute_module's rank should be local_rank
* tp=4 is correct
* dp+tp is broken
* todo allreduce with dtensors on another dim is annoying
* workaround to sync dp grads when using dtensors
* loading a checkpoint works
* wandb and compare losses with different tp/dp
* cleaning
* cleaning
* .
* .
* logs
* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention
* DP=2 TP=2 now works even with tied embeddings
* model.parameters() and model.module.parameters() are empty..
* reformat sanity_check_tensor_sync
* set atol=1e-4 for CP to pass
* try populate _parameters from named_modules
* refactors
TP2 DP2 works
CP2 DP2 works
* is_causal=True and pack sequences, no attn mask, and preshuffle dataset
* fix packing
* CP=4 doesn't work
* fix labels and position_ids for CP
* DP CP works with transformers 🥳🥳🥳
* refactor
* add example cp
* fixup
* revert sdpa changes
* example cleared
* add CP, DP to the mesh init
* nit
* clean
* use `ALL_PARALLEL_STYLES`
* style
* FSDP works
* log on 1 rank
* .
* fix?
* FSDP1 also has .parameters() bug
* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay
* .
* style and fixup
* move stuff around
* fix tests
* style
* let's make it a check
* add missing licences
* warning should be an info
* tp plan should not be NONE
* test all
* god damn it
* test all
---------
Co-authored-by: nouamanetazi <nouamane98@gmail.com>
* add seq_idx and fa kwargs
* update tests
* docs and grad ckpt support
* fmt
* better names
* test_raise_missing_padding_free_kwarg_errs
* + seq_idx in doc strings
* padding free training docs
* add link to pr plots
* raise err on attn_mask with padding free
* rm raising missing padding free err test
* BambaFlashAttentionKwargs
* run modular util for modular_granitemoehybrid.py
* accept custom device_mesh
* fix device_map
* assert that num_heads % tp_size == 0
* todo.
* ReplicateParallel
* handle tied weights
* handle dtensor in save_pretrained with safe_serialization
* tp test works
* doesnt work
* fix shard_and_distribute_module's rank should be local_rank
* tp=4 is correct
* dp+tp is broken
* todo allreduce with dtensors on another dim is annoying
* workaround to sync dp grads when using dtensors
* loading a checkpoint works
* wandb and compare losses with different tp/dp
* cleaning
* cleaning
* .
* .
* logs
* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention
* DP=2 TP=2 now works even with tied embeddings
* model.parameters() and model.module.parameters() are empty..
* reformat sanity_check_tensor_sync
* set atol=1e-4 for CP to pass
* try populate _parameters from named_modules
* refactors
TP2 DP2 works
CP2 DP2 works
* is_causal=True and pack sequences, no attn mask, and preshuffle dataset
* fix packing
* CP=4 doesn't work
* fix labels and position_ids for CP
* DP CP works with transformers 🥳🥳🥳
* refactor
* add example cp
* fixup
* revert sdpa changes
* example cleared
* add CP, DP to the mesh init
* nit
* clean
* use `ALL_PARALLEL_STYLES`
* style
* FSDP works
* log on 1 rank
* .
* fix?
* FSDP1 also has .parameters() bug
* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay
* .
* style and fixup
* move stuff around
* fix tests
* style
* let's make it a check
* warning should be an info
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
When preparing the causal attention mask at this point the mask comes
in as a float tensor with min value as a masked value.
It is not correct to convert it to bool and treat it as a bool mask as
this inverts the mask.
`torch.nn.functional.scaled_dot_product_attention` expects that a masked value is `False`.
I suspect that the `sdpa` implementation variant may not have been
thoroughly tested and that is why this error was not caught earlier.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Add Llama4TextModel to AutoModel mapping
using Llama4TextConfig on AutoModel.from_config raises a ValueError when it is expected to instantiate a Llama4TextModel
bnb quant tests: remove obsolete trust_remote_code test
The MPT model is now natively integrated in Transformers and no longer requires trust_remote_code=True. This removes the failing test_get_keys_to_not_convert_trust_remote_code and related usage, which depended on remote code and caused CI issues due to missing dependencies (e.g., triton_pre_mlir).
* Update modular_qwen2_5_omni.py
fix the error when loading quantized model by AuotAWQ.
* Update modeling_qwen2_5_omni.py
sync code to modular_qwen2_5_omni.py
* pipeline generation defaults
* add max_new_tokens=20 in test pipelines
* pop all kwargs that are used to parameterize generation config
* add class attr that tell us whether a pipeline calls generate
* tmp commit
* pt text gen pipeline tests passing
* remove failing tf tests
* fix text gen pipeline mixin test corner case
* update text_to_audio pipeline tests
* trigger tests
* a few more tests
* skips
* some more audio tests
* not slow
* broken
* lower severity of generation mode errors
* fix all asr pipeline tests
* nit
* skip
* image to text pipeline tests
* text2test pipeline
* last pipelines
* fix flaky
* PR comments
* handle generate attrs more carefully in models that cant generate
* same as above
* tmp commit (imports broken)
* working version; update tests
* remove line break
* shorter msg
* dola checks need num_beams=1; other minor PR comments
* update early trainer failing on bad gen config
* make fixup
* test msg
* Fix ModuleNotFoundError torchao.prototype.low_bit_optim since torchao v 0.11.0
* Fix space on blank line
* update torchao's AdamW4bit and AdamW8bit import for v0.11.0
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* add args support to fast image processors
* add comment for clarity
* fix-copies
* Handle child class args passed as both args or kwargs in call and preprocess functions
* revert support args passed as kwargs in overwritten preprocess
* fix image processor errors
* Add flash-attention-2 backend for ESM-2
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* update extended_attention_mask for fa2
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* add test_flash_attn_2_equivalence test
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
---------
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* enable optional RMS in BitLinear
* Fix naming
* Import RMS from Llama using config.*
* make fix-copies
* ran CI loop
* remove default BitNetQuantConfig values
* Fix BitNetQuantConfig to be Optional
* Fix config docstrings to match Optoinal
* Edit docstrings to match standards
---------
Co-authored-by: steinmetzc <codysteinmetz7@gmail.com>
Co-authored-by: codys12 <steinmetzc@dh-mgmt4.hpc.msoe.edu>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Include output embedding as well with `include_embedding` flag
Summary:
att
Test Plan:
python tests/quantization/torchao_integration/test_torchao.py -k test_include_embedding
Reviewers:
Subscribers:
Tasks:
Tags:
* format
* rename include_embedding to include_input_output_embeddings
---------
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* disable deepspeed when setting up fake trainer
* Apply style fixes
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* mvp
* remove trust_remote_code
* generate_from_hub
* handle requirements; docs
* english
* doc PR suggestions
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* changed remote code path to generate/generate.py
* model repo has custom generate -> override base generate
* check for proper inheritance
* some doc updates (missing: tag-related docs)
* update docs to model repo
* nit
* nit
* nits
* Update src/transformers/dynamic_module_utils.py
* Apply suggestions from code review
* Update docs/source/en/generation_strategies.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* trust remote code is required
* use new import utils for requirements version parsing
* use org examples
* add tests
* Apply suggestions from code review
Co-authored-by: Manuel de Prada Corral <6536835+manueldeprada@users.noreply.github.com>
* ascii file structure; tag instructions on readme.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Manuel de Prada Corral <6536835+manueldeprada@users.noreply.github.com>
* init vilt image processor fast
* Refactor image processor tests to use loop for all processors
* Add ViltImageProcessorFast with PyTorch-based optimized image processing
* Change made automatically by make fixup command
* Change made automatically by make fix-copies command
* Fix type hints in ViltImageProcessorFast for Python compatibility
* Define constants for image resizing based on COCO dataset aspect ratio
* Add missing property initializations to ViltImageProcessorFast
* Extract resize logic into dedicated method in ViltImageProcessorFast
* Extract padding logic into dedicated method
* Implement shape-based image grouping for optimized processing in Vilt
* Update test suite to verify ViltImageProcessorFast attributes
* Move variable declarations to _preprocess method parameters
* Remove unused parameters
* Rename _resize method to resize to override existing function
* Remove whitespace
* Remove unnecessary type check and conversion for stacked_images
* Remove redundant loop and apply padding directly to stacked images
* Refactor pad function to return images and mask as tuple instead of dict
* Add tests comparing padding masks in slow and fast implementations
* Update ViltImageProcessor tests to ensure compatibility between slow and fast implementations
* Replace add_start_docstrings with auto_docstring in ViltImageProcessorFast
* Move docstrings of custom args to ViltFastImageProcessorKwargs
* Use reorder_images function for both masks and images
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* fix llava processor to calculate unpad size correctly
* repo consistency
* Revert "repo consistency" & "setUp in llava family"
This reverts commit 26a50af8db5b15bb6b700db3d53342fe69579d8e.
* add edge case test for padding & unpadding
* compute unpadding size from original size
* make test config explicit
* Revert "compute unpadding size from original size"
This reverts commit 752cd27ad9710ab056c17a9986760c4651975540.
* Revert "add edge case test for padding & unpadding"
This reverts commit ccbd094d69c3f8f6a259159164284f60ba835bce.
* revert unpad logic
* remove irrelevant tests
* model test
* remove processor from model test
---------
Co-authored-by: jaycha <jaycha@ncsoft.com>
* chore(qwen2): display warning log only when sliding window attention is enabled
* Align modeling_qwen2.py and modular_qwen2.py
---------
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* accept arbitrary kwargs
* move user commands to a separate fn
* work with generation config files
* rm cmmt
* docs
* base generate flag doc section
* nits
* nits
* nits
* no <br>
* better basic args description
* initial design
* update all video processors
* add tests
* need to add qwen2-vl (not tested yet)
* add qwen2-vl in auto map
* fix copies
* isort
* resolve confilicts kinda
* nit:
* qwen2-vl is happy now
* qwen2-5 happy
* other models are happy
* fix copies
* fix tests
* add docs
* CI green now?
* add more tests
* even more changes + tests
* doc builder fail
* nit
* Update src/transformers/models/auto/processing_auto.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* small update
* imports correctly
* dump, otherwise this is getting unmanagebale T-T
* dump
* update
* another update
* update
* tests
* move
* modular
* docs
* test
* another update
* init
* remove flakiness in tests
* fixup
* clean up and remove commented lines
* docs
* skip this one!
* last fix after rebasing
* run fixup
* delete slow files
* remove unnecessary tests + clean up a bit
* small fixes
* fix tests
* more updates
* docs
* fix tests
* update
* style
* fix qwen2-5-vl
* fixup
* fixup
* unflatten batch when preparing
* dump, come back soon
* add docs and fix some tests
* how to guard this with new dummies?
* chat templates in qwen
* address some comments
* remove `Fast` suffix
* fixup
* oops should be imported from transforms
* typo in requires dummies
* new model added with video support
* fixup once more
* last fixup I hope
* revert image processor name + comments
* oh, this is why fetch test is failing
* fix tests
* fix more tests
* fixup
* add new models: internvl, smolvlm
* update docs
* imprt once
* fix failing tests
* do we need to guard it here again, why?
* new model was added, update it
* remove testcase from tester
* fix tests
* make style
* not related CI fail, lets' just fix here
* mark flaky for now, filas 15 out of 100
* style
* maybe we can do this way?
* don't download images in setup class
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Do not erase a cache_position initialization passed explicitly to generate(), if there is one.
But: Let initialization replace cache_position if it's set to None. I assume that if the value is explicitly passed but None, we should initialize anyway.
* update models
* why rename
* return attn weights when sdpa
* fixes
* fix attn implementation composite
* fix moshi
* add message
* add typings
* use explicitly all flags for each attn type
* fix some tests
* import what is needed
* kosmos on main has ew attention already, yay
* new models in main, run fixup
* won't fix kosmos yet
* fix-copies
* clean up after rebasing
* fix tests
* style
* dont cast attns to fp32
* did we update ruff? oke, let's just do what it asks
* fix pixtral after rebase
* Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model
* Fix invalid operand type
* Allow image_sizes to be optional in forward pass to fit tests
Disallow using sdpa and output_attentions
* Disallow using sdpa with output_attentions
* Delete useless comments, use eager attention from smolvlm, use pattern from mistral
* add _supports_attention_backend
* use kwargs instead of position_ids
---------
Co-authored-by: aurelien.lac <aurelien.lac@lighton.ai>
* Add fast image processor support for Swin2SR
* Add Swin2SR tests of fast image processing
* Update docs and remove unnecessary test func
* Fix docstring formatting
* Skip fast vs slow processing test
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* i guessreverted all CdGen classes
* style
* llava onevision
* fix copies
* fix some tests
* some more tests
* dump
* skip these
* nevermind, i am dumb
* revert fix not needed
* fixup
* fixup
* another fixup
* more fixup to make ci finally happy
* fixup after rebasing
* fix qwen tests
* add internVL + typos here and there
* image token index -> id
* style
* fix init weights
* revert blip-2 not supported
* address comments
* fix copies
* revert blip2 test file as well
* as discussed internally, revert back CdGen models
* fix some tests
* fix more tests for compile
* CI red
* fix copies
* enumerate explicitly allowed models
* address comments
* fix tests
* fixup
* style again
* add tests for new model class
* another fixup ( x _ x )
* [fixup] unused attributes can be removed post-deprecation
* Enable granite speech 3.3 tests
* skip sdpa test for granite speech
* Explicitly move model to device
* Use granite speech 2b in tests
---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* args keep_torch_compile=False in _save and _wwrap_method
* Fix FSDP execution on evaluation for torch_compile mode
* add test trainer FSDP + Torch Compile
* fix quality code
* make style
* Revert " make style"
This reverts commit 77e797f8829c50992cc21496be3d9a3e480e1c97.
* make style
* [fix] one pixel should be added when length is odd
* [fix] add vision_aspect_ratio args & typo
* [fix] style
* [fix] do not fix fast file directly
* [fix] convert using modular
* remove duplicate codes
* match unpad logic with pad logic
* test odd-sized images for llava & aria
* test unpad odd-sized padding for llava family
* fix style
* add kwarg to onvision modular
* move vision_aspect_ratio from image_processor to processor
(llava_onevision)
* add num_tokens_to_discard to the forward of Dinov2ForImageClassification
* redefine forward in modular file, remove change to modeling_dinov2 file
* run make fixup
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Implements last migrations for generation from `config.vocab_size` to `config.get_text_config().vocab.size`
In doing so, we enable multimodal models to fully leverage all existing generation features.
* Let notification service succeed even when artifacts and reported jobs on github have mismatch
* Use default trace msg if no trace msg available
* Add pop_default helper fn
* style
Summary:
Currently when we try to quantize input_embedding for some models, the output embedding
(lm_head) will also be quantized the same way, since they are tied, and this may not be what
we want. To break the tie, we added the option to allow people to
1. load unquantized weight
2. tie weights
3. quantize
so that the tie will be broken
Test Plan:
```
from transformers import (
AutoModelForCausalLM,
AutoProcessor,
AutoTokenizer,
TorchAoConfig,
)
from torchao.quantization.quant_api import (
IntxWeightOnlyConfig,
Int8DynamicActivationIntxWeightConfig,
AOPerModuleConfig
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch
model_id = "microsoft/Phi-4-mini-instruct"
embedding_config = IntxWeightOnlyConfig(
weight_dtype=torch.int8,
granularity=PerAxis(0),
)
linear_config = Int8DynamicActivationIntxWeightConfig(
weight_dtype=torch.int4,
weight_granularity=PerGroup(32),
weight_scale_dtype=torch.bfloat16,
)
quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(quantized_model)
print("embed_tokens.weight:", quantized_model.model.embed_tokens.weight)
print("lm head weight:", quantized_model.lm_head.weight)
from transformers.modeling_utils import find_tied_parameters
print(find_tied_parameters(quantized_model))
```
Reviewers:
Subscribers:
Tasks:
Tags:
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* rm already deprecated padding max length
* truncate_strategy AS AN ARG is already deprecated for a few years
* fix
* rm test_padding_to_max_length
* rm pad_to_max_length=True in other tests
* rm from common
* missed fnet
* Support `AOPerModuleConfig` and include_embedding
Summary:
This PR adds support per module configuration for torchao
Also added per module quantization examples:
1. Quantizing different layers with different quantization configs
2. Skip quantization for certain layers
Test Plan:
python tests/quantization/torchao_integration/test_torchao.py -k test_include_embedding
python tests/quantization/torchao_integration/test_torchao.py -k test_per_module_config_skip
Reviewers:
Subscribers:
Tasks:
Tags:
* format
* format
* inlcude embedding remove input embedding from module not to convert
* more docs
* Update docs/source/en/quantization/torchao.md
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Update src/transformers/quantizers/quantizer_torchao.py
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Update src/transformers/quantizers/quantizer_torchao.py
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
---------
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Support FlaxPreTrainedModel to load model checkpoint from subfolder in local directory as safetensors format
Signed-off-by: Yan Zhao <zhao.y4@northeastern.edu>
* Unhardcode use_chunked_attention, fix no_rope_layers
* Go back to exhaustive list of bools
* Conversion and modeling updates
* Fix rope
* Unhardcode rope
* Fix context length
* style
* Minor updates to conversion
* Use StaticCache
* Minor simplification
* DynamicCache 🤦
* Style
* Style
* No more red flaky tests in the CI!
* Remove the CircleCI logic as well
* Revert most changes including is_flaky behaviour
* make fixup
* Move to a more sensible place
* Mark a flaky test that failed on this PR!
* correct import
* update
* update
* update
* update
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Fix check of unecessary packages (issue #37626)
* Reformat using ruff
* And a condition to avoind the risk of matching a random object in `import_utils`
* Reformat
* copy the last changes from broken PR
* small format
* some fixes and refactoring after review
* format
* add config attr for loss
* some fixes and refactoring
* fix copies
* fix style
* add test for d-fine resnet
* fix decoder layer prop
* fix dummies
* format init
* remove extra print
* refactor modeling, move resnet into separate folder
* fix resnet config
* change resnet on hgnet_v2, add clamp into decoder
* fix init
* fix config doc
* fix init
* fix dummies
* fix config docs
* fix hgnet_v2 config typo
* format modular
* add image classification for hgnet, some refactoring
* format tests
* fix dummies
* fix init
* fix style
* fix init for hgnet v2
* fix index.md, add init rnage for hgnet
* fix conversion
* add missing attr to encoder
* add loss for d-fine, add additional output for rt-detr decoder
* tests and docs fixes
* fix rt_detr v2 conversion
* some fixes for loos and decoder output
* some fixes for loss
* small fix for converted modeling
* add n model config, some todo comments for modular
* convert script adjustments and fixes, small refact
* remove extra output for rt_detr
* make some outputs optionsl, fix conversion
* some posr merge fixes
* small fix
* last field fix
* fix not split for hgnet_v2
* disable parallelism test for hgnet_v2 image classification
* skip multi gpu for d-fine
* adjust after merge init
* remove extra comment
* fix repo name references
* small fixes for tests
* Fix checkpoint path
* Fix consistency
* Fixing docs
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* added fast image processor for VitMatte including updated and new tests, fixed a bug in the slow image processor that processed images incorrectly for input format ChannelDimension.FIRST in which case the trimaps were not added in the correct dimension, this bug was also reflected in the tests through incorretly shaped trimaps being passed
* final edits for fast vitmatte image processor and tests
* final edits for fast vitmatte image processor and tests
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* added the configuartion for sam_hq
* added the modeelling for sam_hq
* added the sam hq mask decoder with hq features
* added the code for the samhq
* added the code for the samhq
* added the code for the samhq
* Delete src/transformers/models/sam_hq/modelling_sam_hq.py
* added the code for the samhq
* added the code for the samhq
* added the chnages for the modeelling
* added the code for sam hq for image processing
* added code for the sam hq model
* added the required changes
* added the changes
* added the key mappings for the sam hq
* adding the working code of samhq
* added the required files
* adding the pt object
* added the push to hub account
* added the args for the sam maks decoder
* added the args for the sam hq vision config
* aded the some more documentation
* removed the unecessary spaces
* all required chnages
* removed the image processor
* added the required file
* added the changes for the checkcopies
* added the code for modular file
* added the changes for the __init file
* added the code for the interm embeds
* added the code for sam hq
* added the changes for modular file
* added the test file
* added the changes required
* added the changes required
* added the code for the
* added the cl errors
* added the changes
* added the required changes
* added the some code
* added the code for the removing image processor
* added the test dimensins
* added the code for the removing extra used variables
* added the code for modeluar file hf_mlp for a better name
* removed abbrevaation in core functionality
* removed abbrevaation in core functionality
* .contiguous() method is often used to ensure that the tensor is stored in a contiguous block of memory
* added the code which is after make fixup
* added some test for the intermediate embeddings test
* added the code for the torch support in sam hq
* added the code for the updated modular file
* added the changes for documentations as mentioned
* removed the heading
* add the changes for the code
* first mentioned issue resolved
* added the changes code to processor
* added the easy loading to init file
* added the changes to code
* added the code to changes
* added the code to work
* added the code for sam hq
* added the code for sam hq
* added the code for the point pad value
* added the small test for the image embeddings and intermediate embedding
* added the code
* added the code
* added the code for the tests
* added the code
* added ythe code for the processor file
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code for tests and some checks
* added some code
* added the code
* added the code
* added some code
* added some code
* added the changes for required
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added the code
* added some changes
* added some changes
* removed spaces and quality checks
* added some code
* added some code
* added some code
* added code quality checks
* added the checks for quality checks
* addded some code which fixes test_inference_mask_generation_no_point
* added code for the test_inference_mask_generation_one_point_one_bb
* added code for the test_inference_mask_generation_one_point_one_bb_zero
* added code for the test_inference_mask_generation_one_box
* added some code in modelling for testing
* added some code which sort maks with high score
* added some code
* added some code
* added some code for the move KEYS_TO_MODIFY_MAPPING
* added some code for the unsqueeze removal
* added some code for the unsqueeze removal
* added some code
* added some code
* add some code
* added some code
* added some code
* added some testign values changed
* added changes to code in sam hq for readbility purpose
* added pre commit checks
* added the fix samvisionmodel for compatibilty
* added the changes made on sam by cyyever
* fixed the tests for samhq
* added some the code
* added some code related to init file issue during merge conflicts
* remobved the merge conflicts
* added changes mentioned by aruther and mobap
* added changes mentioned by aruther and mobap
* solving quality checks
* added the changes for input clearly
* added the changes
* added changes in mask generation file rgearding model inputs and sam hq quargs in processor file
* added changes in processor file
* added the Setup -> setupclass conversion
* added the code mentioned for processor
* added changes for the code
* added some code
* added some code
* added some code
---------
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Two PEFT tests are actually failing:
tests/peft_integration/test_peft_integration.py::PeftIntegrationTester::test_delete_adapter
tests/peft_integration/test_peft_integration.py::PeftIntegrationTester::test_peft_pipeline_no_warning
This must have been going on for some time but was apparently never
noticed. The cause is that the tests themselves are faulty, the PEFT
integration is correct in these cases.
test_delete_adapter
The first faulty test was introduced by #34650. AFAICT, it should never
have passed in the first place, the PEFT integration logic was not
changed in the meantime. At this point, the logs for the PR CI are gone,
so I'm not sure if the test passed back then or not.
test_peft_pipeline_no_warning
This test was introduced in #36783 and should also never have passed, as
the self.assertNoLogs context manager only returns None, thus the assert
should never have worked (mea culpa for suggesting this code snippet).
Here too, the CI logs are deleted by now, so I can't check if the test
already failed back then.
* Fix wrong position_ids shape in doc
Supported by ClvpDecoder.forward, line 1212--1215:
src/transformers/models/clvp/modeling_clvp.py:
1212 if inputs_embeds is None:
1213 inputs_embeds = self.input_embeds_layer(input_ids)
1214 position_embeds = self.position_embeds_layer(position_ids)
1215 inputs_embeds = inputs_embeds + position_embeds
* Fix possibly wrong input_ids shape in doc
Since 'input_ids_length' was mentioned immediately after the shape `(batch_size, sequence_length)`, it doesn't make sense to me for `input_ids` to have such shape---IMO it ought to have shape `(batch_size, input_ids_length)` instead.
* Fix possibly wrong inputs_embeds shape in doc
Supported by CTRLModel.forward, line 448--449:
src/transformers/models/ctrl/modeling_ctrl.py:
448 if inputs_embeds is None:
449 inputs_embeds = self.w(input_ids)
This commit is introduced due to commit 6f36b56497828642b65f54ea26aa4064186de57a.
* Fix possibly wrong token_type_ids shape in doc
Supported by CTRLModel.forward, line 441--460:
src/transformers/models/ctrl/modeling_ctrl.py:
441 if token_type_ids is not None:
442 token_type_ids = token_type_ids.view(-1, input_shape[-1])
443 token_type_embeds = self.w(token_type_ids)
444 token_type_embeds *= np.sqrt(self.d_model_size)
445 else:
446 token_type_embeds = 0
447
448 if inputs_embeds is None:
449 inputs_embeds = self.w(input_ids)
450 # inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
451 seq_len = input_shape[-1]
452 mask = torch.triu(torch.ones(seq_len + past_length, seq_len + past_length), 1).to(device)
453
454 inputs_embeds *= np.sqrt(self.d_model_size)
455
456 # `self.pos_encoding` won't be sent to the correct device along the model, so we do it manually.
457 self.pos_encoding = self.pos_encoding.to(device)
458 pos_embeds = self.pos_encoding[position_ids, :]
459
460 hidden_states = inputs_embeds + pos_embeds + token_type_embeds
This commit is introduced due to commit 6f36b56497828642b65f54ea26aa4064186de57a.
* Fix possibly wrong position_ids shape in doc
Supported by CTRLModel.forward, line 448--460:
src/transformers/models/ctrl/modeling_ctrl.py:
448 if inputs_embeds is None:
449 inputs_embeds = self.w(input_ids)
450 # inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
451 seq_len = input_shape[-1]
452 mask = torch.triu(torch.ones(seq_len + past_length, seq_len + past_length), 1).to(device)
453
454 inputs_embeds *= np.sqrt(self.d_model_size)
455
456 # `self.pos_encoding` won't be sent to the correct device along the model, so we do it manually.
457 self.pos_encoding = self.pos_encoding.to(device)
458 pos_embeds = self.pos_encoding[position_ids, :]
459
460 hidden_states = inputs_embeds + pos_embeds + token_type_embeds
This commit is introduced due to commit 6f36b56497828642b65f54ea26aa4064186de57a.
* Fix wrong token_type_ids shape in doc
Supported by TFCTRLMainLayer.call, line 376--394:
src/transformers/models/ctrl/modeling_tf_ctrl.py:
376 if token_type_ids is not None:
377 token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
378 token_type_embeds = self.w(token_type_ids)
379 token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, dtype=token_type_embeds.dtype))
380 else:
381 token_type_embeds = tf.constant(0.0)
382 position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])
383
384 if inputs_embeds is None:
385 check_embeddings_within_bounds(input_ids, self.w.input_dim)
386 inputs_embeds = self.w(input_ids)
387 seq_len = input_shape[-1]
388 mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
389
390 inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, inputs_embeds.dtype))
391
392 pos_embeds = tf.gather(self.pos_encoding, position_ids)
393 pos_embeds = tf.cast(pos_embeds, dtype=token_type_embeds.dtype)
394 hidden_states = inputs_embeds + pos_embeds + token_type_embeds
* Fix wrong position_ids shape in doc
Supported by TFCTRLMainLayer.call, line 384--394:
src/transformers/models/ctrl/modeling_tf_ctrl.py:
384 if inputs_embeds is None:
385 check_embeddings_within_bounds(input_ids, self.w.input_dim)
386 inputs_embeds = self.w(input_ids)
387 seq_len = input_shape[-1]
388 mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
389
390 inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, inputs_embeds.dtype))
391
392 pos_embeds = tf.gather(self.pos_encoding, position_ids)
393 pos_embeds = tf.cast(pos_embeds, dtype=token_type_embeds.dtype)
394 hidden_states = inputs_embeds + pos_embeds + token_type_embeds
* Fix wrong inputs_embeds shape in doc
Supported by TFCTRLMainLayer.call, line 384--394:
src/transformers/models/ctrl/modeling_tf_ctrl.py:
384 if inputs_embeds is None:
385 check_embeddings_within_bounds(input_ids, self.w.input_dim)
386 inputs_embeds = self.w(input_ids)
387 seq_len = input_shape[-1]
388 mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
389
390 inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, inputs_embeds.dtype))
391
392 pos_embeds = tf.gather(self.pos_encoding, position_ids)
393 pos_embeds = tf.cast(pos_embeds, dtype=token_type_embeds.dtype)
394 hidden_states = inputs_embeds + pos_embeds + token_type_embeds
* Fix wrong inputs_embeds shape in doc
Supported by ClvpDecoder.forward, line 1212--1213:
src/transformers/models/clvp/modeling_clvp.py:
1212 if inputs_embeds is None:
1213 inputs_embeds = self.input_embeds_layer(input_ids)
* Fix wrong position_ids shape in doc
Supported by FlaxGemmaPreTrainedModel.__call__, line 502--508:
src/transformers/models/gemma/modeling_flax_gemma.py:
502 batch_size, sequence_length = input_ids.shape
503
504 if position_ids is None:
505 if past_key_values is not None:
506 raise ValueError("Make sure to provide `position_ids` when passing `past_key_values`.")
507
508 position_ids = jnp.broadcast_to(jnp.arange(sequence_length)[None, :], (batch_size, sequence_length))
* Fix wrong position_ids shape in doc
Supported by FlaxGPT2PreTrainedModel.__call__, line 482--488:
src/transformers/models/gpt2/modeling_flax_gpt2.py:
482 batch_size, sequence_length = input_ids.shape
483
484 if position_ids is None:
485 if past_key_values is not None:
486 raise ValueError("Make sure to provide `position_ids` when passing `past_key_values`.")
487
488 position_ids = jnp.broadcast_to(jnp.arange(sequence_length)[None, :], (batch_size, sequence_length))
* Fix wrong position_ids shape in doc
Supported by GPT2Model.forward, line 918--921:
src/transformers/models/gpt2/modeling_gpt2.py:
918 if inputs_embeds is None:
919 inputs_embeds = self.wte(input_ids)
920 position_embeds = self.wpe(position_ids)
921 hidden_states = inputs_embeds + position_embeds.to(inputs_embeds.device)
* Fix wrong inputs_embeds shape in doc
Supported by GPT2Model.forward, line 918--919:
src/transformers/models/gpt2/modeling_gpt2.py:
918 if inputs_embeds is None:
919 inputs_embeds = self.wte(input_ids)
* Fix wrong labels shape in doc
Supported by GPT2LMHeadModel.forward, line 1156--1157:
src/transformers/models/gpt2/modeling_gpt2.py:
1156 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
1157 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
* Fix wrong labels shape in doc
Supported by GPT2DoubleHeadsModel.forward, line 1314--1315:
src/transformers/models/gpt2/modeling_gpt2.py:
1314 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
1315 `labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size - 1]`. All labels set to
* Fix wrong token_type_ids shape in doc
Supported by TFGPT2MainLayer.call, line 486--500:
src/transformers/models/gpt2/modeling_tf_gpt2.py:
486 if inputs_embeds is None:
487 check_embeddings_within_bounds(input_ids, self.config.vocab_size)
488 inputs_embeds = self.wte(input_ids)
489
490 position_embeds = self.wpe(position_ids)
491
492 if token_type_ids is not None:
493 token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
494 token_type_embeds = self.wte(token_type_ids)
495 else:
496 token_type_embeds = tf.constant(0.0)
497
498 position_embeds = tf.cast(position_embeds, dtype=inputs_embeds.dtype)
499 token_type_embeds = tf.cast(token_type_embeds, dtype=inputs_embeds.dtype)
500 hidden_states = inputs_embeds + position_embeds + token_type_embeds
* Fix wrong position_ids shape in doc
Supported by TFGPT2MainLayer.call, line 486--500:
src/transformers/models/gpt2/modeling_tf_gpt2.py:
486 if inputs_embeds is None:
487 check_embeddings_within_bounds(input_ids, self.config.vocab_size)
488 inputs_embeds = self.wte(input_ids)
489
490 position_embeds = self.wpe(position_ids)
491
492 if token_type_ids is not None:
493 token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
494 token_type_embeds = self.wte(token_type_ids)
495 else:
496 token_type_embeds = tf.constant(0.0)
497
498 position_embeds = tf.cast(position_embeds, dtype=inputs_embeds.dtype)
499 token_type_embeds = tf.cast(token_type_embeds, dtype=inputs_embeds.dtype)
500 hidden_states = inputs_embeds + position_embeds + token_type_embeds
* Fix wrong inputs_embeds shape in doc
Supported by TFGPT2MainLayer.call, line 486--488:
src/transformers/models/gpt2/modeling_tf_gpt2.py:
486 if inputs_embeds is None:
487 check_embeddings_within_bounds(input_ids, self.config.vocab_size)
488 inputs_embeds = self.wte(input_ids)
* Fix wrong position_ids shape in doc
Supported by GPTBigCodeModel.forward, line 962--965:
src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py:
962 if inputs_embeds is None:
963 inputs_embeds = self.wte(input_ids)
964 position_embeds = self.wpe(position_ids)
965 hidden_states = inputs_embeds + position_embeds.to(inputs_embeds.device)
* Fix wrong inputs_embeds shape in doc
Supported by GPTBigCodeModel.forward, line 962--963:
src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py:
962 if inputs_embeds is None:
963 inputs_embeds = self.wte(input_ids)
* Fix wrong labels shape in doc
Supported by GPTBigCodeForCausalLM.forward, line 1158--1159:
src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py:
1158 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
1159 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
* Fix wrong position_ids shape in doc
Supported by FlaxGPTNeoModule.__call__, line 549--552:
src/transformers/models/gpt_neo/modeling_flax_gpt_neo.py:
549 input_embeds = self.wte(input_ids.astype("i4"))
550 position_embeds = self.wpe(position_ids.astype("i4"))
551
552 hidden_states = input_embeds + position_embeds
* Fix wrong position_ids shape in doc
Supported by GPTNeoModel.forward, line 685--720:
src/transformers/models/gpt_neo/modeling_gpt_neo.py:
685 if inputs_embeds is None:
686 inputs_embeds = self.wte(input_ids)
687
688 # kept for BC (non `Cache` `past_key_values` inputs)
689 return_legacy_cache = False
690 if use_cache and not isinstance(past_key_values, Cache):
691 return_legacy_cache = True
692 if past_key_values is None:
693 past_key_values = DynamicCache()
694 else:
695 past_key_values = DynamicCache.from_legacy_cache(past_key_values)
696 logger.warning_once(
697 "We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and "
698 "will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class "
699 "(https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)"
700 )
701
702 seq_length = inputs_embeds.shape[1]
703 if cache_position is None:
704 past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
705 cache_position = torch.arange(past_seen_tokens, past_seen_tokens + seq_length, device=inputs_embeds.device)
706
707 if position_ids is None:
708 position_ids = cache_position.unsqueeze(0)
709
710 causal_mask = self._update_causal_mask(
711 attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
712 )
713
714 # Prepare head mask if needed
715 # 1.0 in head_mask indicate we keep the head
716 # attention_probs has shape bsz x num_heads x N x N
717 # head_mask has shape n_layer x batch x num_heads x N x N
718 head_mask = self.get_head_mask(head_mask, self.config.num_layers)
719 position_embeds = self.wpe(position_ids)
720 hidden_states = inputs_embeds + position_embeds
* Fix wrong inputs_embeds shape in doc
Supported by GPTNeoModel.forward, line 685--686:
src/transformers/models/gpt_neo/modeling_gpt_neo.py:
685 if inputs_embeds is None:
686 inputs_embeds = self.wte(input_ids)
* Fix wrong labels shape in doc
Supported by GPTNeoForCausalLM.forward, line 968--969:
src/transformers/models/gpt_neo/modeling_gpt_neo.py:
968 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
969 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
* Fix wrong position_ids shape in doc
Supported by FlaxGPTJPreTrainedModel.__call__, line 455--461:
src/transformers/models/gptj/modeling_flax_gptj.py:
455 batch_size, sequence_length = input_ids.shape
456
457 if position_ids is None:
458 if past_key_values is not None:
459 raise ValueError("Make sure to provide `position_ids` when passing `past_key_values`.")
460
461 position_ids = jnp.broadcast_to(jnp.arange(sequence_length)[None, :], (batch_size, sequence_length))
* Fix wrong token_type_ids shape in doc
Supported by TFGPTJMainLayer.call, line 482--493:
src/transformers/models/gptj/modeling_tf_gptj.py:
482 if inputs_embeds is None:
483 check_embeddings_within_bounds(input_ids, self.wte.vocab_size)
484 inputs_embeds = self.wte(input_ids, mode="embedding")
485
486 if token_type_ids is not None:
487 token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
488 token_type_embeds = self.wte(token_type_ids, mode="embedding")
489 else:
490 token_type_embeds = tf.constant(0.0)
491
492 token_type_embeds = tf.cast(token_type_embeds, dtype=inputs_embeds.dtype)
493 hidden_states = inputs_embeds + token_type_embeds
* Fix wrong position_ids shape in doc
Supported by TFGPTJMainLayer.call, line 434--449:
src/transformers/models/gptj/modeling_tf_gptj.py:
434 elif input_ids is not None:
435 input_shape = shape_list(input_ids)
436 input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])
437 elif inputs_embeds is not None:
438 input_shape = shape_list(inputs_embeds)[:-1]
439 else:
440 raise ValueError("You have to specify either input_ids or inputs_embeds")
441
442 if past_key_values is None:
443 past_length = 0
444 past_key_values = [None] * len(self.h)
445 else:
446 past_length = shape_list(past_key_values[0][0])[-2]
447
448 if position_ids is None:
449 position_ids = tf.expand_dims(tf.range(past_length, input_shape[-1] + past_length), axis=0)
* Fix wrong inputs_embeds shape in doc
Supported by TFGPTJMainLayer.call, line 482--484:
src/transformers/models/gptj/modeling_tf_gptj.py:
482 if inputs_embeds is None:
483 check_embeddings_within_bounds(input_ids, self.wte.vocab_size)
484 inputs_embeds = self.wte(input_ids, mode="embedding")
* Fix wrong labels shape in doc
Supported by TFGPTJForCausalLM.call, line 812--813:
src/transformers/models/gptj/modeling_tf_gptj.py:
812 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
813 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
* Fix possibly wrong input_ids shape in doc
Since 'input_ids_length' was mentioned immediately after the shape `(batch_size, sequence_length)`, it doesn't make sense to me for `input_ids` to have such shape---IMO it ought to have shape `(batch_size, input_ids_length)` instead.
* Fix possibly wrong token_type_ids shape in doc
Supported by ImageGPTModel.forward, line 773--780:
src/transformers/models/imagegpt/modeling_imagegpt.py:
773 if inputs_embeds is None:
774 inputs_embeds = self.wte(input_ids)
775 position_embeds = self.wpe(position_ids)
776 hidden_states = inputs_embeds + position_embeds.to(inputs_embeds.device)
777
778 if token_type_ids is not None:
779 token_type_embeds = self.wte(token_type_ids)
780 hidden_states = hidden_states + token_type_embeds
This commit is introduced due to commit 8e594a4143cca79f165b99e4ed4c9f3a90047bf3.
* Fix possibly wrong position_ids shape in doc
Supported by ImageGPTModel.forward, line 773--776:
src/transformers/models/imagegpt/modeling_imagegpt.py:
773 if inputs_embeds is None:
774 inputs_embeds = self.wte(input_ids)
775 position_embeds = self.wpe(position_ids)
776 hidden_states = inputs_embeds + position_embeds.to(inputs_embeds.device)
This commit is introduced due to commit 8e594a4143cca79f165b99e4ed4c9f3a90047bf3.
* Fix possibly wrong inputs_embeds shape in doc
Supported by ImageGPTModel.forward, line 773--774:
src/transformers/models/imagegpt/modeling_imagegpt.py:
773 if inputs_embeds is None:
774 inputs_embeds = self.wte(input_ids)
This commit is introduced due to commit 8e594a4143cca79f165b99e4ed4c9f3a90047bf3.
* Fix possibly wrong labels shape in doc
Supported by ImageGPTForCausalImageModeling.forward, line 923--924:
src/transformers/models/imagegpt/modeling_imagegpt.py:
923 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
924 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
This commit is introduced due to commit 8e594a4143cca79f165b99e4ed4c9f3a90047bf3.
* Fix possibly wrong labels shape in doc
Supported by ImageGPTModel.forward, line 665--666:
src/transformers/models/imagegpt/modeling_imagegpt.py:
665 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
666 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
This commit is introduced due to commit 8e594a4143cca79f165b99e4ed4c9f3a90047bf3.
* Fix wrong position_ids shape in doc
Supported by FlaxLlamaPreTrainedModel.__call__, line 484--490:
src/transformers/models/llama/modeling_flax_llama.py:
484 batch_size, sequence_length = input_ids.shape
485
486 if position_ids is None:
487 if past_key_values is not None:
488 raise ValueError("Make sure to provide `position_ids` when passing `past_key_values`.")
489
490 position_ids = jnp.broadcast_to(jnp.arange(sequence_length)[None, :], (batch_size, sequence_length))
* Fix wrong position_ids shape in doc
Supported by FlaxMistralPreTrainedModel.__call__, line 478--484:
src/transformers/models/mistral/modeling_flax_mistral.py:
478 batch_size, sequence_length = input_ids.shape
479
480 if position_ids is None:
481 if past_key_values is not None:
482 raise ValueError("Make sure to provide `position_ids` when passing `past_key_values`.")
483
484 position_ids = jnp.broadcast_to(jnp.arange(sequence_length)[None, :], (batch_size, sequence_length))
* Fix qwen2_5 get_rope_index tensor device locations
* simpler fix
* edit right file for modular model
* add a test
* try normalizing type to fix non-video
* fix some imports
* add a video forward test with dummy input
* skip compilation on cpu offload
* add test
* better logic
* docstring
* boolean logic
* add disk offload check
* warn users if compilation options are set but compilation doesn happen
* fix test
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Init `SinusoidsPositionEmbedding` with float to avoid precision problem
* fix hidden_state for talker
* Update modular_qwen2_5_omni.py
* Move hidden processing out from thinker
* fixup
---------
Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>
* fast image processor template for MobileNetV1 via transformers-cli
* Add fast image processors and unify tests for slow/fast image processor classes
* added loop over image_processor_list for all tests and removed boilerplate comments.
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* support poolformer fast image processor
* support test for crop_pct=None
* run make style
* Apply suggestions from code review
* rename test
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* tokenize inputs directly in apply_chat_template
* refactor processing
* revert changes processing llava
* Update docs
* fix issue with str being iterable
* add test chat text only
* change function name
- Since the `get_text_config` references an instance variable within
the class (`self.thinker_config`), the `get_text_config` method
should not be a classmethod.
- Before this fix, users were getting the following error:
'''
AttributeError: type object 'Qwen2_5OmniConfig' has no attribute 'thinker_config'
'''
* new card for mbart and mbart50
* removed comment BADGES
* Update mBart overview
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix typo (MBart to mBart)
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* maybe fix typo
* update typo and combine notes
* changed notes
* changed the example sentence
* fixed grammatical error and removed some lines from notes example
* missed one word
* removed documentation resources and added some lines of example code back in notes.
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix: RecurrentGemma crashes during inference for inputs longer than sliding window width
* fix recurrentgemma tests; add long test bigger than context window
* Restructure torchao quantization examples
Summary:
Mainly structured the examples by hardwares and then listed
the recommended quantization methods for each hardware H100 GPU, A100 GPU and CPU
Also added example for push_to_hub
Test Plan:
not required
Reviewers:
Subscribers:
Tasks:
Tags:
* update
* drop float8 cpu
* address comments and simplify
* small update
* link update
* minor update
* Set default value for output_attentions parameter in Gemma2 and Gemma3 models
* update
* fix
* fix
---------
Co-authored-by: chenin <wangzhichen@encosmart.com>
* [fix] make legacy bnb code work
* [fix] use get with default instead of getter
* add test for bnb 8bit optim skip embed
* [fix] style
* add require annotation of bnb
---------
Co-authored-by: jaycha <jaycha@ncsoft.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* fix: qwen2.5 omni modular get_rope_index
* test: add test for qwen2.5 omni rope index (video with audio input)
* style
* expected_position_ids readability
* fix: use spatial_merge_size = 1 in unit test
Update generation_strategies.md
The prompt text shown in the example does not match what is inside the generated output. As the generated output always include the prompt, the correct prompt should be "Hugging Face is an open-source company".
* initial commit
* add convert internvl
* add first end-to-end working internvl
* nit prompt and image proc
* add working chat template
* add conversion llama-based models
* add tests
* pass all tests
* fix isort
* fix modular after main merge
* add video processing for internvl
* add support for interlaced images and videos
* Remove processing and config from modular, add more tests
* add llama model tests
* Modify processor for compatibility with refactored got ocr image processor
* add comments in processor
* Add docs and nits
* change video processing to use custom sample_indices_fn
* rebase and fix tests
* add processor tests
* Add changes Raushan review
* Use the new attention interface for the vision model
* nits
* add support for custom video_load_backend
* remove mention to InternVLTokenizer
* refactor vision model to simplify logic
* refactor processor for better readibility
* fix copies
* fix require av processor test
* refactor internVL vision
* Update processor and fix processing tests
* fix docstring
* update convert_weights for internvl3
* change image processor to fast by default
* remove do_center_crop=True in convert_weights
* force use_cache to True
* push_to_hub before reloading
* fix internVLVision for larger models
* update convert weight for qk norm
* fix convert_weights
* fix eos_token_id in convert
* update docs and integration tests
* make modifs after review
* fix wrong k_norm and reduce modular
* change image_token_index to image_token_id
* change checkpoint to OpenGVLab org
* last nits
* explicitely del self.num_key_value_groups
* add extra special tokens
* fix issue that some example with no trainer use accelerator.end_training in a wrong way
* reformat code
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* use only `xxx_token_id` for multimodal tokens
* update modeling files as well
* fixup
* why fixup doesn't fix modular docstring first?
* janus, need to update configs in the hub still
* last fixup
* Iterative generation using input embeds
* Add Janus model
* discard changes
* Janus imports
* Refactor config and processor
* Added Vision tower of Janus
* Import Janus Image processor
* Vision tower fixes
* Refactor code
* Added VQ Model
* Complete model integration
* temp conversion script
* processor refactor
* Adding files to facilitate pulling
* Fixes after debugging
* Skip test for these models
* Add Janus Model
* discard changes
* Janus imports
* Refactor config and processor
* Added Vision tower of Janus
* Import Janus Image processor
* Vision tower fixes
* Refactor code
* Added VQ Model
* Complete model integration
* temp conversion script
* processor refactor
* Adding files to facilitate pulling
* Fixes after debugging
* Refactor to Text config
* ✨ Added generate function
* Saving intermediate convert file. Still need to read configs from the hub and convert them to our format.
* Adding version that reads from the JSON files. Still have to tweak some parameters manually.
* relative imports
* Initial tests
* Refactor image processor
* Seemingly working version of the conversion script, will need to test further.
* Adding command message
* Fixing conflicting JanusTextConfig class
* Incorporating some of the discussed changes.
* Small fix to create dir.
* Removing system from JINJA template
* Adding draft processor tests
* style fixes
* Minor fixes and enhancement
* added generation config
* Initial tests
* Small modifications, tests are now passing.
* Small changes I noticed while reading code.
* more fixes
* Added JanusModel class
* Small merge adaptations
* Small merge adaptations
* Image processing tests passing
* More tests and fixes
* Convert script updated and refactored
* Tests and cleanup
* make style
* Postprocessing for image generation
* generate refactor
* fixes
* - Passing tests that write a part of the model to cpu (e.g. test_cpu_offload)
- Passing tests of dispatching SDPA
- Only gradient checkpointing tests are left.
* Removing temporary code
* Changes
* Writing change to modular
* Added JanusVisionModel. SDPA dispatch tests pass more robustly. Gradient checkpoint tests are next
* Gradient checkpoint tests passing
* Removing debug code
* Major generate refactor 😮💨
* Temp changes for testing
* Green quality CI
* 2 out of 4 integration tests passing
* breadcrumbs
* Usage Examples
* Regenerate modeling after merge
* dirty code
* JanusIntegrationTest are passing
* breadcrumbs
* happy CI
* fixes
* Changing template
* nits
* Text generation logits matching original codebase at 100% precision
* Remove ./tmp from git tracking
* Remove ./tmp from git tracking
* Checkpointing changes after reviewing
* Fixing code in docstrings
* CHanging comments and small bug in convert file
* Fixing bug in image_token_id for 7B version
* Removing line that was added by both of us
* Pushing changes after discussion. Only one left is to change the key mapping for convert file.
* Updating module file
* New convert file using dict. Tested that it is equivalent to the old one by:
- comparing keys in a script
- comparing checksums of the output files between version generated with the current convert script and those generated with the old script. This is a more reliable test.
* revert changes
* mistake
* consistency change for CI
* make style
* doc fixes
* more fixes
* experimenting with masking out pad token
* checkpoint
* Batched generation with multi-images working for 1B models. Will test 7B next.
* Device fix.
* Writing changes to modular, previous ones were written to modeling just for quick testing.
* Using passed processor attention mask (only in modeling for now)
* Matching performance done in the non-standard way
* Working version of batched generation. Will change how some args are passed to make it more similar to language case
* More compliant version of the code
* Removed duplicated `_prepare_4d_causal_attention_mask_with_cache_position`
* Updating modular file, making masked filling with paddings more efficient
* Slightly more efficient version
* Modifying JanusVisionModel to be a wrapper
* Fixing test to comply with new names
* Modular overhaul
* More refactoring
* - Changing JanusVisionModel back
- Changing forward pass
- Adding boi token to the comparison
* - Removing whole context model_ids
- Using inherited implementation of prepare_inputs_for_generation
* Moving the way boi token is passed to the model
* Fixing sdpa test
* Minor changes
* testing changes
* Minor fix
* - Adding postprocessing test
- checking values of generated image on integration test
* changes
* Removing pooled attention vision module, fixing convert script as a consequence
* More changes
* Fixes
* Draft after merge
* Bug fixes
* More bug fix
* Fixing docs
* Nits
* Refactor return dict
* Moving image post processing test to main processor post process
* Passing guidance_scale as kwarg
* make style
* 🔥 refactor
* make style
* Update and green CI
* Nits and tests update
* up
* Added MID block
* fix
* Dead code
* update testcase
* update
* model_id change
* init_weight changes
---------
Co-authored-by: hsilva664 <metallic-silver@hotmail.com>
* Fix mamba2 grouped support in bamba torch path
* patch zamba2 and mamba2
* Add a unit test for grouped SSD
* add comment for the new unit test
* add output_size arg value to repeat_interleave calls
* Add comment
* added efficientnet image preprocessor but tests fail
* ruff checks pass
* ruff formatted
* properly pass rescale_offset through the functions
* - corrected indentation, ordering of methods
- reshape test passes when casted to float64
- equivalence test doesn't pass
* all tests now pass
- changes order of rescale, normalize acc to slow
- rescale_offset defaults to False acc to slow
- resample was causing difference in fast and slow. Changing test to bilinear resolves this difference
* ruff reformat
* F.InterpolationMode.NEAREST_EXACT gives TypeError: Object of type InterpolationMode is not JSON serializable
* fixes offset not being applied when do_rescale and do_normalization are both true
* - using nearest_exact sampling
- added tests for rescale + normalize
* resolving reviews
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
* update
* apply suggestion
* fix tests for main branch
* remove unused logger
* add special tokens in tests
* nit
* fix more tests
* fix test
* pg also
Make Ignored Columns Value Error More Informative
Included forward method signature columns in the ValueError so end users will know what columns are expected to be passed to the model in addition to those which are ignored.
* initial documentation
* rename mask to attention_mask
* smaller tests
* fixup
* fix copies
* move to time series section
* sort docs
* isort fix
* batch_size is not a configuration
* rename to TimesFMModelForPrediction
* initial script
* add check_outputs
* remove dropout_rate
* works with torch.Tensor inputs
* rename script
* fix docstrings
* fix freq when window_size is given
* add loss
* fix _quantile_loss
* formatting
* fix isort
* add weight init
* add support for sdpa and flash_attention_2
* fixes for flash_attention
* formatting
* remove flash_attention
* fix tests
* fix file name
* fix quantile loss
* added initial TimesFMModelIntegrationTests
* fix formatting
* fix import order
* fix _quantile_loss
* add doc for SDPA
* use timesfm 2.0
* bug fix in timesfm decode function.
* compare mean forecasts
* refactor type hints, use CamelCase
* consolidate decode func
* more readable code for weight conversion
* fix-copies
* simpler init
* renaem TimesFmMLP
* use T5LayerNorm
* fix tests
* use initializer_range
* TimesFmModel instead of TimesFmDecoder
* TimesFmPositionalEmbedding takes config for its init
* 2.0-500m-pytorch default configs
* use TimesFmModel
* fix formatting
* ignore TimesFmModel for testing
* fix docstring
* override generate as its not needed
* add doc strings
* fix logging
* add docstrings to output data classes
* initial copy from t5
* added config and attention layers
* add TimesFMPositionalEmbedding
* calcuate scale_factor once
* add more configs and TimesFMResidualBlock
* fix input_dims
* standardize code format with black
* remove unneeded modules
* TimesFM Model
* order of imports
* copy from Google official implementation
* remove covariate forecasting
* Adapting TimesFM to HF format
* restructing in progress
* adapted to HF convention
* timesfm test
* the model runs
* fixing unit tests
* fixing unit tests in progress
* add post_init
* do not change TimesFMOutput
* fixing unit tests
* all unit tests passed
* remove timesfm_layers
* add intermediate_size and initialize with config
* initial documentation
* rename mask to attention_mask
* smaller tests
* fixup
* fix copies
* move to time series section
* sort docs
* isort fix
* batch_size is not a configuration
* rename to TimesFMModelForPrediction
* initial script
* add check_outputs
* remove dropout_rate
* works with torch.Tensor inputs
* rename script
* fix docstrings
* fix freq when window_size is given
* add loss
* fix _quantile_loss
* formatting
* fix isort
* add weight init
* add support for sdpa and flash_attention_2
* fixes for flash_attention
* formatting
* remove flash_attention
* fix tests
* fix file name
* fix quantile loss
* added initial TimesFMModelIntegrationTests
* fix formatting
* fix import order
* fix _quantile_loss
* add doc for SDPA
* use timesfm 2.0
* bug fix in timesfm decode function.
* compare mean forecasts
* refactor type hints, use CamelCase
* consolidate decode func
* more readable code for weight conversion
* fix-copies
* simpler init
* renaem TimesFmMLP
* use T5LayerNorm
* fix tests
* use initializer_range
* TimesFmModel instead of TimesFmDecoder
* TimesFmPositionalEmbedding takes config for its init
* 2.0-500m-pytorch default configs
* use TimesFmModel
* fix formatting
* ignore TimesFmModel for testing
* fix docstring
* override generate as its not needed
* add doc strings
* fix logging
* add docstrings to output data classes
* add _CHECKPOINT_FOR_DOC
* fix comments
* Revert "fix comments"
This reverts commit 8deeb3e191b3671bc1d74dbfe77b736a066c3d34.
* add _prepare_4d_attention_mask
* we do not have generative model classes
* use Cache
* return past_key_values
* modules initialized with config only
* update year
* Update docs/source/en/model_doc/timesfm.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* add layer_idx to cache
* modular timesfm
* fix test
* unwrap sequential class
* fix toctree
* remove TimesFmOnnxConfig
* fix modular
* remove TimesFmStackedDecoder
* split qkv layer into individual layers
* rename projection layers
* use ALL_ATTENTION_FUNCTIONS
* is_causal is True
* rename config
* does not support flash_attn_2
* formatting
* fix typo in docsstring
* rename inputs
* add time series mapping
* Update src/transformers/models/olmo2/modeling_olmo2.py
* Update src/transformers/models/moonshine/modeling_moonshine.py
* use updated arguments
* fix class name
* add MODEL_FOR_TIME_SERIES_PREDICTION_MAPPING
* isort
* consolidate _preprocess into forward
* fix a typo
* fix a typo
* fix toc
* fix modular
* remove aaserts
* use self.config._attn_implementation
* move to _postprocess_output
* remove timesfm_get_large_negative_number
* use view unstead of multiple unsqueeze
* make helpers static methods of the Model
* use to_tuple
* use to_tuple if not return_dict
* remove unused intitialization block as its incorporated in nn.Linear
* remove unused num_key_value_groups
* use the same convention as the masking method
* update modular
* do not use unsqueeze
* use view instead of unsqueeze
* use buffer for inv_timescales
* formatting
* modular conversion
* remove unneeded intialization
* add missing docstrings
* remove cache
* use simple_eager_attention_forward
* support tp_plan
* support for flex and flash attention masks
* Revert "support for flex and flash attention masks"
This reverts commit def36c4fcf31599b3f4937c9334b7da1a20132c3.
* fix device
* fix tests on gpu
* remove unsued large model test
* removed unneeded comments
* add example usage
* fix style
* add import
* Update docs/source/en/model_doc/timesfm.md
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* inherit from LlamaRMSNorm
* use can_return_tuple decorator
* remvoe return_dict
* fix year
* Update docs/source/en/model_doc/timesfm.md
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* pretrained does not inherit from GenerationMixin
* use model for integration test
---------
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Rajat Sen <rsen91@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* fix: Restore explicit error surfacing for unexpected hub exceptions
Prior to PR #36033, unexpected exceptions (e.g., ModuleNotFoundError) during hub model loading were not swallowed silently. They either matched specific except blocks or were raised.
After #36033, a catch-all except Exception block was introduced without a fallback else, causing unknown errors to be silently ignored and leading to misleading downstream behavior.
This commit adds an `else: raise e` to ensure only explicitly handled exceptions are suppressed. All others are surfaced, restoring pre-4.50 behavior and aiding in debugging and dependency visibility.
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* Add MLCD model
* Update codes for auto-mapping
* Add test scripts for MLCD
* Update doc for MLCD model
* Fix import error
* Fix import error
* Fix CI error for attention_outputs
* Fix code style for CI
* Fix code style for CI
* Fix code style for CI
* Fix code style for CI
* Fix code style for CI
* Fix CI error for initialization
* Fix code style for CI
* Fix code style for CI
* Reformat codes and docs for CI test
* Reformat codes and docs for CI test
* Remove unused attributes for CI test
* Fix style for CI test
* List MLCD in flash_attn doc
* Fix: typos, modulars, refactors from suggestions
* Refactoring convert_mlcd_weights_to_hf.py from suggestions
* Fix: docs conflicts
* Fix error for CI test
* Fix style for CI test
* Add integration test for MLCD
* Refactoring by class inheritance
* Fix: refactor attention interface, adjust codes
* Fix: merging conflicts
* Fix: merging conflicts
* Fix: style for CI test
* Fix: style for CI test
* Fix: set test_resize_embeddings to be False
* Fix: initializer for CI test
* Fix: conflicts, CI test, warning and refactoring
* Fix: merging conflicts
* Refactor
* Update docs
* Fix mistakes
* Remove unused args and fix multi-gpu error
* Revert position_embeddings
* Solve conflicts
* Solve conflicts
* Remove dummy
* Update _init_weights
* Update _init_weights
* Update _init_weights for CI test
2025-04-15 11:33:09 +01:00
1575 changed files with 107775 additions and 68966 deletions
- run:if [[ "$CIRCLE_PULL_REQUEST" == "" && "$CIRCLE_BRANCH" != "main" && "$CIRCLE_BRANCH" != *-release ]]; then echo "Not a PR, not the main branch and not a release branch, skip test!"; circleci-agent step halt; fi
- run:if [[ "$(cat pr_number.txt)" == "" && "$CIRCLE_BRANCH" != "main" && "$CIRCLE_BRANCH" != *-release ]]; then echo "Not a PR, not the main branch and not a release branch, skip test!"; circleci-agent step halt; fi
gh pr comment $PR_NUMBER --repo $REPO --body "Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the \`Ready for review\` button (at the bottom of the PR page). This will assign reviewers and trigger CI."
@ -26,7 +26,7 @@ There are two main venues to receive support: [the forums](https://discuss.huggi
[The user forums](https://discuss.huggingface.co/) are supported by the wide community of the library users and backed up by developers when needed.
If you have a difficulty with deploying this library or some questions, or you'd like to discuss a new feature, please first consider discussing those things at the forums. Only when you feel your subject matter has been crystalized and you still need support from the library developers do proceed to file an [issue](https://github.com/huggingface/transformers/issues).
If you have a difficulty with deploying this library or some questions, or you'd like to discuss a new feature, please first consider discussing those things at the forums. Only when you feel your subject matter has been crystallized and you still need support from the library developers do proceed to file an [issue](https://github.com/huggingface/transformers/issues).
In particular all "Please explain" questions or objectively very user-specific feature requests belong to the forums. Here are some example of such questions:
@ -78,7 +78,6 @@ Create and activate a virtual environment with [venv](https://docs.python.org/3/
# venv
python-mvenv.my-env
source.my-env/bin/activate
# uv
uvvenv.my-env
source.my-env/bin/activate
@ -88,10 +87,10 @@ Install Transformers in your virtual environment.
```py
# pip
pipinstalltransformers
pipinstall"transformers[torch]"
# uv
uvpipinstalltransformers
uvpipinstall"transformers[torch]"
```
Install Transformers from source if you want the latest changes in the library or are interested in contributing. However, the *latest* version may not be stable. Feel free to open an [issue](https://github.com/huggingface/transformers/issues) if you encounter an error.
@ -99,7 +98,12 @@ Install Transformers from source if you want the latest changes in the library o
@ -161,7 +161,7 @@ The downside is that if you aren't used to them, it may take some time to get us
Run the command below to start and complete the questionnaire with some basic information about the new model. This command jumpstarts the process by automatically generating some model code that you'll need to adapt.
```bash
transformers-cli add-new-model-like
transformers add-new-model-like
```
## Create a pull request
@ -292,7 +292,7 @@ Once you're able to run the original checkpoint, you're ready to start adapting
## Adapt the model code
The `transformers-cli add-new-model-like` command should have generated a model and configuration file.
The `transformers add-new-model-like` command should have generated a model and configuration file.
@ -551,10 +551,10 @@ While this example doesn't include an image processor, you may need to implement
If you do need to implement a new image processor, refer to an existing image processor to understand the expected structure. Slow image processors ([`BaseImageProcessor`]) and fast image processors ([`BaseImageProcessorFast`]) are designed differently, so make sure you follow the correct structure based on the processor type you're implementing.
Run the following command (only if you haven't already created the fast image processor with the `transformers-cli add-new-model-like` command) to generate the necessary imports and to create a prefilled template for the fast image processor. Modify the template to fit your model.
Run the following command (only if you haven't already created the fast image processor with the `transformers add-new-model-like` command) to generate the necessary imports and to create a prefilled template for the fast image processor. Modify the template to fit your model.
This command will generate the necessary imports and provide a pre-filled template for the fast image processor. You can then modify it to fit your model's needs.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Utilizing the @auto_docstring Decorator
The `@auto_docstring` decorator in the Hugging Face Transformers library helps generate docstrings for model classes and their methods, which will be used to build the documentation for the library. It aims to improve consistency and reduce boilerplate by automatically including standard argument descriptions and allowing for targeted overrides and additions.
---
## 📜 How it Works
The `@auto_docstring` decorator constructs docstrings by:
1.**Signature Inspection:** It inspects the signature (arguments, types, defaults) of the decorated class's `__init__` method or the decorated function.
2.**Centralized Docstring Fetching:** It retrieves predefined docstrings for common arguments (e.g., `input_ids`, `attention_mask`) from internal library sources (like `ModelArgs` or `ImageProcessorArgs` in `utils/args_doc.py`).
3.**Overriding or Adding Arguments Descriptions:**
* **Direct Docstring Block:** It incorporates custom docstring content from an `r""" """` (or `""" """`) block below the method signature or within the `__init__` docstring. This is for documenting new arguments or overriding standard descriptions.
* **Decorator Arguments (`custom_args`):** A `custom_args` docstring block can be passed to the decorator to provide docstrings for specific arguments directly in the decorator call. This can be used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.
4.**Adding Classes and Functions Introduction:**
* **`custom_intro` argument:** Allows prepending a custom introductory paragraph to a class or function docstring.
* **Automatic Introduction Generation:** For model classes with standard naming patterns (like `ModelForCausalLM`) or belonging to a pipeline, the decorator automatically generates an appropriate introductory paragraph using `ClassDocstring` in `utils/args_doc.py` as the source.
5.**Templating:** The decorator uses a templating system, allowing predefined docstrings to include dynamic information deduced from the `auto_modules` of the library, such as `{{processor_class}}` or `{{config_class}}`.
6.**Deducing Relevant Examples:** The decorator attempts to find appropriate usage examples based on the model's task or pipeline compatibility. It extracts checkpoint information from the model's configuration class to provide concrete examples with real model identifiers.
7.**Adding Return Value Documentation:** For methods like `forward`, the decorator can automatically generate the "Returns" section based on the method's return type annotation. For example, for a method returning a `ModelOutput` subclass, it will extracts field descriptions from that class's docstring to create a comprehensive return value description. A custom `Returns` section can also be manually specified in the function docstring block.
8.**Unrolling Kwargs Typed With Unpack Operator:** For specific methods (defined in `UNROLL_KWARGS_METHODS`) or classes (defined in `UNROLL_KWARGS_CLASSES`), the decorator processes `**kwargs` parameters that are typed with `Unpack[KwargsTypedDict]`. It extracts the documentation from the TypedDict and adds each parameter to the function's docstring. Currently, this functionality is only supported for `FastImageProcessorKwargs`.
---
## 🚀 How to Use @auto_docstring
### 1. Importing the Decorator
Import the decorator into your modeling file:
```python
from...utilsimportauto_docstring
```
### 2. Applying to Classes
Place `@auto_docstring` directly above the class definition. It uses the `__init__` method's signature and its docstring for parameter descriptions.
*`@auto_docstring` retrieves descriptions from a central source. Do not redefine these locally if their description and shape are the same as in `args_doc.py`.
2.**New or Custom Arguments:**
* **Primary Method:** Document these within an `r""" """` docstring block following the signature (for functions) or in the `__init__` method's docstring (for class parameters).
* **Format:**
```
argument_name (`type`, *optional*, defaults to `X`):
Description of the argument.
Explain its purpose, expected shape/type if complex, and default behavior.
This can span multiple lines.
```
* Include `type` in backticks.
* Add "*optional*" if the argument is not required (has a default value).
* Add "defaults to `X`" if it has a default value (no need to specify "defaults to `None`" if the default value is `None`).
3. **Overriding Standard Arguments:**
* If a standard argument behaves differently (e.g., different expected shape, model-specific behavior), provide its complete description in the local `r""" """` docstring. This local definition takes precedence.
* The `labels` argument is often customized per model and typically requires a specific docstring.
4. **Using Decorator Arguments for Overrides or New Arguments (`custom_args`):**
* New or custom arguments docstrings can also be passed to `@auto_docstring` as a `custom_args` argument. This can be used to define the docstring block for new arguments once if they are repeated in multiple places in the modeling file.
---
### Usage with [modular files](./modular_transformers)
When working with modular files, follow these guidelines for applying the `@auto_docstring` decorator:
- **For standalone models in modular files:**
Apply the `@auto_docstring` decorator just as you would in regular modeling files.
- **For models inheriting from other library models:**
- When inheriting from a parent model, decorators (including `@auto_docstring`) are automatically carried over to the generated modeling file without needing to add them in your modular file.
- If you need to modify the `@auto_docstring` behavior, apply the customized decorator in your modular file, making sure to *include all other decorators* that were present on the original function/class.
> **Warning**: When overriding any decorator in a modular file, you must include ALL decorators that were applied to that function/class in the parent model. If you only override some decorators, the others won't be included in the generated modeling file.
**Note**: The `check_auto_docstrings` tool doesn't check modular files directly, but it will check (and modify when using `--fix_and_overwrite`) the generated modeling files. If issues are found in the generated files, you'll need to update your modular files accordingly.
---
## ✅ Checking Your Docstrings with `check_auto_docstrings`
The library includes a utility script to validate docstrings. This check is typically run during Continuous Integration (CI).
#### What it Checks:
* **Decorator Presence:** Ensures `@auto_docstring` is applied to relevant model classes and public methods. (TODO)
* **Argument Completeness & Consistency:**
* Flags arguments in the signature that are not known standard arguments and lack a local description.
* Ensures documented arguments exist in the signature. (TODO)
* Verifies that types and default values in the docstring match the signature. (TODO)
* **Placeholder Detection:** Reminds you to complete placeholders like `<fill_type>` or `<fill_docstring>`.
* **Formatting:** Adherence to the expected docstring style.
#### Running the Check Locally:
Run this check locally before committing. The common command is:
```bash
make fix-copies
```
Alternatively, to only perform docstrings and auto-docstring checks, you can use:
```bash
python utils/check_docstrings.py # to only check files included in the diff without fixing them
# Or: python utils/check_docstrings.py --fix_and_overwrite # to fix and overwrite the files in the diff
# Or: python utils/check_docstrings.py --fix_and_overwrite --check_all # to fix and overwrite all files
```
#### Workflow with the Checker:
1. Add `@auto_docstring(...)` to the class or method.
2. For new, custom, or overridden arguments, add descriptions in an `r""" """` block.
3. Run `make fix-copies` (or the `check_docstrings.py` utility).
* For unrecognized arguments lacking documentation, the utility will create placeholder entries.
4. Manually edit these placeholders with accurate types and descriptions.
5. Re-run the check to ensure all issues are resolved.
---
## 🔑 Key Takeaways & Best Practices
* Use `@auto_docstring` for new PyTorch model classes (`PreTrainedModel` subclasses) and their primary for methods (e.g., `forward`, `get_text_features` etc.).
* For classes, the `__init__` method's docstring is the main source for parameter descriptions when using `@auto_docstring` on the class.
* Rely on standard docstrings; do not redefine common arguments unless their behavior is different in your specific model.
* Document new or custom arguments clearly.
* Run `check_docstrings` locally and iteratively.
By following these guidelines, you help maintain consistent and informative documentation for the Hugging Face Transformers library 🤗.
@ -25,22 +25,28 @@ Check model leaderboards like [OpenLLM](https://hf.co/spaces/HuggingFaceH4/open_
This guide shows you how to quickly start chatting with Transformers from the command line, how build and format a conversation, and how to chat using the [`TextGenerationPipeline`].
## transformers-cli
## transformers CLI
Chat with a model directly from the command line as shown below. It launches an interactive session with a model. Enter `clear` to reset the conversation, `exit` to terminate the session, and `help` to display all the command options.
After you've [installed Transformers](./installation.md), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session.
For a full list of options, run the command below.
```bash
transformers-cli chat -h
transformers chat -h
```
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating).
@ -20,18 +20,22 @@ A decoding strategy informs how a model should select the next generated token.
This guide will help you understand the different decoding strategies available in Transformers and how and when to use them.
## Greedy search
## Basic decoding methods
Greedy search is the default decoding strategy. It selects the next most likely token at each step. Unless specified in [`GenerationConfig`], this strategy generates a maximum of 20 tokens.
These are well established decoding methods, and should be your starting point for text generation tasks.
Greedy search works well for tasks with relatively short outputs. However, it breaks down when generating longer sequences because it begins to repeat itself.
### Greedy search
Greedy search is the default decoding strategy. It selects the next most likely token at each step. Unless specified in [`GenerationConfig`], this strategy generates a maximum of 20 new tokens.
Greedy search works well for tasks with relatively short outputs where creativity is not a priority. However, it breaks down when generating longer sequences because it begins to repeat itself.
'Hugging Face is an open-source company that provides a suite of tools and services for building, deploying, and maintaining natural language processing'
```
## Contrastive search
### Sampling
[Contrastive search](https://huggingface.co/papers/2202.06417) is a decoding strategy that aims to reduce repetition even while generating longer sequences. This strategy compares how similar a generated token is against previous tokens, and if they're more similar, a penalty is applied.
Sampling, or multinomial sampling, randomly selects a token based on the probability distribution over the entire model's vocabulary (as opposed to the most likely token, as in greedy search). This means every token with a non-zero probability has a chance to be selected. Sampling strategies reduce repetition and can generate more creative and diverse outputs.
Enable contrastive search with the `penalty_alpha` and `top_k` parameters. The `penalty_alpha` manages the penalty applied and `top_k` is the number of most likely tokens to return.
Enable multinomial sampling with `do_sample=True` and `num_beams=1`.
```py
importtorch
@ -55,14 +59,14 @@ inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt"
'Hugging Face is an open-source company that provides a platform for building and deploying AI models.\nHugging Face is an open-source company that provides a platform for building and deploying AI models. The platform allows developers to build and deploy AI models, as well as collaborate with other developers.\nHugging Face was founded in 2019 by Thibault Wittemberg and Clément Delangue. The company is based in Paris, France.\nHugging Face has'
'Hugging Face is an open-source company 🤗\nWe are open-source and believe that open-source is the best way to build technology. Our mission is to make AI accessible to everyone, and we believe that open-source is the best way to achieve that.'
```
## Beam search
### Beam search
Beam search keeps track of several generated sequences (beams) at each time step. After a certain number of steps, it selects the sequence with the highest *overall* probability. Unlike greedy search, this strategy can "look ahead" and pick a sequence with a higher probability overall even if the initial tokens have a lower probability.
Beam search keeps track of several generated sequences (beams) at each time step. After a certain number of steps, it selects the sequence with the highest *overall* probability. Unlike greedy search, this strategy can "look ahead" and pick a sequence with a higher probability overall even if the initial tokens have a lower probability. It is best suited for input-grounded tasks, like describing an image or speech recognition. You can also use `do_sample=True` with beam search to sample at each step, but beam search will still greedily prune out low probability sequences between steps.
> [!TIP]
> Check out the [beam search visualizer](https://huggingface.co/spaces/m-ric/beam_search_visualizer) to see how beam search works.
"['Hugging Face is an open-source company that develops and maintains the Hugging Face platform, which is a collection of tools and libraries for building and deploying natural language processing (NLP) models. Hugging Face was founded in 2018 by Thomas Wolf']"
```
## Diverse beam search
## Advanced decoding methods
[Diverse beam search](https://hf.co/papers/1610.02424) is a variant of beam search that produces more diverse output candidates to choose from. This strategy measures the dissimilarity of sequences and a penalty is applied if sequences are too similar. To avoid high computation costs, the number of beams is divided into groups.
Advanced decoding methods aim at either tackling specific generation quality issues (e.g. repetition) or at improving the generation throughput in certain situations. These techniques are more complex, and may not work correctly with all models.
Enable diverse beam search with the `num_beams`, `num_beam_groups` and `diversity_penalty` parameters (the `num_beams` parameter should be divisible by `num_beam_groups`).
'Hugging Face is an open-source company 🤗\nWe are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a'
```
## Multinomial sampling
Search methods selects the most likely tokens. Sampling, or multinomial sampling, randomly selects a token based on the probability distribution over the entire models vocabulary. This means every token with a non-zero probability has a chance to be selected. Sampling strategies reduce repetition and can generate more creative and diverse outputs.
Enable multinomial sampling with `do_sample=True` and `num_beams=1`.
'Hugging Face is an open-source company 🤗\nWe are open-source and believe that open-source is the best way to build technology. Our mission is to make AI accessible to everyone, and we believe that open-source is the best way to achieve that.'
```
## Beam search multinomial sampling
This decoding strategy is a combination of beam search and multinomial sampling. It generates multiple beams and uses a sampling strategy for each beam.
Enable beam search multinomial sampling by setting `num_beams` to a value greater than 1 and `do_sample=True`.
'Hugging Face is an open-source company 100% dedicated to making AI more accessible. We believe that AI should be available to everyone, and we’re working hard to make that a reality.\nWe’re a team of passionate engineers, designers,'
```
## Speculative decoding
### Speculative decoding
[Speculative](https://hf.co/papers/2211.17192) or assistive decoding isn't a search or sampling strategy. Instead, speculative decoding adds a second smaller model to generate candidate tokens. The main model verifies the candidate tokens in a single `forward` pass, which speeds up the decoding process overall. This method is especially useful for LLMs where it can be more costly and slower to generate tokens. Refer to the [speculative decoding](./llm_optims#speculative-decoding) guide to learn more.
[Prompt lookup decoding](./llm_optims#prompt-lookup-decoding) is a variant of speculative decoding that uses overlapping n-grams as the candidate tokens. It works well for input-grounded tasks such as summarization. Refer to the [prompt lookup decoding](./llm_optims#prompt-lookup-decoding) guide to learn more.
Universal assisted decoding (UAD) enables the main and assistant models to use different tokenizers. The main models input tokens are re-encoded into assistant model tokens. Candidate tokens are generated in the assistant encoding which are re-encoded into the main model candidate tokens. The candidate tokens are verified as explained in [speculative decoding](#speculative-decoding).
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```
## DoLa
### Contrastive search
[Contrastive search](https://huggingface.co/papers/2202.06417) is a decoding strategy that aims to reduce repetition even while generating longer sequences. This strategy compares how similar a generated token is against previous tokens, and if they're more similar, a penalty is applied.
Enable contrastive search with the `penalty_alpha` and `top_k` parameters. The `penalty_alpha` manages the penalty applied and `top_k` is the number of most likely tokens to return.
'Hugging Face is an open-source company that provides a platform for building and deploying AI models.\nHugging Face is an open-source company that provides a platform for building and deploying AI models. The platform allows developers to build and deploy AI models, as well as collaborate with other developers.\nHugging Face was founded in 2019 by Thibault Wittemberg and Clément Delangue. The company is based in Paris, France.\nHugging Face has'
```
### DoLa
[Decoding by Contrasting Layers (DoLa)](https://hf.co/papers/2309.03883) is a contrastive decoding strategy for improving factuality and reducing hallucination. This strategy works by contrasting the logit differences between the final and early layers. As a result, factual knowledge localized to particular layers are amplified. DoLa is not recommended for smaller models like GPT-2.
[Diverse beam search](https://hf.co/papers/1610.02424) is a variant of beam search that produces more diverse output candidates to choose from. This strategy measures the dissimilarity of sequences and a penalty is applied if sequences are too similar. To avoid high computation costs, the number of beams is divided into groups.
Enable diverse beam search with the `num_beams`, `num_beam_groups` and `diversity_penalty` parameters (the `num_beams` parameter should be divisible by `num_beam_groups`).
'Hugging Face is an open-source company 🤗\nWe are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a'
```
## Custom decoding methods
Custom decoding methods enable specialized generation behavior such as the following:
- have the model continue thinking if it is uncertain;
- roll back generation if the model gets stuck;
- handle special tokens with custom logic;
- enhanced input preparation for advanced models;
We enable custom decoding methods through model repositories, assuming a specific model tag and file structure (see subsection below). This feature is an extension of [custom modeling code](./models.md#custom-models) and, like such, requires setting `trust_remote_code=True`.
If a model repository holds a custom decoding method, the easiest way to try it out is to load the model and generate with it:
<!-- TODO before merging: 1) better repo name (use a `generate-community` org?) 2) prettify the repo -->
'The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is'
```
Model repositories with custom decoding methods have a special property: their decoding method can be loaded from **any** model through [`~GenerationMixin.generate`]'s `custom_generate` argument. This means anyone can create and share their custom generation method to potentially work with any Transformers model, without requiring users to install additional Python packages.
'The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is'
```
You should read the `README.md` file of the repository containing the custom generation strategy to see what the new arguments and output type differences are, if they exist. Otherwise, you can assume it works like the base [`~GenerationMixin.generate`] method.
> [!TIP]
> You can find all custom decoding methods by [searching for their custom tag.](https://huggingface.co/models?other=custom_generate), `custom_generate`
Consider the Hub repository [transformers-community/custom_generate_example](https://huggingface.co/transformers-community/custom_generate_example) as an example. The `README.md` states that it has an additional input argument, `left_padding`, which adds a number of padding tokens before the prompt.
'<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>The quick brown fox jumps over the lazy dog.\n\nThe sentence "The quick'
```
If the custom method has pinned Python requirements that your environment doesn't meet, you'll get an exception about missing requirements. For instance, [transformers-community/custom_generate_bad_requirements](https://huggingface.co/transformers-community/custom_generate_bad_requirements) has an impossible set of requirements defined in its `custom_generate/requirements.txt` file, and you'll see the error message below if you try to run it.
```
ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
foo (installed: None)
bar==0.0.0 (installed: None)
torch>=99.0 (installed: 2.6.0)
```
Updating your Python requirements accordingly will remove this error message.
### Creating a custom decoding method
To create a new decoding method, you need to create a new [**Model**](https://huggingface.co/new) repository and push a few files into it.
1. The model you've designed your decoding method with.
2.`custom_generate/generate.py`, which contains all the logic for your custom decoding method.
3.`custom_generate/requirements.txt`, used to optionally add new Python requirements and/or lock specific versions to correctly use your method.
4.`README.md`, where you should add the `custom_generate` tag and document any new arguments or output type differences of your custom method here.
After you've added all required files, your repository should look like this
```
your_repo/
├── README.md # include the 'custom_generate' tag
├── config.json
├── ...
└── custom_generate/
├── generate.py
└── requirements.txt
```
#### Adding the base model
The starting point for your custom decoding method is a model repository just like any other. The model to add to this repository should be the model you've designed your method with, and it is meant to be part of a working self-contained model-generate pair. When the model in this repository is loaded, your custom decoding method will override `generate`. Don't worry -- your decoding method can still be loaded with any other Transformers model, as explained in the section above.
If you simply want to copy an existing model, you can do
This is the core of your decoding method. It *must* contain a method named `generate`, and this method *must* contain a `model` argument as its first argument. `model` is the model instance, which means you have access to all attributes and methods in the model, including the ones defined in [`GenerationMixin`] (like the base `generate` method).
> [!WARNING]
> `generate.py` must be placed in a folder named `custom_generate`, and not at the root level of the repository. The file paths for this feature are hardcoded.
Under the hood, when the base [`~GenerationMixin.generate`] method is called with a `custom_generate` argument, it first checks its Python requirements (if any), then locates the custom `generate` method in `generate.py`, and finally calls the custom `generate`. All received arguments and `model` are forwarded to your custom `generate` method.
This means your `generate` can have a mix of original and custom arguments (as well as a different output type) as shown below.
Follow the recommended practices below to ensure your custom decoding method works as expected.
- Feel free to reuse the logic for validation and input preparation in the original [`~GenerationMixin.generate`].
- Pin the `transformers` version in the requirements if you use any private method/attribute in `model`.
- You can add other files in the `custom_generate` folder, and use relative imports.
- Consider adding model validation, input validation, or even a separate test file to help users sanity-check your code in their environment.
#### requirements.txt
You can optionally specify additional Python requirements in a `requirements.txt` file inside the `custom_generate` folder. These are checked at runtime and an exception will be thrown if they're missing, nudging users to update their environment accordingly.
#### README.md
The root level `README.md` in the model repository usually describes the model therein. However, since the focus of the repository is the custom decoding method, we highly recommend to shift its focus towards describing the custom decoding method. In addition to a description of the method, we recommend documenting any input and/or output differences to the original [`~GenerationMixin.generate`]. This way, users can focus on what's new, and rely on Transformers docs for generic implementation details.
For discoverability, we highly recommend you to add the `custom_generate` tag to your repository. To do so, the top of your `README.md` file should look like the example below. After you push the file, you should see the tag in your repository!
```
---
library_name: transformers
tags:
- custom_generate
---
(your markdown content here)
```
Recommended practices:
- Document input and output differences in [`~GenerationMixin.generate`].
- Add self-contained examples to enable quick experimentation.
- Describe soft-requirements such as if the method only works well with a certain family of models.
## Resources
Read the [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate) blog post for an explanation of how common decoding strategies work.
@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Image processors
Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision or video model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
- [`~BaseImageProcessor.center_crop`] to resize an image
- [`~BaseImageProcessor.normalize`] or [`~BaseImageProcessor.rescale`] pixel values
Once the forward passes of two models have been traced by the debugger, one can compare the `json` output files. See below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong.
This feature will only work for torch-based models, and would require more work and case-by-case approach for say `jax`-based models that are usually compiled. Models relying heavily on external kernel calls may work, but trace will probably miss some things. Regardless, any python implementation that aims at mimicking another implementation can be traced once instead of reran N times with breakpoints.
If you pass `do_prune_layers=False` to your model debugger, ALL the layers will be outputted to `json`. Else, only the first and last layer will be shown. This is useful when some layers (typically cross-attention) appear only after N layers.
@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
The key-value (KV) vectors are used to calculate attention scores. For autoregressive models, KV scores are calculated *every* time because the model predicts one token at a time. Each prediction depends on the previous tokens, which means the model performs the same computations each time.
A KV *cache* stores these calculations so they can be reused without recomputing them. Efficient caching is crucial for optimizing model performance because it reduces computation time and improves response rates. Refer to the [Caching](./cache_explanation.md) doc for a more detailed explanation about how a cache works.
A KV *cache* stores these calculations so they can be reused without recomputing them. Efficient caching is crucial for optimizing model performance because it reduces computation time and improves response rates. Refer to the [Caching](./cache_explanation) doc for a more detailed explanation about how a cache works.
Transformers offers several [`Cache`] classes that implement different caching mechanisms. Some of these [`Cache`] classes are optimized to save memory while others are designed to maximize generation speed. Refer to the table below to compare cache types and use it to help you select the best cache for your use case.
@ -20,9 +20,13 @@ rendered properly in your Markdown viewer.
Text generation is the most popular application for large language models (LLMs). A LLM is trained to generate the next word (token) given some initial text (prompt) along with its own generated outputs up to a predefined length or when it reaches an end-of-sequence (`EOS`) token.
In Transformers, the [`~GenerationMixin.generate`] API handles text generation, and it is available for all models with generative capabilities.
In Transformers, the [`~GenerationMixin.generate`] API handles text generation, and it is available for all models with generative capabilities. This guide will show you the basics of text generation with [`~GenerationMixin.generate`] and some common pitfalls to avoid.
This guide will show you the basics of text generation with [`~GenerationMixin.generate`] and some common pitfalls to avoid.
> [!TIP]
> You can also chat with a model directly from the command line. ([reference](./conversations.md#transformers-cli))
[`~GenerationMixin.generate`] is a powerful tool that can be heavily customized. This can be daunting for a new users. This section contains a list of popular generation options that you can define in most text generation tools in Transformers: [`~GenerationMixin.generate`], [`GenerationConfig`], `pipelines`, the `chat` CLI, ...
| Option name | Type | Simplified description |
|---|---|---|
| `max_new_tokens` | `int` | Controls the maximum generation length. Be sure to define it, as it usually defaults to a small value. |
| `do_sample` | `bool` | Defines whether generation will sample the next token (`True`), or is greedy instead (`False`). Most use cases should set this flag to `True`. Check [this guide](./generation_strategies.md) for more information. |
| `temperature` | `float` | How unpredictable the next selected token will be. High values (`>0.8`) are good for creative tasks, low values (e.g. `<0.4`) for tasks that require "thinking". Requires `do_sample=True`. |
| `num_beams` | `int` | When set to `>1`, activates the beam search algorithm. Beam search is good on input-grounded tasks. Check [this guide](./generation_strategies.md) for more information. |
| `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. |
| `eos_token_id` | `List[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. |
## Pitfalls
The section below covers some common issues you may encounter during text generation and how to solve them.
@ -286,4 +304,4 @@ Take a look below for some more specific and specialized text generation librari
- [SynCode](https://github.com/uiuc-focal-lab/syncode): a library for context-free grammar guided generation (JSON, SQL, Python).
- [Text Generation Inference](https://github.com/huggingface/text-generation-inference): a production-ready server for LLMs.
- [Text generation web UI](https://github.com/oobabooga/text-generation-webui): a Gradio web UI for text generation.
- [logits-processor-zoo](https://github.com/NVIDIA/logits-processor-zoo): additional logits processors for controlling text generation.
- [logits-processor-zoo](https://github.com/NVIDIA/logits-processor-zoo): additional logits processors for controlling text generation.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Video Processor
A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch.
The video processor extends the functionality of image processors by allowing Vision Large Language Models (VLMs) to handle videos with a distinct set of arguments compared to images. It serves as the bridge between raw video data and the model, ensuring that input features are optimized for the VLM.
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't upadted your VLM, the processor will try to load video related configurations from a file named `preprocessing_config.json`.
### Usage Example
Here's an example of how to load a video processor with [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:
Currently, if using base image processor for videos, it processes video data by treating each frame as an individual image and applying transformations frame-by-frame. While functional, this approach is not highly efficient. Using `AutoVideoProcessor` allows us to take advantage of **fast video processors**, leveraging the [torchvision](https://pytorch.org/vision/stable/index.html) library. Fast processors handle the whole batch of videos at once, without iterating over each video or frame. These updates introduce GPU acceleration and significantly enhance processing speed, especially for tasks requiring high throughput.
Fast video processors are available for all models and are loaded by default when an `AutoVideoProcessor` is initialized. When using a fast video processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise. For even more speed improvement, we can compile the processor when using 'cuda' as device.
@ -57,6 +57,7 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This
- Embedding size E is different from hidden size H justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V being the vocab size). If E <H,ithaslessparameters.
@ -55,6 +55,7 @@ This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The
* mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
* permute sentences
* rotate the document to make it start at a specific token
- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
@ -81,10 +81,10 @@ print(f"The predicted token is: {predicted_token}")
```
</hfoption>
<hfoptionid="transformers-cli">
<hfoptionid="transformers CLI">
```bash
echo -e "Plants create [MASK] through a process known as photosynthesis."| transformers-cli run --task fill-mask --model google-bert/bert-base-uncased --device 0
echo -e "Plants create [MASK] through a process known as photosynthesis."| transformers run --task fill-mask --model google-bert/bert-base-uncased --device 0
```
</hfoption>
@ -256,4 +256,4 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
@ -36,6 +36,7 @@ This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The
- BioGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
- BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text as it can be observed in the run_generation.py example script.
- The model can take the `past_key_values` (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.
- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
<!--Copyright 2025 The BitNet Team and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# BitNet
## Overview
Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
Several versions of the model weights are available on Hugging Face:
* [**`microsoft/bitnet-b1.58-2B-4T`**](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T): Contains the packed 1.58-bit weights optimized for efficient inference. **Use this for deployment.**
* [**`microsoft/bitnet-b1.58-2B-4T-bf16`**](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16): Contains the master weights in BF16 format. **Use this only for training or fine-tuning purposes.**
* [**`microsoft/bitnet-b1.58-2B-4T-gguf`**](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf): Contains the model weights in GGUF format, compatible with the `bitnet.cpp` library for CPU inference.
### Model Details
* **Architecture:** Transformer-based, modified with `BitLinear` layers (BitNet framework).
* Uses Rotary Position Embeddings (RoPE).
* Uses squared ReLU (ReLU²) activation in FFN layers.
* No bias terms in linear or normalization layers.
* **Quantization:** Native 1.58-bit weights and 8-bit activations (W1.58A8).
* Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
* Activations are quantized to 8-bit integers using absmax quantization (per-token).
* **Crucially, the model was *trained from scratch* with this quantization scheme, not post-training quantized.**
* **Parameters:** ~2 Billion
* **Training Tokens:** 4 Trillion
***Context Length:** Maximum sequence length of **4096 tokens**.
**Recommendation:* For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage.
* **Training Stages:**
1.**Pre-training:** Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
2.**Supervised Fine-tuning (SFT):** Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
3.**Direct Preference Optimization (DPO):** Aligned with human preferences using preference pairs.
> Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library.
>
> The current execution paths within transformers do not contain the specialized, highly optimized computational kernels required to leverage the advantages of the BitNet architecture. Running the model via transformers will likely result in inference speeds and energy usage comparable to, or potentially worse than, standard full-precision models within this framework on both CPU and GPU.
>
> While you might observe reduced memory usage due to the quantized weights, the primary computational efficiency benefits are not accessible through this standard transformers usage path.
>
> For achieving the efficiency benefits demonstrated in the technical paper, you MUST use the dedicated C++ implementation: [bitnet.cpp](https://github.com/microsoft/BitNet).
echo -e "# Function to calculate the factorial of a number\ndef factorial(n):"| transformers-cli run --task text-generation --model meta-llama/CodeLlama-7b-hf --device 0
echo -e "# Function to calculate the factorial of a number\ndef factorial(n):"| transformers run --task text-generation --model meta-llama/CodeLlama-7b-hf --device 0
```
</hfoption>
@ -146,7 +146,7 @@ visualizer("""def func(a, b):
- Use the `<FILL_ME>` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself.
```py
from transformers import LlamaForCausalLM, CodeLlamaTokenizer
[ColPali](https://huggingface.co/papers/2407.01449) is a model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColPali treats each page as an image. It uses [Paligemma-3B](./paligemma) to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed embeddings. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
## Overview
You can find all the original ColPali checkpoints under the [ColPali](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.
The *ColPali* model was proposed in [ColPali: Efficient Document Retrieval with Vision Language Models](https://doi.org/10.48550/arXiv.2407.01449) by **Manuel Faysse***, **Hugues Sibille***, **Tony Wu***, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work lead by ILLUIN Technology.
> [!TIP]
> Click on the ColPali models in the right sidebar for more examples of how to use ColPali for image retrieval.
In our proposed *ColPali* approach, we leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. We train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
<hfoptionsid="usage">
<hfoptionid="image retrieval">
Using *ColPali* removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, etc.) of a document.
## Resources
- The *ColPali* arXiv paper can be found [here](https://doi.org/10.48550/arXiv.2407.01449). 📄
- The official blog post detailing ColPali can be found [here](https://huggingface.co/blog/manu/colpali). 📝
- The original model implementation code for the ColPali model and for the `colpali-engine` package can be found [here](https://github.com/illuin-tech/colpali). 🌎
- Cookbooks for learning to use the transformers-native version of *ColPali*, fine-tuning, and similarity maps generation can be found [here](https://github.com/tonywu71/colpali-cookbooks). 📚
This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) and [@yonigozlan](https://huggingface.co/yonigozlan).
## Usage
This example demonstrates how to use *ColPali* to embed both queries and images, calculate their similarity scores, and identify the most relevant matches. For a specific query, you can retrieve the top-k most similar images by selecting the ones with the highest similarity scores.
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to int4.
- [`~ColPaliProcessor.score_retrieval`] returns a 2D tensor where the first dimension is the number of queries and the second dimension is the number of images. A higher score indicates more similarity between the query and image.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Csm
## Overview
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model [released by Sesame](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice). It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
**Model Architecture:**
CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model [Mimi](./mimi.md), introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the [Sesame](https://huggingface.co/sesame/csm-1b) organization on Hugging Face.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# D-FINE
## Overview
The D-FINE model was proposed in [D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement](https://arxiv.org/abs/2410.13842) by
*We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.*
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
The original code can be found [here](https://github.com/Peterande/D-FINE).
@ -53,6 +53,7 @@ The original code for vision can be found [here](https://github.com/facebookrese
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.
- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
-Use [torch.jit.trace](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) to speedup inference. However, it will produce some mismatched elements. The difference between the original and traced model is 1e-4.
-The example below shows how to split the output tensor into:
- one embedding for the whole image, commonly referred to as a `CLS` token,
useful for classification and retrieval
- a set of local embeddings, one for each `14x14` patch of the input image,
useful for dense tasks, such as semantic segmentation
```py
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests
```py
from transformers import AutoImageProcessor, AutoModel
echo -e "I love using Hugging Face Transformers!"| transformers-cli run --task text-classification --model distilbert-base-uncased-finetuned-sst-2-english
echo -e "I love using Hugging Face Transformers!"| transformers run --task text-classification --model distilbert-base-uncased-finetuned-sst-2-english
```
</hfoption>
@ -213,7 +213,3 @@ echo -e "I love using Hugging Face Transformers!" | transformers-cli run --task
echo -e "This restaurant has amazing food."| transformers-cli run --task text-classification --model bhadresh-savani/electra-base-emotion --device 0
echo -e "This restaurant has amazing food."| transformers run --task text-classification --model bhadresh-savani/electra-base-emotion --device 0
```
</hfoption>
@ -96,12 +96,12 @@ echo -e "This restaurant has amazing food." | transformers-cli run --task text-c
```py
# Example of properly handling padding with attention masks
inputs = tokenizer(["Short text", "This is a much longer text that needs padding"],
padding=True,
inputs = tokenizer(["Short text", "This is a much longer text that needs padding"],
padding=True,
return_tensors="pt")
outputs = model(**inputs) # automatically uses the attention_mask
```
- When using the discriminator for a downstream task, you can load it into any of the ELECTRA model classes ([`ElectraForSequenceClassification`], [`ElectraForTokenClassification`], etc.).
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# FalconH1
## Overview
The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series in [this website](https://github.com/tiiuae/Falcon-H1).
## Contributors
This model was contributed by [DhiyaEddine](https://huggingface.co/DhiyaEddine), [ybelkada](https://huggingface.co/ybelkada), [JingweiZuo](https://huggingface.co/JingweiZuo), [IlyasChahed](https://huggingface.co/IChahed), and [MaksimVelikanov](https://huggingface.co/yellowvm).
The original code can be found [here](https://github.com/tiiuae/Falcon-H1).
## FalconH1Config
| Model | Depth | Dim | Attn Heads | KV | Mamba Heads | d_head | d_state | Ctx Len |
This HF implementation is contributed by [younesbelkada](https://github.com/younesbelkada) and [DhiaEddineRhaiem](https://github.com/dhiaEddineRhaiem).
The Gemma model was proposed in [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by Gemma Team, Google.
Gemma models are trained on 6T tokens, and released with 2 versions, 2b and 7b.
[Gemma](https://huggingface.co/papers/2403.08295) is a family of lightweight language models with pretrained and instruction-tuned variants, available in 2B and 7B parameters. The architecture is based on a transformer decoder-only design. It features Multi-Query Attention, rotary positional embeddings (RoPE), GeGLU activation functions, and RMSNorm layer normalization.
The abstract from the paper is the following:
The instruction-tuned variant was fine-tuned with supervised learning on instruction-following data, followed by reinforcement learning from human feedback (RLHF) to align the model outputs with human preferences.
*This work introduces Gemma, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations*
You can find all the original Gemma checkpoints under the [Gemma](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b) release.
Tips:
- The original checkpoints can be converted using the conversion script `src/transformers/models/gemma/convert_gemma_weights_to_hf.py`
> [!TIP]
> Click on the Gemma models in the right sidebar for more examples of how to apply Gemma to different language tasks.
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Younes Belkada](https://huggingface.co/ybelkada), [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi), [Pedro Cuenca](https://huggingface.co/pcuenq).
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class, and from the command line.
<hfoptionsid="usage">
<hfoptionid="Pipeline">
```py
importtorch
fromtransformersimportpipeline
pipeline=pipeline(
task="text-generation",
model="google/gemma-2b",
torch_dtype=torch.bfloat16,
device="cuda",
)
pipeline("LLMs generate text through a process known as",max_new_tokens=50)
echo -e "LLMs generate text through a process known as"| transformers run --task text-generation --model google/gemma-2b --device 0
```
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4.
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
- The original Gemma models support standard kv-caching used in many transformer-based language models. You can use use the default [`DynamicCache`] instance or a tuple of tensors for past key values during generation. This makes it compatible with typical autoregressive generation workflows.
```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4.
```python
@ -118,7 +118,7 @@ Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/bl
@ -28,7 +28,7 @@ rendered properly in your Markdown viewer.
The instruction-tuned variant was post-trained with knowledge distillation and reinforcement learning.
You can find all the original Gemma 3 checkpoints under the [Gemma 3](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) release.
You can find all the original Gemma 3 checkpoints under the [Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d) release.
> [!TIP]
> Click on the Gemma 3 models in the right sidebar for more examples of how to apply Gemma to different vision and language tasks.
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.
@ -46,8 +46,12 @@ The main differences compared to GPT2.
- Merge the key and value caches into one (this changes the format of layer_past/ present, does it risk creating problems?)
- Use the memory layout (self.num_heads, 3, self.head_dim) instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (prevents an overhead with the merged key and values, but makes the checkpoints incompatible with the original openai-community/gpt2 model).
You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575)
> [!NOTE]
> The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
## Combining Starcoder and Flash Attention 2
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# GraniteMoeHybrid
## Overview
The `GraniteMoeHybrid` model builds on top of `GraniteMoeSharedModel` and `Bamba`. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
# loop over the batch to print, in this example the batch size is 1
foriinoutput:
print(i)
```
This HF implementation is contributed by [Sukriti Sharma](https://huggingface.co/SukritiSharma) and [Alexander Brooks](https://huggingface.co/abrooks9944).
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# HGNet-V2
## Overview
A HGNet-V2 (High Performance GPU Net) image classification model.
HGNet arhtictecture was proposed in [HGNET: A Hierarchical Feature Guided Network for Occupancy Flow Field Prediction](https://arxiv.org/abs/2407.01097) by
Zhan Chen, Chen Tang, Lu Xiong
The abstract from the HGNET paper is the following:
*Predicting the motion of multiple traffic participants has always been one of the most challenging tasks in autonomous driving. The recently proposed occupancy flow field prediction method has shown to be a more effective and scalable representation compared to general trajectory prediction methods. However, in complex multi-agent traffic scenarios, it remains difficult to model the interactions among various factors and the dependencies among prediction outputs at different time steps. In view of this, we propose a transformer-based hierarchical feature guided network (HGNET), which can efficiently extract features of agents and map information from visual and vectorized inputs, modeling multimodal interaction relationships. Second, we design the Feature-Guided Attention (FGAT) module to leverage the potential guiding effects between different prediction targets, thereby improving prediction accuracy. Additionally, to enhance the temporal consistency and causal relationships of the predictions, we propose a Time Series Memory framework to learn the conditional distribution models of the prediction outputs at future time steps from multivariate time series. The results demonstrate that our model exhibits competitive performance, which ranks 3rd in the 2024 Waymo Occupancy and Flow Prediction Challenge.*
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
The original code can be found [here](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py).
@ -50,7 +50,7 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv
- Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
using [`Wav2Vec2CTCTokenizer`].
- The `head_mask` argument is ignored when using all attention implementation other than "eager". If you have a `head_mask` and want it to have effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`
The InternVL3 family of Visual Language Models was introduced in [InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models](https://huggingface.co/papers/2504.10479).
The abstract from the paper is the following:
*We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.*
<small> Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the <ahref="https://huggingface.co/OpenGVLab/InternVL3-1B">original checkpoint.</a></small>
<small> Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the <ahref="https://huggingface.co/OpenGVLab/InternVL3-1B">original checkpoint.</a></small>
This model was contributed by [yonigozlan](https://huggingface.co/yonigozlan).
The original code can be found [here](https://github.com/OpenGVLab/InternVL).
## Usage example
### Inference with Pipeline
Here is how you can use the `image-text-to-text` pipeline to perform inference with the `InternVL3` models in just a few lines of code:
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
```
### Inference on a single image
This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.
> [!NOTE]
> Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
```
### Text-only generation
This example shows how to generate text using the InternVL model without providing any image input.
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
```
### Batched multi-image input
This implementation of the InternVL models supports batched text-images inputs with different number of images for each text.
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
```
### Video input
InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
"user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace."]
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Janus
## Overview
The Janus Model was originally proposed in [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/abs/2410.13848) by DeepSeek AI team and later refined in [Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling](https://arxiv.org/abs/2501.17811). Janus is a vision-language model that can generate both image and text output, it can also take both images and text as input.
> [!NOTE]
> The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.
The abstract from the original paper is the following:
*In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.*
The abstract from the aforementioned `Janus-Pro` paper, released afterwards, is the following:
*In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strate (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.*
This model was contributed by [Yaswanth Gali](https://huggingface.co/yaswanthgali) and [Hugo Silva](https://huggingface.co/hugosilva664).
The original code can be found [here](https://github.com/deepseek-ai/Janus).
## Usage Example
### Single image inference
Here is the example of visual understanding with a single image.
> [!NOTE]
> Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.
Janus can perform inference with multiple images as input, where images can belong to the same prompt or different prompts in batched inference, where the model processes many conversations in parallel. Here is how you can do it:
@ -315,6 +315,10 @@ model = AutoModelForImageTextToText.from_pretrained(
[[autodoc]] LlavaNextProcessor
## LlavaNextModel
[[autodoc]] LlavaNextModel
## LlavaNextForConditionalGeneration
[[autodoc]] LlavaNextForConditionalGeneration
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.