* Add a switch to CB in case of paged cache
* Added paged as a valid cache implem
* Added a fallback on inputs_ids as a name
* Rookie mistake
* Removed paged from cache implems
* Added warning about some beam search args
* Moved up CB warning
* Fix EncoderDecoder cache
* Add the option for the ddp data tuples to have 2 elems
* Modify the order of the KV and sliding
* Adapted RAG and Whisper to new EncoderDecoderCache
* A single comma
* Remove kwargs in map
* Fixed order in manual injection cache test
* Slight changes to support legacy format
* Removed Nones
This commit addresses a noisy warning and improves the robustness of the base pipeline implementation.
- The device placement message in the pipeline base class has been changed from a `warning` to a `debug` log. This reduces log noise for users who are aware of their device setup, while still providing the information for debugging purposes.
- Additionally, potential `UnboundLocalError` exceptions in the `_pad` and `check_model_type` functions have been prevented by initializing variables before their conditional assignment.
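The `UnboundLocalError` fix above follows a common pattern. A minimal sketch, assuming a hypothetical `pick_padding_value` helper (an illustration only, not the actual `_pad` or `check_model_type` code):

```python
def pick_padding_value_buggy(key):
    # `value` is only bound inside the branches, so an unexpected key
    # reaches the return statement with `value` still unbound.
    if key == "input_ids":
        value = 0
    elif key == "attention_mask":
        value = 1
    return value


def pick_padding_value_fixed(key):
    # Initialize before the conditional assignment so every path binds `value`.
    value = None
    if key == "input_ids":
        value = 0
    elif key == "attention_mask":
        value = 1
    return value
```

With the initialization in place, an unexpected key yields `None` instead of raising, and the caller can decide how to handle it.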
* Add is_causal to KosmosTextAttention
* Move get target_dtype to be imported elsewhere
* Fix fp32 flash attention bug in bark
* Fix is_causal in mllama
* Fix fp32 issue on StableLM
* Fix repo-consistency
* add aux
* update
* update config to text_config
* use qwen data class to avoid repeat again
* format
* update
* use 1e-4
* update
* update for remove init
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
* toggle the serialization
* prob this fixes it
* fix tests
* typo
* delete legacy save entirely
* remove extra nesting in if
* revert test and serialize a public attr instead of private
* fix some test case failures caused by "`torch.compile` recompiled part of the forward pass" on XPU
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update comment
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Add logits_to_keep to CausalLM models
* Skip failing test for git model
* Remove unused return_dict from kosmos2 signature
* Revert BlipForQuestionAnswering
* start
* add the important fix
* continue
* big cleanup
* type hints
* add method
* fix typehints
* typehints
* fix
* oupsi
* remove space
* improve function
* CI
* Big refactor, still classes to move around and script to re-complexify
* Move to streamer, isolate benches, propagate num tokens
* Some refacto
* Added compile mode to name
* Re-order
* Move to dt_tokens
* Better format
* Fix and disable use_cache by default
* Fixed compile and SDPA backend default
* Refactor results format
* Added default compile mode
* Always use cache
* Fixed cache and added flex
* Plan for missing modules
* Experiments: no cg and shuffle
* Disable compile for FA
* Remove wall time, add sweep mode, get git commit
* Review compliance, start
* Apply suggestions from code review
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Update benchmark_v2/framework/benchmark_runner.py
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Disable workflow
* Pretty print
* Added some pretty names to have pretty logs
* Review n2 compliance (end?)
* Style and end of PR
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Fixed Expected self.dtype to be equal to src.dtype on eval
* generated modeling_qwen3_vl_moe.py file
* Fixed Ernie_4_5_MoE router casting
* Fixed routing_weights dtype casting (ernie4_5_moe, hunyuan_v1_moe, qwen2_moe, qwen3_moe, qwen3_next,qwen3_omni_moe)
* rollback hunyuan_v1_moe changes
---------
Co-authored-by: Daniel Oliveira <daniel-oliveira-11@hotmail.com>
Co-authored-by: Daniel Oliveira <36623265+daniel3303@users.noreply.github.com>
For FSDP2, parameters might be on a meta device, and the weight.device attribute may
not accurately reflect where the actual computation will happen during forward passes.
```log
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 776, in forward
pos_embeds = self.fast_pos_embed_interpolate(grid_thw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 745, in fast_pos_embed_interpolate
pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
^^^^^^^
File "torch/nn/modules/module.py", line 1827, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/sparse.py", line 192, in forward
return F.embedding(
^^^^^^^^^^^^
File "torch/nn/functional.py", line 2546, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)
```
https://github.com/volcengine/verl/pull/3686#issuecomment-3380981817
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* Add video processor for VideoMAE
* Document VideoMAE video processor
* Add regression tests for VideoMAE video processor
* refactor: Use direct batch key access for pixel_values_videos
* test: add parity test for VideoMAEVideoProcessor vs VideoMAEImageProcessor
* docs(videomae): update model docstring example to demonstrate VideoMAEVideoProcessor (TorchCodec-based decoding and sampling)
* Type hints and small fixes
* Remove unused params
* Made slice inputs the default
* ruffed
* Updated some var name and moved index slicing
* Logging arg in example
* Added some padding debug var and reformat out cg
* First working CG, fixed size
* Working flexible CG
* CG are compatible with all implementations
* Fixed CG API
* Update example
* Documentation
* Fix padding tokens in FA
* Review compliance
* Better doc around weird bug
* Style
* Fix for sliding with CG
* Merge conflict
* add fast processor
* make style
* add new convert rgb
* use nested group by shape in mllama fast, add support for multiple inputs in group by shape
* refactor after review
---------
Co-authored-by: Vincent <phamvinh257@gmail.com>
```
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 941, in forward
hidden_states = self._deepstack_process(
^^^^^^^^^^^^^^^^^^^^^^^^
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 960, in _deepstack_process
hidden_states[visual_pos_masks, :] = local_this
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* Set `truncation` to `False` in Qwen3Omni to avoid default truncation
* move `padding` and `truncation` to audio default args
---------
Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>
* [wip][cwm] Code World Model stubs and setup in HF Transformers
* [wip] Get other things working
* [wip] Working
* Tokenizer pad
* fix: cwm window attn
* temp remove test
* Fixes
* Temporarily add auto config remapping option until VLLM 0.11 is out
* Fix model type and add layer validation
* Lint, remove CwmForSequenceClassification
* Lint, tests
* Remove CwmForSequenceClassification
* Lint
* Remove intermediary layer exports/doc errors, fix tests
* Lint
* run python utils/sort_auto_mappings.py --check_only
* Remove Cwm processor mapping, get check_repo passing
* Remove CwmTextConfig from test
* Add docstring for CwmConfig
* remove global_window and window_pattern params from config
* Fix docstrings
* Revert change to auto docstring util
* lint
* Fixes minus test improvements
* Alter tests to simply check logits
* lint
* Have slow tests use repo, make CwmPretrainedModel passthrough
* Remove decoder layer implementation, use Llama3Decoder + CwmAttention
* Use linear w/o bias for CwmAttention, add token-level integration test
* Don't ignore config attention bias
* Remove attention bias parameter entirely from config
---------
Co-authored-by: galco <galco@meta.com>
* new masks
* fixes
* adjust comments
* fix unnecessary mask creation on sdpa
* simplify masks more
* propagate to other models
* style + repo consistency
* copies
* no comment
* fix attempt
* finally fix grounding dinos
* fix distilbert
* fix executorch
* move to own module
* address first few comments WIP
* revert device comments, simplify executorch further
* fix typo
* add a test for cuda graphs
* move cleanup...
* fix conflict with new main
* fix esm and evolla
* Update rt_detr docs to mention 640x640 input size
The authors of RT-Detr mention that the model was trained on 640x640 images and was meant to be used for inference on 640x640 images.
Also, the current implementation has certain quirks that make training/inferring on images of different sizes problematic. For example,
the pixel masks used for batches of varying image sizes are discarded. I've added a few lines in the docs to notify the user about these issues.
* Batching not possible with variable image sizes
* Remove reference to batching
---------
Co-authored-by: Konstantinos Pitas <kostasp210@gmail.com>
* [new-models] LFM2-MoE
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [docs] add in template lfm2_moe doc files
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [configuration] update configuration class
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modular][lfm] minor: fix rotary_emb typo
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling] modular/modeling files for Lfm2Moe
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling][lfm2_moe] fix Lfm2Moe modular/modeling
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [configuration][lfm2_moe] update configuration keys with latest config changes
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] make fixup
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modular][lfm2_moe] address comments: dtype, mlp, buffers
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [configuration][lfm2_moe] add initializer_range
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modular][lfm2_moe] include init_weights to pass test_initialization
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [tests][causal_lm] include pos_emb as possible rope attribute
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling][lfm2_moe] remove load_balancing_loss_func due to lack of support for hooking expert biases
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] make style
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [modeling][lfm2_moe] MoE refactor PR update in LFM2Moe
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [tests] lfm2_moe: unit tests
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] update LFM2-8B-A1B repo id
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [tests] lfm2: update ModelTests for lfm2
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* Update LFM2 documentation
Updated the LFM2 documentation to reflect the addition of a new model size and clarified architectural details.
* Add Lfm2Moe documentation
Add Lfm2Moe model documentation with overview and example usage.
* [misc] fix ci
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [docs] remove trust_remote_code
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [misc] ci: fix modular
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* reapply modular
* simplify
* remove static address and inplace op
* simplify
* simplify a bit more the modular
* imports
---------
Signed-off-by: Paul Pak <paulpak58@gmail.com>
Co-authored-by: Maxime Labonne <81252890+mlabonne@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* [Cache] lfm2 cache: allocate empty kv layers during init
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* [Cache] lfm2_cache: update modular file
Signed-off-by: Paul Pak <paulpak58@gmail.com>
---------
Signed-off-by: Paul Pak <paulpak58@gmail.com>
* init commit
* style
* take comments into account
* merge with main and simplify
* nits
* final
* small fixes
* fix
* super small update!
* add another test
* up up
* update
* fixes
* sort them by default
* Use canonical get_size_with_aspect_ratio (with max_size) from transformers.image_transforms to fix #37939
* Fix import sorting/style
* Fix import order
* Refactor: use canonical get_size_with_aspect_ratio across image processors (except YOLOS)
This commit updates image processing utilities in multiple model processors to use the shared
transformers.image_transforms.get_size_with_aspect_ratio for consistent resizing logic and
aspect ratio handling.
YOLOS processors are intentionally left unchanged in this commit to preserve their current
behavior and avoid breaking model-specific padding/resizing assumptions. YOLOS will be updated
in a dedicated follow-up PR once compatibility is fully verified.
* ruff fixes
* Fix check_copies.py references for get_size_with_aspect_ratio to use canonical transformers.image_transforms version
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
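The resizing behaviour that the refactor above standardizes can be illustrated roughly as follows. This is a simplified sketch under stated assumptions: `resize_keep_aspect` is a hypothetical name, and its rounding is not guaranteed to match `transformers.image_transforms.get_size_with_aspect_ratio`.

```python
def resize_keep_aspect(image_size, size, max_size=None):
    """Scale so the shorter side equals `size`, capping the longer side at `max_size`."""
    height, width = image_size
    short_side, long_side = min(height, width), max(height, width)
    scale = size / short_side
    # If the longer side would exceed max_size, shrink the scale so it fits.
    if max_size is not None and long_side * scale > max_size:
        scale = max_size / long_side
    return round(height * scale), round(width * scale)


landscape = resize_keep_aspect((480, 640), size=800, max_size=1333)
panorama = resize_keep_aspect((400, 1600), size=800, max_size=1333)
```

In the second call the `max_size` cap takes over, so the shorter side ends up smaller than the requested `size`.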
* Fix flash_attention.py: wrong argument passing for attn_implementation
The attention-type argument of `_flash_attention_forward()` is named `implementation`, but the function call currently passes `attn_implementation`. This results in the wrong attention type being specified.
* modify the kwargs inside _flash_attention_forward
* fix the doc
* fix typo
---------
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
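A misnamed keyword like the one fixed above can go unnoticed when the callee also accepts extra keyword arguments. A minimal sketch, assuming a hypothetical `flash_forward` stand-in (not the real `_flash_attention_forward()` signature):

```python
def flash_forward(query, implementation=None, **kwargs):
    # Hypothetical stand-in: takes `implementation` and also accepts
    # arbitrary extra keyword arguments.
    return implementation


# Misnamed keyword: silently absorbed by **kwargs, so `implementation` stays None.
wrong = flash_forward("q", attn_implementation="flash_attention_3")

# Correct keyword: the value reaches the parameter.
right = flash_forward("q", implementation="flash_attention_3")
```

No exception is raised in the first call, which is why this kind of bug tends to surface only as unexpected behaviour downstream.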
* update all models
* fix copies
* skip aria tests
* update other models
* skip should be in test, not tester
* i think this is more descriptive as a name
* find and replace for new models
The main content of this PR is to fix a bug in the delete_adapter method
of the PeftAdapterMixin. Previously, it did not take into account
auxiliary modules from PEFT, e.g. those added by modules_to_save. This
PR fixes this oversight.
Note that the PR uses a new functionality from PEFT that exposes
integration functions like delete_adapter. Those will be contained in
the next PEFT release, 0.18.0 (yet unreleased). Therefore, the bug is
only fixed when users have a PEFT version fulfilling this requirement.
I ensured that with old PEFT versions, the integration still works the
same as previously. The newly added test for this is skipped if the PEFT
version is too low.
(Note: I tested locally that the test passes with PEFT 0.18.0.)
While working on this, I also cleaned up the following:
- The active_adapter property has been deprecated for more than 2 years
(#26407). It is safe to remove it now.
- There were numerous small errors or outdated pieces of information in
the docstrings, which have been addressed.
When PEFT < 0.18.0 is used, although we cannot delete modules_to_save,
we can still detect them and warn about it.
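The version-dependent behaviour described above can be sketched as follows. The helper names are hypothetical and the version parser is deliberately naive; this is not the code in `PeftAdapterMixin`, and real code would more likely rely on `packaging.version`.

```python
import warnings

MIN_PEFT_VERSION = "0.18.0"  # version named in the commit message


def parse_version(version):
    # Naive parser for plain "X.Y.Z" strings (assumption: no pre-release tags
    # in the first three components).
    return tuple(int(part) for part in version.split(".")[:3])


def can_delete_aux_modules(peft_version, has_modules_to_save):
    """Return True if auxiliary modules can be deleted (hypothetical helper)."""
    if parse_version(peft_version) >= parse_version(MIN_PEFT_VERSION):
        return True
    if has_modules_to_save:
        # Older PEFT: the modules can be detected but not removed, so warn.
        warnings.warn(
            f"PEFT >= {MIN_PEFT_VERSION} is needed to delete modules_to_save; "
            "they are left in place."
        )
    return False
```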
* support aux loss in qwen3vlmoe
* update qwen3vl processor test!
* add integration tests for qwen3vl-30a3
* remove duplicated decorator
* code clean
* fix consistency
* do not inherit from nn.Linear for better quantization
* pass check
* allow private space id for trackio
* complete docstring
* Deprecate environment variables for Trackio integration; use TrainingArguments instead and deploy by default
* style
* Enhance documentation for Trackio Space ID in TrainingArguments
* update modeling mixtral
* oups
* fix
* better naming?
* compute softmax and top_k inside the experts
* update minimax as well
* models that will need an update
* more models that need a fix
* stash
* fix mixtral
* update olmoe
* update
* update
* current changes
* nits
* molmoe is now fixed
* olmoe is good to go!
* refactor qwen2_moe
* fixes
* fixed moe
* fix qwen2 modular
* nit
* qwen2_moe test script works
* tricky rope !
* fix qwen3
* DeepSeek v3 MoE Standardization (#40538)
* DeepSeek-v3
Shared
Shared
* Dependents of DS3
* Standardize GLM4V MoE (#40539)
* up
* Standardize VitPose's MoE (#40549)
* VitPose
* outside
* outside
* outside
* fix
* update dbrx
* dbrx... the magix
* Refactor Ernie 4.5's MoE (#40547)
* Isolate Ernie fixes
* fix moe
---------
Co-authored-by: Vasqu <antonprogamer@gmail.com>
* fix style
* style
* fix copies
* style
* latest changes
* fixes
* had to stage
* current updaters
* up
* another modular
* modular graniteMoe
* some update
* draft another modular moe
* updaters
* up
* fix nit
* q3 nit
* fix phi moe
* we're going up up up up its our mooooment
* fix switch transformers this time around
* up
* gptsan japanese is deprecated forget about it
* fix mixtral to not be a linear (gives us more freedom)
* update
* fix copies gone wrong try catch nothing
* fix mixtral
* new refactor again
* update aria as well
* up dbrx and deepseekv3
* nit
* fix phimoe?
* fix deepseek v3
* nits
* don't bother with this one please
* up olmoe
* ??
* fix olmoe
* yups
* fiupx
* ish
* hot patch
* new qwen3
* updates
* up
* nit
* fix copies
* fix
* nits
* we're going up up up
* nits
* switch_transformesr edge case
* lol modular gptsan?
* fix deepseek
* finally all modeling match modular
* update
* up
* up
* dang
* up
* up aria
* fix dbrx
* nits here and there
* finish fixing dbrx
* fix deepseek
* upd
* up
* fix flex olmo
* updated
* update jamba
* JAMBA is still a bit todo
* forward forward
* fix dots11
* update
* fix hunyuan
* fix some other
* update phimoe
* fuck you phimoe you are now submitted
* submit granitemoe as well
* try to fix some other models, reduces some of the failures
* fix olmoe and qwen2moe
* up
* up
* fix qwen2_moe
* update modular make it again, simpler
* nits
* up
* up
* fix
* someswitch reductions
* up
* fix qwen3vl
* some fixes to jetmoe
* these should be shipped to the modular to fix jetmoe
* fix most of the nllb failures
* more nllb fixes
* fix the modular
* remove nllb modular as it sucks for now
* ?
* fix granitemoe
* granitemoehybrid don't have rope
* use rope when rope, no rope when no rope
* updates
* finish fixing dumbgrainite
* fix most of minimax
* fix
* update modular
* ?
* up
* up jetmoe still broken
* up
* fix, now align the moe
* fix jetmoe
* fix styling and qwen3 repo consistency
* update
* up up
* update ruff?
* nits
* modeling is good now for switch
* fix
* more fixes to switch!
* fix some switch test
* ?
* ?
* up
* fix switch modular!
* nit?
* uip
* subtest
* can't believe I wasted so much time on this...
* fix
* updates
* nits
* nit jamba is fucking annoying
* ?
* fix?
* oups
* good good
* styling
* up
* make sure qwen2 sliding works!
* fix dbrx small
* lol
* nits
* fix one test
* fix load balancing loss issue
* fix jamba
* fix nllbmoe
* fix jamba consistency and doc?
* up
* these are correct
* up
* up
* up
* some of the final cleanup
* update
* up
* fix some revert in granitemoe
* bring back attention multipliers for the granite family we'll see later on if they need removal
* small jamba fix docstring and typing
* fix phimoe
* yup
* fix unk returndict in granitemoes
* up
* fix qwen config
* fix phimoe check quality
* nits
* update based on caught non relative imports!
* fix dbrx
* Apply suggestions from code review
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* fix copies
* fiuxp
* fix dot1 regression!
* fix phimoe issue
* fix phi moe
* fix float() for some models
* fix jamba regression
* ui
* more dtype issues
* fix deepseek2 and 3?
* proper update
* fix modular deepseek!
* jamba jambaaaaaa
---------
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* fix multi-video timestamp bug in qwen3vl,glm4v
* run make fix-copies to sync modular files
---------
Co-authored-by: UBT <daqin.luo@ubtrobot.com>
* Fix sliding window attn mask
* Clearer test
* Apply style fixes
* If Picasso made ascii drawings he would have made this
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* first attempt at removing
* copies
* last bits in core
* quick fixes
* tests purge
* docs and examples
* some fixes
* more
* another round of cleanups
* fix
* fix a bunch of models
* fix dummy bert
* fix
* fix new model
* fix signature change
* fix
* fix style/copies
* new models
* fix copies didnt find that damn
* test
* this shouldnt have happened during model addition
* Add num_items_in_batch computation to predict_step.
* address comments.
* Fix test cases.
* fixup
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Fix Qwen3-Omni audio_token_id serialization by overriding parent's attribute_map
- Override attribute_map in Qwen3OmniMoeThinkerConfig to prevent inheritance of incorrect mapping
- Parent class maps audio_token_id -> audio_token_index, but implementation uses audio_token_id directly
- Fixes issue where custom audio_token_id values were not preserved during save_pretrained/from_pretrained cycles
Fixes #41191
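The effect of overriding an inherited `attribute_map` can be shown with a toy model of attribute aliasing. The classes below are hypothetical and much simpler than the real config classes; they only show how an inherited alias changes the key under which a value is stored and serialized.

```python
class BaseConfigSketch:
    # Toy aliasing mechanism; not the real PretrainedConfig.
    attribute_map = {}

    def __setattr__(self, key, value):
        # Redirect aliased names to their mapped attribute.
        key = type(self).attribute_map.get(key, key)
        object.__setattr__(self, key, value)

    def __getattr__(self, key):
        # Only called when normal lookup fails; follow the alias if one exists.
        mapped = type(self).attribute_map.get(key)
        if mapped is None:
            raise AttributeError(key)
        return object.__getattribute__(self, mapped)

    def to_dict(self):
        return dict(vars(self))


class ParentConfigSketch(BaseConfigSketch):
    attribute_map = {"audio_token_id": "audio_token_index"}


class ThinkerConfigSketch(ParentConfigSketch):
    # Override so that audio_token_id is stored under its own name.
    attribute_map = {}


parent = ParentConfigSketch()
parent.audio_token_id = 7
child = ThinkerConfigSketch()
child.audio_token_id = 7
```

Here `parent.to_dict()` holds the value under `audio_token_index`, while `child.to_dict()` keeps it under `audio_token_id`, which is the name the implementation reads directly.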
* embed timeline in docs (test web component and iframe)
* test scaling
* test multiple scales
* compensate scale in width
* set correct style and scale
* remove bottom space created by scale
* add timeline as a separate page
* reformulate docs after review
* initial comment
* test
* initial conversion for outline
* intermediate commit for configuration
* chore:init files for sam2
* adding arbitrary undefined config
* check
* add vision
* make style
* init sam2 base model
* Fix imports
* Linting
* chore:sam to sam2 classes
* Linting
* Add sam2 to models.__init__
* chore:match prompt encoder with sam2 code
* chore:prepare kwargs for mask decoder
* Add image/video predictors
* Add CUDA kernel
* Add output classes
* linting
* Add logging info
* tmp commit
* docs for sam2
* enable image processing
* check difference of original SAM2
- difference is the order of ToTensor()
- please see https://pytorch.org/vision/main/_modules/torchvision/transforms/functional.html#resize
* enable promptencoder of sam2
* fix promptencoder
* Confirmed that PromptEncoder is exactly same (Be aware of bfloat16 and float32 difference)
* Confirmed that ImageEncoder is exactly same (Be aware the linting of init)
* Confirmed that MaskDecoder is exactly same (TO DO: lint variable name)
* SamModel is now available (Need more chore for name)
* make fix-copies
* make style
* make CI happy
* Refactor VisionEncoder and PositionEmbedding
* TO DO : fix the image_embeddings and sparse_embeddings part
* pure image inference done
* reusable features fix and make style
* styling
* refactor memoryattention
* tmp
* tmp
* refactor memoryencoder
TO DO : convert and inference the video pipeline
* TO DO : fix the image_encoder shape
* conversion finish
TO DO: need to check video inference
* make style
* remove video model
* lint
* change
* python utils/check_docstrings.py --check_all
* python utils/check_config_attributes.py
* remove copies for sam2promptencoder due to configuration
* change __init__.py
* remove tensorflow version
* fix that to not use direct comparison
* make style
* add missing import
* fix image_embedding_size
* refactor Sam2 Attention
* add fully working video inference (refactoring todo)
* clarify _prepare_memory_conditioned_features
* simplify modeling code, remove unused paths
* use one model
* use auto_docstring
* refactor rope embeddings
* nit
* not using multimask when several points given
* add all sam2.1
* add video tmp
* add Sam2VideoSessionState + fast image proc + video proc
* remove init_states from model
* fix batch inference
* add image integration tests
* uniformize modeling code with other sam models and use modular
* pass vision tests and most model tests
* All tests passing
* add offloading inference state and video to cpu
* fix inference from image embedding and existing mask
* fix multi_boxes mask inference
* Fix batch images + batch boxes inference
* improve processing for image inference
* add support for mask generation pipeline
* add support for get_connected_components post processing in mask generation
* add fast image processor sam, image processor tests and use modular for sam2 image processor
* fix mistake in sam after #39120
* fix init weights
* refactor convert
* add integration tests for video + other improvements
* add needed missing docstrings
* Improve docstrings and
* improve inference speed by avoiding cuda sync
* add test
* skip test for vision_model
* minor fix for vision_model
* fix vision_model by adding sam2model and change the torch dependencies
* remove patch_size
* remove image_embedding_size
* fix patch_size
* fix test
* make style
* Separate hieradet and vision encoder in sam2
* fixup
* review changes part 1
* remove MemoryEncoderConfig and MemoryAttentionConfig
* pass q_stride instead of q_pool module
* add inference on streamed videos
* explicitly process streamed frames
* nit
* Improve docstrings in Sam2Model
* update sam2 modeling with better management of inference state and cache, and separate Sam2Model and Sam2VideoModel
* improve video inference api
* change inference_state to inference_session
* use modular for Sam2Model
* fix convert sam2 hf
* modular
* Update src/transformers/models/sam2/video_processing_sam2.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix minor config
* fix attention loading error
* update modeling tests to use hub checkpoints
* Use CI A10 runner for integration tests values + higher tolerance for video integration tests
* PR review part 1
* fix doc
* nit improvements
* enforce one input format for points, labels and boxes
* nit
* last few nits from PR review
* fix style
* fix the input type
* fix docs
* add sam2 model as conversion script
* improve sam2 doc
* add rough necessary changes
* first working edgetam
* fix issue with object pointers
* Use modular as much as possible
* nit fixes + optimization
* refactor spatial perceiver
* cleanup after merge
* add working edgetam
* improve perceiver resampler code
* simplify/unify rope attention logic
* Improve comments in apply_rotary_pos_emb_2d
* add working tests
* fix test timmwrapper
* add docs
* make fixup
* nits
* fix modular
* fix modular
* PR review part 1
* split apply_rotary_pos_emb_2d
* add granularity to _prepare_memory_conditioned_features
* add dates to doc
* add separate mlp for memory attention
* Fix memory on wrong device
* store processed frames in dict
* update checkpoints in tests
* update dates
---------
Co-authored-by: sangbumchoi <danielsejong55@gmail.com>
Co-authored-by: RUFFY-369 <prakarshkaushik369@gmail.com>
Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com>
Co-authored-by: Haitham Khedr <haithamkhedr@meta.com>
Co-authored-by: sangbum choi <sangbumchoi@sangbumui-MacBookAir.local>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix param_needs_quantization
* rewrite most hqq
* clean
* fix
* comment
* remove it from exception of safetensors
* start on bnb 4bits
* post-rebase fix
* make bnb4 bit a good citizen
* remove forgotten print
* make bnb 8bits a good citizen
* better hqq
* fix
* clean
* remove state dict from signature
* switch method
* make torchao a good citizen
* fixes
* fix torchao
* add check
* typo
* Fix attention sink implementation in flex attention
* fix dim
* fix
* Remove print
* raise error when return_lse is False yet s_aux is provided
* Clean test files for merge
* Update src/transformers/integrations/flex_attention.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* force return lse
* Add to doc
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix(trainer): Avoid moving model with device_map
When a model is loaded with `device_map="auto"` and is too large to fit on a single GPU, `accelerate` will offload some layers to the CPU or disk. The `Trainer` would previously attempt to move the entire model to the specified device, causing a `RuntimeError` because a model dispatched with `accelerate` hooks cannot be moved.
This commit fixes the issue by adding a check in `_move_model_to_device` to see if the model has an `hf_device_map` attribute. If it does, the device placement is assumed to be handled by `accelerate`, and the `model.to(device)` call is skipped.
A regression test is added to ensure the `Trainer` can be initialized with a model that has a `hf_device_map` that simulates offloading without raising an error.
* Added the logger warning for the move model
---------
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
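The check described in the `device_map` fix above can be sketched with a stand-in model. `DummyModel` and `move_model_to_device` are hypothetical names used for illustration only; this is not the `Trainer` implementation.

```python
class DummyModel:
    # Hypothetical stand-in for a model; records the device it was moved to.
    def __init__(self, hf_device_map=None):
        self.moved_to = None
        if hf_device_map is not None:
            self.hf_device_map = hf_device_map

    def to(self, device):
        self.moved_to = device
        return self


def move_model_to_device(model, device):
    # If the model carries an hf_device_map, placement is assumed to be
    # handled by accelerate, so the move is skipped.
    if getattr(model, "hf_device_map", None) is not None:
        return model
    return model.to(device)


plain = move_model_to_device(DummyModel(), "cuda:0")
dispatched = move_model_to_device(
    DummyModel(hf_device_map={"layer.0": 0, "layer.1": "cpu"}), "cuda:0"
)
```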
* fix(trainer): Fix the issue of inaccurate token count in training sessions
During the training process, the initial token count was not saved, leading to inaccurate speed calculation. Now, the initial token count is saved and the increment during the session is calculated, ensuring that the speed metric accurately reflects the performance of the current training session.
* Fix errors
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
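The arithmetic behind the token-count fix above, as a small sketch. The function name and the numbers are illustrative assumptions, not taken from the `Trainer` code.

```python
def session_tokens_per_second(initial_tokens, current_tokens, elapsed_seconds):
    # Only tokens seen during this session count toward the session's speed.
    session_tokens = current_tokens - initial_tokens
    return session_tokens / elapsed_seconds


# Resumed run: 1_000_000 tokens had been seen before this session started,
# and 200_000 more were processed in the 100 seconds since.
naive = 1_200_000 / 100  # also counts tokens from earlier sessions
corrected = session_tokens_per_second(1_000_000, 1_200_000, 100)
```

The naive figure overstates throughput because it divides the cumulative token count by the duration of the current session alone.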
* halfway through the models
* update test checks
* refactor all
* another one
* use tuples
* more deletions
* solve bad inheritance patterns
* type
* PR ready?
* automatic model class inference from the base class
* vaultgemma
* make fixup
* make fixup
* rebase with gpt2
* make fixup :'(
* gpt2 is special
* XPU supports gpt-oss MXFP4
* Complete MXFP4 UT file and comment information
* Fix code style
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update CI workflows to use devmi355 branch
* Add workflow trigger for AMD scheduled CI caller
* Remove unnecessary blank line in workflow YAML
* Add trigger for workflow_run on main branch
* Update workflow references from devmi355 to main
* Change runner_scale_set to runner_group in CI config
* Add FA to docker
* Fixed padding for modernbert
* Fixed logits and hidden states extraction in ModernBertForMultipleChoice
* Added a test for ModernBertForMultipleChoice
* fixes
* More fixes and GREEN CI
* consistency
* moar consistency
* Add FA to docker
* Use caching mechanism for qwen2_5
* Fix a typo in important models list
* Partial fixes for gemma3
* Added a commit ID for FA repo
* Detailed the expectation storage format
* Rebase fix
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* remove unexpected keys from inputs (they have nothing to do there)
* remove input
* simplify a lot init
* fix
* fix check for non-persistent buffer
* revert because too many old and bad models...
* remove comment
* type hint
* make it a real test
* remove model_to_load -> always use the same model
* typo
* remove legacy offload_folder (we never waste that memory anymore)
* do not change prefix anymore
* change very bad function name
* create adjust method
* remove useless method
* restrict
* BC
* remove unused method
* CI
* remove unused args
* small fix
* fix
* CI
* CI
* avoid too many loops
* fix regex
* cleaner
* typo
* fix
* fix
* Adapt and test huggingface_hub v1.0.0.rc0
* forgot to bump hfh
* bump
* code quality
* code quality
* relax dependency table
* fix has_file
* install hfh 1.0.0.rc0 in circle ci jobs
* repository
* push to hub now returns a commit url
* catch HfHubHTTPError
* check commit on branch
* add it back
* fix ?
* remove deprecated test
* uncomment another test
* trigger
* no proxies
* many more small changes
* fix load PIL Image from httpx
* require 1.0.0.rc0
* fix mocked tests
* fix others
* unchange
* unchange
* args
* Update .circleci/config.yml
* Bump to 1.0.0.rc1
* bump kernels version
* fix deps
* fix mismatched dims for qwen3 next
* propagate changes
* chore: renamed tot_heads to total_sequence_length
* Apply suggestion from @vasqu
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* minor fix to modular qwen3 next file
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* add gguf config mapping for lfm2
* add lfm2 tensor process to unsqueeze conv weights
* adjust values from gguf config to HF config
* add test for lfm2 gguf
* ruff
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* tmp
* fix modular inheritance
* nit
* paligemma 1 doesn't have swa
* use same pattern as in models with hybrid layers
* PR comments
* helium also needs layer_typed (bc it relies on gemma)
* paligemma/gemma3: same mask creation fn in fwd and generate
* propagate changes to helium (gemma-based)
* tmp commit
* slow paligemma tests passing, let's see what breaks
* fix test_left_padding_compatibility
* tmp commit
* tmp commit
* rebase error
* docs
* reduce diff
* like this?
* t5gemma
* better comment
* shorter diff
* exception
* ffs type
* optional
* shorter modular_gemma.py
* helium model actually needs no changes -- the tester is the issue
* t5gemma modular config
* a few more modular; paligemma BC
* fix processor issues?
* rm config exception
* lift warning in gemma
* fix bug in Mamba2 docs
* correct 'because on of' issue
* link to other Mamba2 model types
* github URL is not changed
* update error message in generated files
* [i18n-bn] Add Bengali language README file and update links in existing language files
* Update Bengali README for clarity and consistency in model descriptions
* Fix typos and formatting in English docs
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
* Fix typos and formatting in Chinese docs
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
---------
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com>
* Add Qwen3Omni
* make fix-copies, import properly
* nit
* fix wrong setup. Why was audio_token_id renamed ?
* upds
* more processing fixes
* yup
* fix more generation tests
* down to 1?
* fix import issue
* style, update check repo
* up
* fix quality at my best
* final quality?
* fix doc building
* FINAL COMMIT: SKIP IMPORTANT BUT FAILING TESTS FOR MERGE
* SKIP THE TEMPLATE ONE
---------
Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>
* fix: bug that made early stop change order of matches
* fix: applied code suggestion
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix: applied code suggestion to modular
* fix: integration tests
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
ENH Enable readline support for chat
This small change enables GNU readline support for the transformers chat
command. This includes, among others:
- advanced navigation and editing: ctrl + a ctrl + e alt + b alt + f
ctrl + k alt + d etc.
- navigate and search history: arrow up/down ctrl + p ctrl + n ctrl + r
- undo: ctrl + _
- clear screen: ctrl + l
Implementation
Although it may look strange, just importing readline is enough to
enable it in Python, see:
https://docs.python.org/3/library/functions.html#input
As readline is not available on some
platforms (https://docs.python.org/3/library/readline.html), the import
is guarded.
Readline should work on Linux, macOS, and with WSL; I'm not sure about
Windows though. Ideally, someone can give it a try. It's possible that
Windows users would have to install
pyreadline (https://pypi.org/project/pyreadline3/).
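The guarded import described above can be sketched as follows. This is a minimal illustration rather than the code from the commit: `HAS_READLINE` and `read_user_message` are hypothetical names, and the prompt string is an assumption.

```python
# Guarded import: readline is not available on every platform,
# so a missing module must not break the chat command.
try:
    import readline  # noqa: F401  (imported only for its side effect on input())

    HAS_READLINE = True
except ImportError:
    HAS_READLINE = False


def read_user_message(prompt: str = "<user>: ") -> str:
    """Read one line from the terminal.

    When readline was imported successfully, input() gains line editing
    and history without any further code.
    """
    return input(prompt)
```

Nothing else in the program needs to reference `readline`; the flag is only there so callers can tell the user whether history and editing shortcuts are active.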
* clean start to bert refactor
* some test fixes
* style
* fix last tests
* be strict on positional embeddings, fixup according tests
* cache support
* more cache fixes, new causal API
* simplify masks, fix tests for gen
* flex attn, static cache support, round of fixes
* ?
* this time
* style
* fix flash attention tests, flex attention requires torch 2.7.x to work with multiple classes (as recompile strats force a size call which is wrongly interpreted before)
* roberta
* fixup sdpa remains
* attention split, simplify args and kwargs, better typing
* fix encoder decoder
* fix test
* modular roberta
* albert
* data2vectext, making it modular tomorrow
* modular data2vec text
* tmp disable
* xmod + cache position fixes
* whoops
* electra + markuplm, small fixes
* remove wrong copy
* xlm_roberta + some embedding fixes
* roberta prelayernorm
* RemBert: remove copy, maybe doing it later
* ernie
* fix roberta offloading
* camembert
* copy fixes
* bert generation + fixes on eager
* xlm roberta xl
* bridgetower (text) + seamlessv2 copy fixes
* rocbert + small fixes
* whoops
* small round of fixups
* NOTE: kernels didnt load with an earlier version, some fixup (needs another look bc cross deps)
* the end of the tunnel?
* fixup nllbmoe + style
* we dont need this anymore
* megatron bert is barely used, low prio skip for now
* Modernize bert (template for others)
NOTE: trying to push this through, might be overdue if not in time possible
* check inputs for all others (if checkmarked)
* fix bridgetower
* style
* fix encoder decoder (partially but cause found and fix also, just needs to be done for everything else)
* proper fix for bert to force intermediate dict outputs
* propagate to others
* style
* xlm roberta xl investigation, its the layernorm...
* mobile bert
* revert this, might cause issues with composed models
* review
* style
* setup
* start the purge
* continue the purge
* more and more
* more
* continue the quest: remove loading tf/jax checkpoints
* style
* fix configs
* oups forgot conflict
* continue
* still grinding
* always more
* in the zone
* never stop
* should fix doc
* fix
* fix
* fix
* fix tests
* still tests
* fix non-deterministic
* style
* remove last rebase issues
* onnx configs
* still on the grind
* always more references
* nearly the end
* could it really be the end?
* small fix
* add converters back
* post rebase
* latest qwen
* add back all converters
* explicitly add functions in converters
* re-add
* Add LFM2-VL support
* add tests
* linting, formatting, misc review changes
* add siglip2 to auto config and instantiate it in lfm2-vl configuration
* decouple image processor from processor
* remove torch import from configuration
* replace | with Optional
* remove layer truncation from modeling file
* fix copies
* update everything
* fix test case to use tiny model
* update the test cases
* fix finally the image processor and add slow tests
* fixup
* typo in docs
* fix tests
* the doc name uses underscore
* address comments from Yoni
* delete tests and unshuffling
* relative import
* do we really handle imports better now?
* fix test
* slow tests
* found a bug in ordering + slow tests
* fix copies
* dont run compile test
---------
Co-authored-by: Anna <anna@liquid.ai>
Co-authored-by: Anna Banaszak <48625325+ankke@users.noreply.github.com>
* fix(trainer): ensure final checkpoint is saved when resuming training
* add test
* make style && slight fix of test
* make style again
* move test code to test_trainer
* remove outdated test file
* Apply style fixes
---------
Co-authored-by: rangehow <rangehow@foxmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* use consistent naming for padding
* no validation on pad size
* add warnings
* fix
* fix copies
* another fix
* fix some tests
* fix more tests
* fix last tests
* fix copies
* better docstring
* delete print
* working draft for LongCat
* BC changes to deepseek_v3 for modular
* format
* various modularities
* better tp plan
* better init
* minor changes
* make modular better
* clean up patterns
* Revert a couple of modular commits, because we won't convert in the end
* make things explicit.
* draft test
* toctree, tests and imports
* drop
* woops
* make better things
* update test
* update
* fixes
* style and CI
* convert stuff
* up
* ah, yes, that
* enable gen tests
* fix cache shape in test (sum of 2 things)
* fix tests
* comments
* re-Identitise
* minimize changes
* better defaults
* modular betterment
* fix configuration, add documentation
* fix init
* add integration tests
* add info
* simplify
* update slow tests
* fix
* style
* some additional long tests
* cpu-only long test
* fix last tests?
* urg
* cleaner tests why not
* fix
* improve slow tests, no skip
* style
* don't upcast
* one skip
* finally fix parallelism
* Support training florence2
* update doc and testing model to florence-community
* fix florence-2 test, use head dim 16 instead of 8 for fa2
* skip test_sdpa_can_dispatch_on_flash
* Apply style fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Fix #40067: add UMT5 support in GGUF loader (config, tokenizer, test)
* chore: fix code formatting and linting issues
* refactor: move UMT5 GGUF test to quantization directory and clean up comments
* chore: trigger CI pipeline
* refactor(tests): Move UMT5 Encoder GGUF test to GgufModelTests. This consolidates the new test into the main class for consistency.
* Add regression check to UMT5 encoder GGUF test
Verify encoder output against reference tensor values with appropriate tolerances for stability.
* Update tests/quantization/ggml/test_ggml.py
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Update tests/quantization/ggml/test_ggml.py
remove comments
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
---------
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Improve module name handling for local custom code
* Use `%lazy` in logging messages
* Revert "Use `%lazy` in logging messages"
This reverts commit 5848755d5805e67177c5218f351c0ac852df9340.
* Add notes for sanitization rule in docstring
* Remove too many underscores
* Update src/transformers/dynamic_module_utils.py
* Update src/transformers/dynamic_module_utils.py
---------
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* move checks to validate steps where possible
* fix csm and other models that override _sample
* ops dia you again
* opsie
* joao review
* Move variable output controls to `prepare_inputs_for_generation`
* fix a bunch of models
* back to basics
* final touches
* Fix for CB attn mask and refactor
* Tests for CB (not all passing)
* Passing tests and a logger fix
* Fixed the KV metrics that were broken when we moved to hybrid alloc
* Fix circular import and style
* Added tests for FA
* Unfolded test to have device expectations
* Fixes for H100
* more fixes for h100
* H100 are good
* Style
* Adding some comments from #40831
* Rename test
* Avoid 1 letter variables
* Dictionary is only removed during kwargs
* Test for supported sample
* Fix an involuntary slice
* Fixes for non-sliced inputs and small example improvements
* Slice inputs is more understandable
* Style
* Update no split modules in T5Gemma model
* Update no_split_modules also for T5Gemma modular
* Remove model_split_percents from test cases
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Fix edge case for tokenize (#36277)
* Fix tokenizing dtype for float input cases
* add test for empty input string
* handle an empty list of lists like [[]]
* add tests for tokenizer for models with input that is not plain text
* make token counting robust by reusing the existing include_num_input_tokens_seen variable: keep bool support for backward compatibility, also accept a string, leave the default unchanged, and add test cases
* local and remote codebases were out of sync; committing to resolve it, and also fixed a code quality issue
* ci: retrigger tests
* another attempt to trigger CI for checks
* Fix DeepSpeed mixed precision precedence over Accelerate defaults
Resolves issue where Accelerate would default to bf16 mixed precision
when a DeepSpeed config specifies fp16, causing a ValueError. The fix
ensures DeepSpeed config takes precedence over TrainingArguments defaults
while preserving explicit user settings.
Changes:
- Add override_training_args_from_deepspeed() method to handle config precedence
- Reorder mixed precision environment variable setting in TrainingArguments
- Ensure DeepSpeed fp16/bf16 settings override defaults but not explicit choices
Fixes #39849
* Add tests for DeepSpeed mixed precision precedence fix
- Add TestDeepSpeedMixedPrecisionPrecedence class with 3 focused tests
- Test DeepSpeed fp16/bf16 config overriding TrainingArguments defaults
- Test user explicit settings being preserved over DeepSpeed config
- Test precedence hierarchy: user settings > DeepSpeed config > defaults
- Replace massive 934-line test bloat with concise 50-line test suite
- Tests cover core functionality of PR #39856 mixed precision precedence fix
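The precedence hierarchy listed above (user settings > DeepSpeed config > defaults) can be sketched as a small resolver. This is an illustration under stated assumptions, not the `override_training_args_from_deepspeed()` implementation: the function name `resolve_mixed_precision`, the `"no"` default, and the use of `None` to mean "the user did not choose" are all hypothetical. Only the `fp16`/`bf16` sections with an `enabled` key follow the usual DeepSpeed config layout.

```python
from typing import Optional


def resolve_mixed_precision(
    user_choice: Optional[str],
    deepspeed_config: dict,
    default: str = "no",
) -> str:
    """Pick a precision mode: explicit user setting > DeepSpeed config > default."""
    # 1. An explicit user choice always wins.
    if user_choice is not None:
        return user_choice
    # 2. Otherwise honour whichever mode the DeepSpeed config enables.
    #    `is True` skips values such as "auto" that are not a firm choice.
    for mode in ("fp16", "bf16"):
        if deepspeed_config.get(mode, {}).get("enabled") is True:
            return mode
    # 3. Fall back to the default when neither source decides.
    return default
```

With this ordering, a DeepSpeed config that enables fp16 overrides a bf16 default, but never an explicit bf16 request from the user.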
* Fix module loading for models with dots in names
* quality check
* added test
* wrong import
* Trigger CI rerun after making test model public
* Update src/transformers/dynamic_module_utils.py
* Update tests/utils/test_dynamic_module_utils.py
* Update tests/utils/test_dynamic_module_utils.py
* Move test
* make fixup
---------
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
* CB example: better compare feature
* Cache managers, still issue w/ effective length
* WIP -- fix for effective length
* Renames
* Working, need better parity checks, we might be missing 1 token
* Small fixes
* Fixed wrong attn mask and broke cache into pieces
* Warmup is slowing down things, disabling it
* Cache was too big, fixed
* Simplified index objects
* Added a profile option to the example
* Avoid calls to memory reporting tools
* Restore full attention read indices for better latency
* Addressed some TODOs and style
* Docstrings for cache managers
* Docstrings for Schedulers
* Refactor schedulers
* [Important] Cache fix for sliding window, check with small sw size
* Updated doc for cache memory compute and cache as a whole
* Moved a todo
* Nits and style
* Fix for when sliding window is smaller than max batch per token
* Paged interface update
* Support for Flash in new API
* Fix example CB
* Fix bug in CB for paged
* Revert example
* Style
* Review compliance
* Style
* Styleeeee
* Removed NO_SLIDING_WINDOW
* Review #2 compliance
* Better art
* Turn cum_seqlens_k in a dict
* Attn mask is now a dict
* Update examples/pytorch/continuous_batching.py
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Addressed McPatate pro review
* Style and fix
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Add EfficientLoFTRImageProcessorFast for GPU-accelerated image processing
* Fix fast processor output format and add comprehensive tests
* Fix trailing whitespace in test file
* Apply ruff formatting to test file
* simplify pair validation logic
* add superglue tests to fast image processor
---------
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* Fix continue_final_message parameter in apply_chat_template
* after run fixup
* Handle trim in the template
* after fixup
* Update src/transformers/utils/chat_template_utils.py
---------
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* feat: err when unsupported attn impl is set w/ `--continuous_batching`
* refactor: move defaults and support list to CB code
* feat: add action item in error msg
* fix(serve): add default attn implementation
* feat(serve): add log when `attn_implementation` is `None`
* feat: raise Exception when attn_implementation is not supported by CB
* change |= operator to use torch logical or for friendly export to different backends
* change |= operator to use torch logical or for friendly export to different backends in grounding dino model
---------
Co-authored-by: Lewis Marshall <lewism@elderda.co.uk>
* initial commit
* initial setup
* Overriding imageGPT specific functions
* imported is_torch_available and utilized it for importing torch in imageGPT fast
* Created init and ImageGPTFastImageProcessorKwargs
* added return_tensors, data_format, and input_data_format to ImageGPTFastImageProcessorKwargs
* set up arguments and process and _preprocess definitions
* Added arguments to _preprocess
* Added additional optional arguments
* Copied logic over from base imageGPT processor
* Implemented 2nd draft of fast imageGPT preprocess using batch processing
* Implemented 3rd draft of imageGPT fast _preprocessor. Pulled logic from BaseImageProcessorFast
* modified imageGPT test file to properly run fast processor tests
* converts images to torch.float32 from torch.uint8
* fixed a typo with self.image_processor_list in the imagegpt test file
* updated more instances of image_processing = self.image_processing_class in the test file to test fast processor
* standardized normalization to not use image mean or std
* Merged changes from solution2 branch
* Merged changes from solution2 test file
* fixed testing through baseImageGPT processor file
* Fixed check_code_quality test. Removed unnecessary list comprehension.
* reorganized imports in image_processing_imagegpt_fast
* formatted image_processing_imagegpt_fast.py
* Added arg documentation
* Added FastImageProcessorKwargs class + Docs for new kwargs
* Reformatted previous
* Added F to normalization
* fixed ruff linting and cleaned up fast processor file
* implemented requested changes
* fixed ruff checks
* fixed formatting issues
* fix(ruff after merging main)
* simplify logic and reuse standard equivalence tests
---------
Co-authored-by: Ethan Ayaay <ayaayethan@gmail.com>
Co-authored-by: chris <christine05789@gmail.com>
Co-authored-by: Ethan Ayaay <98191976+ayaayethan@users.noreply.github.com>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* Squashed previous branch
* unify assisted generate to common decoding method signature
* move checks to validate steps where possible
* fix csm and other models that override _sample
* ops dia you again
* opsie
* joao review
* Fix broken Llama4 accuracy in MoE part
Llama4 accuracy is broken by a bug in
https://github.com/huggingface/transformers/pull/39501 . It forgot to
transpose the router_scores before applying it to routed_in, causing
Llama4 to generate garbage output.
This PR fixes that issue by adding back the transpose() and adding some
comments explaining why the transpose() is needed.
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
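Why the missing transpose matters can be shown with a toy example in plain Python. The layout here is an assumption made for illustration and is not taken from the commit message: `router_scores` is treated as a (tokens, experts) matrix, and `routed_in` as the token list repeated once per expert (expert-major order). All names and sizes are hypothetical.

```python
def flatten(matrix):
    """Row-major flatten of a list of lists."""
    return [value for row in matrix for value in row]


def transpose(matrix):
    """Swap rows and columns of a list of lists."""
    return [list(column) for column in zip(*matrix)]


# Toy sizes (assumed): 2 tokens, 3 experts.
router_scores = [
    [0.1, 0.2, 0.3],  # token 0: scores for experts 0, 1, 2
    [0.4, 0.5, 0.6],  # token 1: scores for experts 0, 1, 2
]
tokens = ["t0", "t1"]
num_experts = 3

# Expert-major repetition: all tokens for expert 0, then expert 1, and so on.
routed_in = tokens * num_experts

# Without the transpose, scores are flattened token-major and misaligned.
wrong = list(zip(routed_in, flatten(router_scores)))
# With the transpose, each (token, expert) slot gets its own score.
right = list(zip(routed_in, flatten(transpose(router_scores))))
```

In `wrong`, the second slot pairs token `t1` with 0.2, which is token 0's score for expert 1; in `right` it pairs `t1` with 0.4, its own score for expert 0. A misalignment of this kind silently scales every routed input by the wrong weight, which is consistent with the garbage output described above.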
* remove comment
---------
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* feat: support request cancellation
* test: add cancellation test
* refactor: use existing fn to check req cancellation
* feat(cb): make cancellation thread safe
* refactor(serve): update test to use `requests` instead of `httpx`
* Add instance attribute to DacVectorQuantize for use in DacResidualVectorQuantize.from_latents
* add from_latent tests
* style fix
* Fix style for test_modeling_dac.py
* add seq class for gemma3 text model
* add Gemma3TextForSequenceClassification to modeling file
* After run make fixup
* let's just check
* this is why it was crashing, tests were just failing...
* skip it, tested only for seq clf
---------
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
* fix MetaCLIP 2 wrong link & wrong model names in the documentation and docstrings
* ruff reformatted
* update files generated by modular
* update meta_clip2 to metaclip_2 to match the original
* _supports_flash_attn = False
---------
Co-authored-by: Yung-Sung Chuang <yungsung@meta.com>
* Support MUSA (Moore Threads GPU) backend in transformers
Add accelerate version check, needs accelerate>=0.33.0
* Support TF32 flag for MUSA backend
* fix typo
* fix: continuous batching in `transformers serve`
* fix: short circuit inner gen loop when prepare_next_batch prepared nothing
* docs: add comment explaining FastAPI lifespan
* test: add CB serving tests
* refactor: remove gen cfg max new tokens override bc unnecessary
* docs: add docstring for `ServeCommand::run`
* feat: use new `DecodeStream` API
* Expectations for gemma3
* Fixes for Qwen2_5_VL tests
* Added expectation but underlying pb is still there
* Better handling of mrope section for Qwen2_5_vl
* Fixes for FA2 tests and reformat batch test for Qwen2_5_Omni
* Fix multi-device error in qwen2_5_omni
* Style and repo-consistency
* Removed inherited test because fix in common
* slow tests fixes
* Style
* Fixes for qwen2_5_vl or omni for FA test
* update make nested image list
* fix make flat list of images
* update type anno
* fix image_processing_smolvlm
* use first image
* add verbose comment
* fix images
* rollback
* fix ut
* Update image_processing_smolvlm.py
* Update image_processing_idefics3.py
* add tests and fix some processors
* fix copies
* fix after rebase
* make the test cover chat templates
* skip udop, no point in fixing it
* fix after rebase
* fix a few more tests
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: raushan <raushan@huggingface.co>
* port from unmaintained jieba to rjieba
* Fix format
* replaced the line with rjieba instead of removing it
* cut_all is not included as a parameter. cut_all is a separate function in rjieba
* rev
* jieba remove installation
* Trigger tests
* Update tokenization_cpm.py
* Update tokenization_cpm_fast.py
---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* Add bfloat16 support detection for MPS (Apple Silicon) in is_torch_bf16_gpu_available
bfloat16 seems to have been supported for a few years now in Metal and torch.mps.
Make sure to allow it and not throw on bf16 usage with "Your setup doesn't support bf16/gpu." from TrainingArguments.
* Check bf16 support for MPS using torch method
Actually, it seems the method exists: 5859edf113/torch/_dynamo/device_interface.py (L519)
It simply checks whether you are on macOS 14 or higher.
* Document Metal emulation for bf16 support
Add note about Metal emulation for bf16 support on M1/M2.
* Update bf16 support check for MPS backend
is_bf16_supported() not exposed even if defined on MPSInterface, use same approach as in accelerate pr.
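The version check mentioned above ("macOS 14 or higher") can be approximated with the standard library alone. This is a sketch under assumptions, not the torch or accelerate implementation: `macos_major` and `bf16_on_mps_plausible` are hypothetical helper names, and the check only looks at the OS version, not at whether an MPS device is actually present.

```python
import platform
from typing import Optional


def macos_major(release: Optional[str] = None) -> Optional[int]:
    """Return the macOS major version, or None when not running on macOS."""
    if release is None:
        # platform.mac_ver()[0] is an empty string on non-macOS systems.
        release = platform.mac_ver()[0]
    if not release:
        return None
    return int(release.split(".")[0])


def bf16_on_mps_plausible(release: Optional[str] = None) -> bool:
    """Assumed rule from the commit text: bf16 needs macOS 14 or newer."""
    major = macos_major(release)
    return major is not None and major >= 14
```

Passing the release string explicitly, for example `bf16_on_mps_plausible("13.6.1")`, makes the rule easy to exercise on any machine.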
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* first step if flash not installed but you set to use it
* try importing
* now default to using it
* update our tests as well
* wow yesterday I was not awake
* fixup
* style
* lol the fix was very very simple
* `RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/kernels@main#egg=kernels` for updated dockers
* push review comments
* fix
---------
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* dump ugly option to check again tomorrow
* tiny update
* do not save as nested dict yet!
* fix and add tests
* fix dia audio tokenizers
* rename the flag and fix new model Evolla
* fix style
* address comments
* broken from different PR
* fix saving layoutLM
* delete print
* delete!
* init swissai model
* AutoModelForCausalLM
* AutoModelForCausalLM mapping
* qk norm and post ln optional
* fix wrong shape of qk norm: megatron uses head_dim
* automodel fixes
* minor fix in forward
* fix rope validation to accept llama3 scaling
* `SwissAIForTokenClassification` support
* Align `SwissAI` to v4.52.4
* Align `SwissAI` to v4.53.1
* Init CUDA xIELU
* `SwissAI*`->`Apertus*`
* ci fix
* check_docstring ignore ApertusConfig
* Licensing and placeholder tests
* Placeholder doc
* XIELU syntax
* `_xielu_python` optimization
* Fix xIELU
* [tmp] `{beta,eps}` persistent=False
until {beta,eps} saved in checkpoint
* Modular `Apertus`
* CUDA xIELU logging
* ci fix
* ci fix
* ci fix
* Update license
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* Update tests/models/apertus/test_modeling_apertus.py
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* `.utils.import_utils.is_torchdynamo_compiling`
* `Apertus` class ordering
* `past_key_value{->s}`, `make fix-copies`
* ci fix
* Remove unused configuration parameters
* `{beta,eps}` saved in checkpoint
* `{beta,eps}` Temporarily on CPU
* Suggestions
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* ci fix
* remove fx_compatible (deprecated)
* remove `rotary_embedding_layer`
As the tests are written for a config without default scaling (which is not the case in Apertus) - besides, rope scaling is tested in other models so it's all safe.
* fully removing `Mask4DTestHard` class
Not needed (for now)
* switch to `dtype` instead of `torch_dtype`
Following this:
https://github.com/huggingface/transformers/pull/39782
* remove unused imports
* remove `cache_implementation="static"`
* +Apertus to `docs/source/en/_toctree.yml` for the doc builder
---------
Co-authored-by: Alexander Hagele <alexanderhagele@gmail.com>
Co-authored-by: dhia680 <garbayad@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Dhia Garbaya <84809366+dhia680@users.noreply.github.com>
* docs(pixtral): Update Pixtral model card to new format
* docs(pixtral): Change cuda into auto for device_map
* docs(pixtral): Apply suggestions from review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(pixtral): Apply suggestions from review, changing mistral-community into Mistral AI
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(pixtral): Apply suggestions from review [!TIP] part
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(pixtral): Finalize model card with tested code examples
This commit finalizes the update for the Pixtral model card.
* Fix the hfoption by the right one
* @BryanBradfo docs(pixtral): Changing the redirection of bitsandbytes
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(pixtral): Add of ` to highlight the tokens
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(pixtral): Move image block per final review
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix in modular
* remove leftover print
* fix everything except when it's in assignment
* fix assignment as well
* more general
* better
* better
* better comment
* docstring
* cleaner
* remove base
* doc
* Rework of the CB example
* Further rework of CB example
* Refactor PA cache, slice on tokens, add debug prints -- WIP
* Slice cache -- WIP
* Added a mechanism to check batched outputs in CB script
* Less logging, debug flag for slice, !better reset! -- WIP
* QOL and safety margins
* Refactor and style
* Better saving of cb example
* Fix
* Fixes and QOL
* More information about metrics
* Further logging
* Style
* Licenses
* Removed some comments
* Add a slice input flag
* Fix in example
* Added back some open-telemetry deps
* Removed some aux function
* Added FA2 option to example script
* Fixed math (all of it)
* Added a simple example
* Renamed core to classes
* Made allocation of attention mask optional
* Style
* Relaxed assumptions on cache_config
* Review compliance
* Style
* Styyyle
* Removed default and added args
* Rebase mishap fix
* Propagate args to TorchExportableModuleForDecoderOnlyLM
* Fix the test I wanted fixed in this PR
* Added some AMD expectation related to cache tests
* draft update two models for now
* batch update all VLMs first
* update some more image processors
* update
* fix a few tests
* just make CI green for now
* fix copies
* update once more
* update
* unskip the test
* fix these two
* fix torchcodec audio loading
* maybe
* yay, i fixed torchcodec installation and now can actually test it
* fix copies deepseek
* make sure the metadata is returned when users request it
* add docs
* update
* fixup
* Update src/transformers/audio_utils.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/models/glm4v/video_processing_glm4v.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* update
* what if we set some metadata attr to `None`
* fix CI
* fix one test
* fix 4 channel test
* fix glm timestamps
* rebase gone wrong
* raise warning once
* fixup
* typo
* fix copies
* fix smolvlm test
* this is why torch's official benchmark was faster, set threads to `0`
* Apply style fixes
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* initial context_parallel_size support in trainer
* For context parallelism, use AVG instead of SUM to avoid over-accounting tokens
* use parallelism_config.cp_enabled
* add parallelism_config to trainer state
* warn when auto-enabling FSDP
* fix some reviews
* WIP: somewhat matching loss
* Feat: add back nested_gather
* Feat: cleanup
* Fix: raise on non-sdpa attn
* remove context_parallel_size from TrainingArguments
* if we have parallelism_config, we defer to get_state_dict from accelerate
* fix from review
* Feat: add parallelism config support
* Chore: revert some unwanted formatting changes
* Fix: check None
* Check none 2
* Fix: remove duplicate import
* Update src/transformers/trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update src/transformers/training_args.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Fin
* require accelerate 1.10.1 and higher
---------
Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
Co-authored-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Add `tokenizer_kwargs` arg to text generation pipeline.
* chore: re-run CI
* Rename `tokenizer_kwargs` to `tokenizer_encode_kwargs` for text generation pipeline
* Fix `tokenizer_encode_kwargs` doc string.
* Fix note related to `tokenizer_kwargs` in text generation pipeline
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* add a test
* tempdir
* fix import issue
* wow I am tired
* properly init
* i am not super familiar with quantizer api :|
* set to TRUE for now
* full support
* push current changes
* will clean this later but the imports are a shitshow here
* this correctly saves the block and scales but forward seems broken
* quantize was not correct
* fix storage
* why were bias even included
* finally!
* style
* fix style
* remove print
* lazy import
* up
* not sure what happens this works now?
* holy molly it was not so far
* okay this seems to work!
* workings!!!
* allow save_pretrained to create PR
* Apply suggestions from code review
* fixup
* add dequantize false as well
* working new
* fix
* rm swizzle and unswizzle during saving
* rm print
* Update src/transformers/modeling_utils.py
* fix
* style
---------
Co-authored-by: Marc Sun <marc@huggingface.co>
* Fix label smoothing incompatibility with multi-label classification (#40258)
* Improve label smoothing multi-label check based on reviewer feedback
- Move check from LabelSmoother to Trainer.__init__() for better architecture
- Use model.config.problem_type instead of tensor inference for robustness
- Warn and disable smoothing instead of raising error for better UX
- Update test to verify warning behavior
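The "warn and disable" behaviour described above can be sketched as a small helper. This is an illustration, not the actual `Trainer.__init__()` code: `effective_label_smoothing` is a hypothetical name and the warning text is invented. The string `"multi_label_classification"` is assumed to be the `problem_type` value that marks a multi-label model.

```python
import warnings
from typing import Optional


def effective_label_smoothing(factor: float, problem_type: Optional[str]) -> float:
    """Return the smoothing factor to use, disabling it for multi-label models."""
    if factor > 0 and problem_type == "multi_label_classification":
        # Warn instead of raising so that training can still proceed.
        warnings.warn(
            "Label smoothing is not supported for multi-label classification; "
            "disabling it.",
            UserWarning,
        )
        return 0.0
    return factor
```

Reading the problem type from the config keeps the decision independent of the shape or dtype of any particular label tensor, which is the robustness point made in the commit body.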
Renamed wer metric variable to wer_metric to avoid naming conflict
with local variable assignment in compute_metrics function.
Co-authored-by: pranam-gf <pranam@goodfin.com>
Fixed 4 instances of the typo "seperator" → "separator" in variable names:
- 2 instances in src/transformers/models/shieldgemma2/convert_shieldgemma2_weights_orbax_to_hf.py
- 2 instances in src/transformers/models/gemma3/convert_gemma3_weights_orbax_to_hf.py
These typos were in variable names used for parsing path components in weight conversion scripts.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
* fix to the typings which are unmatched to FA function signature
cumulative_seqlens_q/k -> cu_seq_lens_q/k:
- in the FlashAttentionKwargs in modeling_flash_attention_utils
- in the TransformersKwargs in generic
- in the PagedAttentionArgs in continuous_batching
It is **BC**, because they are created in `ContinuousBatchProcessor.setup_static_tensors:L762`, used in `ContinuousBatchingManager._model_forward:L1233` and destroyed with `ContinuousBatchProcessor`
* format changes by ruff
* Update src/transformers/integrations/flash_paged.py
unused function arg in `PagedAttentionCache.update`
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* revert continuous_batching signature, which is more meaningful
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* simplify common get/set
* remove some noise
* change some 5 years old modeling utils
* update examples
* fix copies
* revert some changes
* fixes, gah
* format
* move to Mixin
* remove smolvlm specific require grad
* skip
* force defaults
* remodularise some stuff
* remodularise more stuff
* add safety for audio models
* style
* have a correct fallback, you daft donkey
* remove this argh
* change heuristic for audio models
* fixup
* revert
* this works
* this should be explicit
* fix Nth ESM exception
* tryout decoder
* this as well
* revert again
* 🧠
* aaah ESM has two modelings aaah
* broom broom
* format
* wrong copies
* copies
* modular cleanups
* format
* modularities
* wrong mergefix
* seriously
* align with new model
* new model
* update everywhere
* style
* pipelines
* switch it everywhere in tests
* switch it everywhere in docs
* switch in converters everywhere
* update in examples
* update in model docstrings
* style
* warnings
* style
* Update configuration_utils.py
* fix
* Update configuration_utils.py
* fixes and add first test
* add pipeline tests
* Update test_pipelines_common.py
* add config test
* Update test_modeling_common.py
* add new ones
* post rebase
* add new
* post rebase adds
* Update trainer.md
* Update trainer.md
Removed the detail about label_names argument usage from the tip/ warning section
* Update training_args.py
Added the label_names usage clarification in the docstring
* Update trainer.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* handle support for cache classes when num enc layers != num dec layers
* handle overwrites
* one more corner case
* Update src/transformers/generation/utils.py
* Update src/transformers/generation/utils.py
* Apply suggestions from code review
* handle corner case :o
* fix
* cleanup, revert aimv2 fa changes
* fix aria
* i searched a long time but the cross dependency is for the recent models so...
* this was something... evolla
* fix modernbert decoder + make fa test more robust
* nit
* Clean up xcodec addition.
* Clean up config.
* Switch to fixtures test.
* Small stuff.
* Polish XCodec and standardize across codecs.
* Update src/transformers/models/xcodec/modeling_xcodec.py
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Format and fix test.
* Update tol.
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* make visualizer rely on create causal mask
* format
* fixup
* fixup
* read token
* read token, duh
* what is up with that token
* small tests?
* adjust
* try with flush
* normalize for ANSI
* buffer shenanigans
* Fix links in Glm4vMoe configuration classes to point to the correct Hugging Face model repository
* run fixup to update links in Glm4vMoe configuration classes to point to the correct Hugging Face model repository
* add basic type hints to import module
* run make fixup
* remove optional
* fixes
---------
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* it was long due!
* use the official kernel
* more permissive
* update the kernel as well
* mmm should it be this?
* up pu
* fixup
* Update test_modeling_gpt_oss.py
* style
* start with 20b
* Update modeling_utils.py
* make sure we update with the module's plan
* use public api
* oups
* update
* fix failing test
* Update src/transformers/integrations/tensor_parallel.py
* Update src/transformers/integrations/tensor_parallel.py
* fix
* make the API more friendly!
* fix tests
* fix styling
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* init
* add modular
* fixup
* update configuration
* add processing file
* update auto files
* update
* update modular
* green setup_and_quality ci
* it works
* fix some tests
* commit florence2
* update test
* make test cases done - 16 left
* style
* fix few test cases
* fix some tests
* fix init test
* update florence2 vision style
* hope is green
* fix init test
* fix init
* update modular
* refactor vision module
* fix: channel attention use dynamic scale
* update modular
* update
* update attention mask
* update
* fix naming
* Update src/transformers/models/florence2/processing_florence2.py
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* spatial block works
* more beautiful
* more more beautiful
* merge main
* merge main and fixup
* fix typing hint
* update modeling
* fix eager matches sdpa
* fix style
* fix compile test - all green
* remove florence2 language
* remove Florence2LanguageModel things
* fix style
* update florence2 model
* override prepare encoder_decoder for generation
* add weight conversion script
* rewrite channel attention to use sdpa
* eliminate 1 transpose op
* support fa2
* fix quality check
* chore: reformat `test_modeling_florence2.py`
* some refactor for processor
* some refactor for processor
* update naming convention and remove BC
* make it pass the test
* fix: correct Embedding Cosine
* update comments and docstring
* support input_embeds
* support input embeds ideally
* fix style
* fix style
* fix style again :D
* add test processor
* refactor processor and add test for processor
* reformat test processor
* make fixup
* fix schema check
* remove image_token
* ensure image token in tokenizer and fix integration tests
* fix processor test
* add more integration tests for large model and rename test_processor to test_processing
* test_assisted_decoding_sample should pass
* update doc and make model work with image text to text pipeline
* docs: add sdpa badge
* resolve cyril's comments
* fix import torch error
* add helper get_placeholder_mask
* inherit from llava
* florence2 may not _supports_attention_backend because of bart ...
* move florence2 model card to multimodal
* let base model always return_dict
* fix style
* tiny update doc
* set _checkpoint_conversion_mapping = {}
* fix code quality
* support flex and compile graph and move external func to internal func
* remove condition because it is always true
* remove window funcs
* move post processor config out
* fix ci
* new intro to trigger test
* remove `kernel_size` argument
---------
Co-authored-by: ducviet00-h2 <viet.d.hoang@h2corporation.jp>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* fix: pass adamw optimizer parameters to StableAdamW
* add test for stable_adamw initialization with trainer arguments
* address copilot suggestion
* fix: update weight_decay handling in stable_adamw kwargs
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update GPT-NeoX-Japanese model card
* Apply suggestions from code review
* Update gpt_neox_japanese.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Standardize RAG model card
Update rag.md to follow the new Hugging Face model card template:
- Added friendly overview in plain language
- Added pipeline and AutoModel usage examples
- Included quantization example with BitsAndBytesConfig
- Added notes and resources sections
- Removed abstract and FlashAttention badge
* Standardize RAG model card
Update rag.md to follow the new Hugging Face model card template:
- Added friendly overview in plain language
- Added AutoModel usage example
- Included quantization example with BitsAndBytesConfig
* Fix chat CLI GPU loading and request_id validation issues (#40230)
This commit addresses two critical bugs in the transformers chat CLI:
1. **GPU Loading Issue**: Changed default device from "cpu" to "auto" in ChatArguments
- Chat CLI now automatically uses GPU when available instead of defaulting to CPU
- Matches the behavior of the underlying serving infrastructure
2. **Request ID Validation Error**: Added request_id field to TransformersCompletionCreateParamsStreaming schema
- Fixes "Unexpected keys in the request: {'request_id'}" error on second message
- Allows request_id to be properly sent and validated by the server
Both fixes target the exact root causes identified in issue #40230:
- Users will now get GPU acceleration by default when available
- Chat sessions will no longer break after the second message
* Remove unrelated request_id field from TransformersCompletionCreateParamsStreaming
* Update image_processing_perception_lm_fast.py
Allow for a proper override of vision_input_type in hf fast image processor, otherwise we need to resort to manually setting the attribute.
* Update processing_perception_lm.py to match kwargs vision input type
* Update image_processing_perception_lm_fast.py kwargs to signature args
* Skipping pytree registration in case fsdp is enabled
* Beauty changes
* Beauty changes
* Moved the is_fsdp_available function to import utils
* Moved is_fsdp_available to integrations.fsdp
* Added pytree registration inside dynamic cache class
* Making ci/cd lords happy
* Adding a check if DynamicCache is already a leaf
* Adding try/catch for multiple initializations of DynamicCache in test suites
* Moving dynamic cache pytree registration to executorch
* Adding try catch back
* set inputs_embeds to None while generate to avoid audio encoder forward in generation process
* set input_features to none instead
---------
Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>
* Add expectation to t5 for rocm 9.4
* Made EncoderDecoderCache compatible with nn.DataParallel
* Fixed t5gemma EncoderDecoderCache
* Added todos in autoformer
* Ruff
* Init is self-contained
* Review compliance
* Fixed kwargs init of EncoderDecoderCache
* add jinja2 as a dependency
* Make jinja2 a core dependency in install_requires
- Add jinja2 to install_requires list in setup.py for automatic installation
- Add jinja2 to runtime version checks in dependency_versions_check.py
- Resolves issue where pip install transformers doesn't install jinja2
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Make jinja2 a core dependency in install_requires
* Make jinja2 an extra dependency instead of adding a core dep
---------
Co-authored-by: Claude <noreply@anthropic.com>
* remove transpose_for_scores call
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* fix copied evolla code
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
---------
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* fix error vocab_size at Qwen2_5_VLForConditionalGeneration loss_function
Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>
* fix similar error at qwen2_vl and do make fix-copies
Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>
* pass in kwargs for loss_func at qwen2_vl and qwen2_5_vl
Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>
* Apply style fixes
---------
Signed-off-by: luoxiaoc <xiaochuan.luo@intel.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Revert "Pin torch to 2.7.1 on CircleCI for now (#40174)"
This reverts commit 31b6e6e1dac0d32f74ec5cd6b3c1868534ccd7b5.
* fix
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* docs: Update LayoutLM model card with standardized format
* Apply suggestions from code review
This commit incorporates all suggestions provided in the recent review. Further changes will be committed separately to address remaining comments.
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Address remaining review comments
* Address few more review comments:
1. remove transformer-cli section
2. put resources after notes
3. change API refs to 2nd level header
* Update layoutlm.md
* Update layoutlm.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update check_tokenizers.py
chore(typing): add type hints to check_tokenizers script
- Annotate params/returns for helper functions
- Keep tokenizer instances as `Any` to avoid runtime coupling
- Make `check_LTR_mark` return `bool` explicitly (no behavior change)
* Update check_tokenizers.py
chore(typing): replace Any with PreTrainedTokenizerBase in check_tokenizers.py
- Use transformers.tokenization_utils_base.PreTrainedTokenizerBase for `slow` and `fast` params
- Covers both PreTrainedTokenizer and PreTrainedTokenizerFast
- Exposes required methods (encode, decode, encode_plus, tokenize)
- Removes generic Any typing while staying implementation-agnostic
* [MINOR:TYPO] Update base.py
All other occurrences in the docs use lowercase. (https://github.com/search?q=repo%3Ahuggingface%2Ftransformers%20translation_XX_to_YY&type=code)
Also, using uppercase doesn't work: tested with "translation_EN_to_FR" which doesn't work and instead returns: `ValueError: The task does not provide any default models for options ('EN', 'FR')`
It might be a good idea to allow for uppercase, but that's for another issue.
* [MINOR:TYPO] Update __init__.py
* update
* fix the test for DepthPro
* PR comments
* wait, I didn't delete this in prev commit?
* fix
* better way
---------
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* added dates to the models with a single hf papers link
* added the dates for models with multiple papers
* half of no_papers models done
* rest of no_papers models also done, only the exceptions left
* added copyright disclaimer to sam_hw, cohere, cohere2 + dates
* some more fixes, hf links + typo
* some new models + a rough script
* the script looks robust, changed all paper links to hf
* minor change to handle technical reports along with blogs
* ran make fixup to remove the white space
* refactor
* build: add TvpImageProcessorFast
- Introduced TvpImageProcessorFast to enhance image processing capabilities.
- Updated image processing auto registration to include the new fast processor.
- Modified tests to accommodate both TvpImageProcessor and TvpImageProcessorFast, ensuring comprehensive coverage for both classes.
* fix: TvpImageProcessorFast with new resize method and update processing logic
* build: add TvpImageProcessorFast
* refactor: clean up whitespace and formatting in TvpImageProcessorFast and related tests
- Removed unnecessary whitespace and ensured consistent formatting in image_processing_tvp_fast.py.
- Updated import order in test_image_processing_tvp.py for clarity.
- Minor adjustments to maintain code readability and consistency.
* fix: Enhance TvpFastImageProcessorKwargs and update documentation
- Added TvpFastImageProcessorKwargs class to define valid kwargs for TvpImageProcessorFast.
- Updated the documentation in tvp.md to include the new class and its parameters.
- Refined the image processing logic in image_processing_tvp_fast.py for better handling of padding and resizing.
- Improved test cases in test_image_processing_tvp.py to ensure compatibility with the new processing logic and tensor inputs.
* fix: tested now with python 3.9
* fix: remove tvp kwargs from docs
* simplify processing
* remove import and fix tests
---------
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* fix: changed is_causal to be False
* fix: Added original cross attention bug
* fix: fixed the way border removal is computed
* fix: added missing normalization on coarse features
* test: fixed integration tests
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* initial comment
* test
* initial conversion for outline
* intermediate commit for configuration
* chore:init files for sam2
* adding arbitrary undefined config
* check
* add vision
* make style
* init sam2 base model
* Fix imports
* Linting
* chore:sam to sam2 classes
* Linting
* Add sam2 to models.__init__
* chore:match prompt encoder with sam2 code
* chore:prepare kwargs for mask decoder
* Add image/video predictors
* Add CUDA kernel
* Add output classes
* linting
* Add logging info
* tmp commit
* docs for sam2
* enable image processing
* check difference of original SAM2
- difference is the order of ToTensor()
- please see https://pytorch.org/vision/main/_modules/torchvision/transforms/functional.html#resize
* enable promptencoder of sam2
* fix promptencoder
* Confirmed that PromptEncoder is exactly same (Be aware of bfloat16 and float32 difference)
* Confirmed that ImageEncoder is exactly same (Be aware the linting of init)
* Confirmed that MaskDecoder is exactly same (TO DO: lint variable name)
* SamModel is now available (Need more chore for name)
* make fix-copies
* make style
* make CI happy
* Refactor VisionEncoder and PositionEmbedding
* TO DO : fix the image_embeddings and sparse_embeddings part
* pure image inference done
* reusable features fix and make style
* styling
* refactor memoryattention
* tmp
* tmp
* refactor memoryencoder
TO DO : convert and inference the video pipeline
* TO DO : fix the image_encoder shape
* conversion finish
TO DO: need to check video inference
* make style
* remove video model
* lint
* change
* python utils/check_docstrings.py --check_all
* python utils/check_config_attributes.py
* remove copies for sam2promptencoder due to configuration
* change __init__.py
* remove tensorflow version
* fix that to not use direct comparison
* make style
* add missing import
* fix image_embedding_size
* refactor Sam2 Attention
* add fully working video inference (refactoring todo)
* clarify _prepare_memory_conditioned_features
* simplify modeling code, remove unused paths
* use one model
* use auto_docstring
* refactor rope embeddings
* nit
* not using multimask when several points given
* add all sam2.1
* add video tmp
* add Sam2VideoSessionState + fast image proc + video proc
* remove init_states from model
* fix batch inference
* add image integration tests
* uniformize modeling code with other sam models and use modular
* pass vision tests and most model tests
* All tests passing
* add offloading inference state and video to cpu
* fix inference from image embedding and existing mask
* fix multi_boxes mask inference
* Fix batch images + batch boxes inference
* improve processing for image inference
* add support for mask generation pipeline
* add support for get_connected_components post processing in mask generation
* add fast image processor sam, image processor tests and use modular for sam2 image processor
* fix mistake in sam after #39120
* fix init weights
* refactor convert
* add integration tests for video + other improvements
* add needed missing docstrings
* Improve docstrings and
* improve inference speed by avoiding cuda sync
* add test
* skip test for vision_model
* minor fix for vision_model
* fix vision_model by adding sam2model and change the torch dependencies
* remove patch_size
* remove image_embedding_size
* fix patch_size
* fix test
* make style
* Separate hieradet and vision encoder in sam2
* fixup
* review changes part 1
* remove MemoryEncoderConfig and MemoryAttentionConfig
* pass q_stride instead of q_pool module
* add inference on streamed videos
* explicitly process streamed frames
* nit
* Improve docstrings in Sam2Model
* update sam2 modeling with better management of inference state and cache, and separate Sam2Model and Sam2VideoModel
* improve video inference api
* change inference_state to inference_session
* use modular for Sam2Model
* fix convert sam2 hf
* modular
* Update src/transformers/models/sam2/video_processing_sam2.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* fix minor config
* fix attention loading error
* update modeling tests to use hub checkpoints
* Use CI A10 runner for integration tests values + higher tolerance for video integration tests
* PR review part 1
* fix doc
* nit improvements
* enforce one input format for points, labels and boxes
* nit
* last few nits from PR review
* fix style
* fix the input type
* fix docs
* add sam2 model as conversion script
* improve sam2 doc
* nit fixes + optimization
* split sam2 and sam2_video in two models
* PR review part 1
* fix None for default slow processor of sam2
* remove unnecessary code path in sam2_video
* refactor/simplify RoPE
* replace embedding module list with embedding matrix
* fix tests
* remove kernel
* nit
* use lru_cache for sine_pos_embeddings
* reorder sam2_video methods
* simplify sam2_video
* PR review part 1
* simplify sam2 video a lot
* more simplification
* update integration tests with updated conftest
* more explicit config for hieradet
* do post_processing outside of sam2 video model
* Improve Sam2VideoVisionRotaryEmbedding
* fix tests
* update docs and fix mask2former/oneformer
* avoid unnecessary reshapes/permute
* fix device concatenating points
* small dtype fix
* PR review
* nit
* fix style and finish up doc
* fix style
* fix docstrings
* fix modular
---------
Co-authored-by: RUFFY-369 <prakarshkaushik369@gmail.com>
Co-authored-by: Haitham Khedr <haithamkhedr@meta.com>
Co-authored-by: sangbum choi <sangbumchoi@sangbumui-MacBookAir.local>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* docs: ko: main_classes/optimizer_schedules
* feat: nmt draft
* fix: improve TOC anchors and expressions in optimizer_schedules
- Add TOC anchors to all section headers
- Fix terminology and improve Korean expressions
* fix: Correct translation of 'weight decay fixed' to '가중치 감쇠가 적용된'
Changed '가중치 감쇠가 수정된' to '가중치 감쇠가 적용된' for more accurate translation of 'weight decay fixed' in the context of optimization.
* fix: Use more natural Korean inheritance expression
Changed '에서 상속받는' to '을 상속받는' to follow natural Korean grammar patterns for inheritance terminology.
* fix: Use consistent '미세 조정' translation for 'finetuned models'
Changed '파인튜닝된' to '미세 조정된 모델' to follow the established translation glossary for 'finetuned models' terminology.
* use pil_torch_interpolation_mapping for NEAREST/NEAREST_EXACT
* fix min torchvision version
* use InterpolationMode directly
* remove unused is_torchvision_greater_or_equal
* nit
* Add initial collated reports script and job definition
* provide commit hash for this run. Also use hash in generated artifact name. Json formatting
* tidy
* Add option to upload collated reports to hf hub
* Add glob pattern for test report folders
* Fix glob
* Use machine_type as path filter instead of glob. Include machine_type in collated report
* fix flash attention
* i got a stroke reading that comment
* change dropout kwarg back to before
* rename _fa3... as it's used for multiple variants and should work as fallback instead
* simplify imports and support kwargs for fa
* style
* fix comments order
* small fix
* skip kernels test (causes cuda illegal memories w/o cleanup), fix fa test in general esp for models like bart
* style
* allow fullgraph by preloading on init
* make globals "private"
* ci pls be happy
* change skip conditions based on backend flag (indicating missing mask interface)
* move globals support to a function to prepare kwargs
* style
* generalize supported kwargs
* small change to doc
* fix
* add comments
* style
* revert prep during generate
* style
* revert weird style changes
* add fa kwarg prep during generate with fixes back
* how did this even happen
* how
* add comment
Currently model_debugging_utils.py would have an unguarded `import torch.distributed.tensor`. This PR ensures that the distributed module is available before including its tensor module.
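One way to guard such an import is sketched below. `import_if_available` is a hypothetical helper written for illustration, not the code added by the PR. It assumes the parent package may expose an `is_available()` function (as `torch.distributed` does) and falls back to a plain import attempt otherwise.

```python
import importlib


def import_if_available(parent_name, child_name):
    """Hypothetical helper: import `parent.child` only when the parent is usable."""
    try:
        parent = importlib.import_module(parent_name)
    except ImportError:
        return None  # parent package is not installed at all
    # Some packages report at runtime whether they were built with support.
    is_available = getattr(parent, "is_available", None)
    if callable(is_available) and not is_available():
        return None
    try:
        return importlib.import_module(f"{parent_name}.{child_name}")
    except ImportError:
        return None


# Usage: returns None instead of raising when distributed support is missing.
dist_tensor = import_if_available("torch.distributed", "tensor")
```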
* Fix PerceptionLM image preprocessing for non-tiled image input.
* Add test for single tile vanilla image processing.
* ruff format
* recover missing test skip
* Simplify test.
* minor test name fix
* Update HuBERT model card according to template
Standardized HuBERT doc, added ASR examples, Flash Attention 2 support, and quantization section.
* Address review comments and changes requested to hubert.md
* Update hubert.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* init
* update
* update
* ruff
* t patch is 2 default not 1
* draft
* back
* back1
* update
* config update
* update using glm-41 format
* add self.rope_scaling = config.rope_scaling
* update config
* update
* remove the processor
* update
* fix tests
* update
* for test
* update
* update 2126
* self.rope_scaling is missing in GLM4MOE lets add it
* update
* update
* Update modular_glm4v_moe.py
* change config
* update apply_multimodal_rotary_pos_emb
* format
* update
* Delete 3-rollout_qas_thinking_answers.py
* use right name
* update with place holder
* update
* use right rotary
* Update image_processing_glm4v_fast.py
* rope_config_validation needs to rewrite the entire config file in modular
* update
* changed name
* update
* Update modeling_glm4v_moe.py
* _init_weights should be added in Glm4vMoePreTrainedModel
* remove use_qk_norm
* Update modular_glm4v_moe.py
* remove use_qk_norm as it is not used
* fix style
* deprecations are not needed on new models
* fix merge issues
---------
Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>
* all modulars and llama
* apply modular
* bert and gpt2 copies
* fix imports
* do it everywhere
* fix import
* finalize it
* fix
* oups set it in modular
* style
* fix
* Add 1 version to deprecation cycle
* Update modeling_layers.py
* Fix missing video inputs for PerceptionLM.
* Minor fix for vanilla input image (only C,H,W, no tiles dim).
* Revert "Minor fix for vanilla input image (only C,H,W, no tiles dim)."
This reverts commit 181d87b964e59c4118035a9fd4f530c6e551ba9f.
* Add amd expectation in internvl
* Add amd expectation to llama
* Added bnb decorator for a llava test that requires bnb
* Added amd expectation for mistral3
* Style
* Support input_embeds in torch exportable decoders
* Hybrid cache update
* Manually change some callsites
* AI changes the rest of the call sites
* Make either input_ids/inputs_embeds mandatory
* Clean up
* Ruff check --fix
* Fix test
* pr review
* Revert config/generation_config changes
* Ruff check
* chore: update Deformable_Detr model card
* fix: added pipeline, automodel examples and checkpoints link
* Update deformable_detr.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Fix MXFP4 quantizer validation to enable CPU dequantization
Move dequantize check before CUDA availability check to allow
CPU inference when quantization_config.dequantize is True.
This enables users to run MXFP4 models on CPU by automatically
converting them to BF16 format.
* Add tests for MXFP4 quantizer CPU dequantization validation
* fix: format mxfp4 test file with ruff
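The reordering described in the MXFP4 fix above can be illustrated with a small sketch. `FakeMxfp4Config` and `validate_environment` are hypothetical names used only for this example; the one detail taken from the commit message is that the `dequantize` check now runs before the CUDA availability check.

```python
from dataclasses import dataclass


@dataclass
class FakeMxfp4Config:
    """Hypothetical stand-in for the quantization config."""

    dequantize: bool = False


def validate_environment(config, cuda_available):
    """Return the execution path to use, or raise if none can run."""
    # Checked first: dequantizing to BF16 does not need a GPU.
    if config.dequantize:
        return "dequantize"
    # Only the quantized path depends on CUDA being present.
    if not cuda_available:
        raise RuntimeError(
            "Quantized MXFP4 execution needs CUDA; set dequantize=True to run on CPU."
        )
    return "quantized"
```

With the checks in the opposite order, a CPU-only machine would fail before the `dequantize` flag was ever consulted.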
# copilot-instructions.md Guide for Hugging Face Transformers
This copilot-instructions.md file provides guidance for code agents working with this codebase.
## Core Project Structure
- `/src/transformers`: This contains the core source code for the library
  - `/models`: Code for individual models. Models inherit from base classes in the root `/src/transformers` directory.
- `/tests`: This contains the core test classes for the library. These are usually inherited rather than directly run.
  - `/models`: Tests for individual models. Model tests inherit from common tests in the root `/tests` directory.
- `/docs`: This contains the documentation for the library, including guides, tutorials, and API references.
## Coding Conventions for Hugging Face Transformers
- PRs should be as brief as possible. Bugfix PRs in particular can often be only one or two lines long, and do not need large comments, docstrings or new functions in this case. Aim to minimize the size of the diff.
- When writing tests, they should be added to an existing file. The only exception is for PRs to add a new model, when a new test directory should be created for that model.
- Code style is enforced in the CI. You can install the style tools with `pip install -e .[quality]`. You can then run `make fixup` to apply style and consistency fixes to your code.
## Copying and inheritance
Many models in the codebase have similar code, but it is not shared by inheritance because we want each model file to be self-contained.
We use two mechanisms to keep this code in sync:
- "Copied from" syntax. Functions or entire classes can have a comment at the top like this: `# Copied from transformers.models.llama.modeling_llama.rotate_half` or `# Copied from transformers.models.t5.modeling_t5.T5LayerNorm with T5->MT5`
These comments are actively checked by the style tools, and copies will automatically be updated when the base code is updated. If you need to update a copied function, you should
either update the base function and use `make fixup` to propagate the change to all copies, or simply remove the `# Copied from` comment if that is inappropriate.
- "Modular" files. These files briefly define models by composing them using inheritance from other models. They are not meant to be used directly. Instead, the style tools
automatically generate a complete modeling file, like `modeling_bert.py`, from the modular file like `modular_bert.py`. If a model has a modular file, the modeling file
should never be edited directly! Instead, changes should be made in the modular file, and then you should run `make fixup` to update the modeling file automatically.
When adding new models, you should prefer `modular` style and inherit as many classes as possible from existing models.
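To make the "Copied from" syntax described above concrete, here is a minimal sketch that parses such a comment into its source path and rename rules. `parse_copied_from` and its regular expression are hypothetical and written only for illustration. They are not the checker the style tools actually run, and they handle only the simple `Old->New` rule form shown in the examples above.

```python
import re

# Hypothetical pattern: a dotted source path, optionally followed by rename rules.
COPIED_FROM = re.compile(
    r"^\s*# Copied from (?P<source>transformers\.[\w.]+)(?: with (?P<rules>.+))?$"
)


def parse_copied_from(comment):
    """Return (source_path, [(old, new), ...]) or None if the comment does not match."""
    match = COPIED_FROM.match(comment)
    if match is None:
        return None
    rules = []
    if match.group("rules"):
        for rule in match.group("rules").split(","):
            old, new = rule.strip().split("->")
            rules.append((old.strip(), new.strip()))
    return match.group("source"), rules


print(parse_copied_from("# Copied from transformers.models.t5.modeling_t5.T5LayerNorm with T5->MT5"))
```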
## Testing
After making changes, you should usually run `make fixup` to ensure any copies and modular files are updated, and then test all affected models. This includes both
the model you made the changes in and any other models that were updated by `make fixup`. Tests can be run with `pytest tests/models/[name]/test_modeling_[name].py`.
If your changes affect code in other classes like tokenizers or processors, you should run those tests instead, like `test_processing_[name].py` or `test_tokenization_[name].py`.
In order to run tests, you may need to install dependencies. You can do this with `pip install -e .[testing]`. You will probably also need to `pip install torch accelerate` if your environment does not already have them.
RUN_SLOW: yes
# For gated repositories, we still need to agree to share information on the Hub repo. page in order to get access.
# This token is created under the bot `hf-transformers-bot`.
name: Self-hosted runner scale set (AMD mi300 scheduled CI caller)
name: Self-hosted runner scale set (AMD mi355 scheduled CI caller)
# Note: For every job in this workflow, the name of the runner scale set is finalized in the runner yaml i.e. huggingface/hf-workflows/.github/workflows/transformers_amd_ci_scheduled_arc_scale_set.yaml
# For example, 1gpu scale set: amd-mi300-ci-1gpu
# 2gpu scale set: amd-mi300-ci-2gpu
# For example, 1gpu : amd-mi355-ci-1gpu
# 2gpu : amd-mi355-ci-2gpu
on:
  workflow_run:
    workflows: ["Self-hosted runner (AMD scheduled CI caller)"]
RUN_SLOW: yes
# For gated repositories, we still need to agree to share information on the Hub repo. page in order to get access.
# This token is created under the bot `hf-transformers-bot`.
# because the SSH can be enabled dynamically if the workflow failed, so we need to store slack infos to be able to retrieve them during the waitforssh step
Like the slow tests, there are other environment variables available which are not enabled by default during testing:
- `RUN_CUSTOM_TOKENIZERS`: Enables tests for custom tokenizers.
More environment variables and additional information can be found in the [testing_utils.py](https://github.com/huggingface/transformers/blob/main/src/transformers/testing_utils.py).
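To make the idea of an environment-gated test concrete, here is a minimal sketch that uses only the standard library. The helper `env_flag` and the set of values it treats as "enabled" are assumptions made for illustration; the real logic lives in `testing_utils.py` and may differ.

```python
import os
import unittest


def env_flag(name: str, default: bool = False) -> bool:
    """Hypothetical helper: treat 1/true/yes/on (any case) as enabled."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


class CustomTokenizerTest(unittest.TestCase):
    # The flag is read once, when the class body is evaluated.
    @unittest.skipUnless(env_flag("RUN_CUSTOM_TOKENIZERS"), "set RUN_CUSTOM_TOKENIZERS=1 to run")
    def test_placeholder(self):
        self.assertTrue(True)
```

With a gate like this, running the test file with `RUN_CUSTOM_TOKENIZERS=1` in the environment enables the test, and leaving the variable unset skips it.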
@ -247,7 +246,6 @@ You are not required to read the following guidelines before opening an issue. H
Try not to use italics and bold text too much as these often make the text more difficult to read.
12. If you are cross-referencing a specific comment in a given thread or another issue, always link to that specific comment, rather than using the issue link. If you do the latter it could be quite impossible to find which specific comment you're referring to.
To get the link to the specific comment do not copy the url from the location bar of your browser, but instead, click the `...` icon in the upper right corner of the comment and then select "Copy Link".
@ -257,7 +255,6 @@ You are not required to read the following guidelines before opening an issue. H
13. If you are replying to a last comment, it's totally fine to make your reply with just your comment in it. The readers can follow the information flow here.
But if you're replying to a comment that happened some comments back it's always a good practice to quote just the relevant lines you're replying to. The `>` is used for quoting, or you can always use the menu to do so. For example your editor box will look like:
and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from `transformers`.
@ -80,7 +81,7 @@ Explore the [Hub](https://huggingface.com/) today to find a model and use Transf
## Installation
Transformers works with Python 3.9+ [PyTorch](https://pytorch.org/get-started/locally/) 2.1+, [TensorFlow](https://www.tensorflow.org/install/pip) 2.6+, and [Flax](https://flax.readthedocs.io/en/latest/) 0.4.1+.
Transformers works with Python 3.9+, and [PyTorch](https://pytorch.org/get-started/locally/) 2.1+.
Create and activate a virtual environment with [venv](https://docs.python.org/3/library/venv.html) or [uv](https://docs.astral.sh/uv/), a fast Rust-based Python package and project manager.
@ -14,7 +14,7 @@ Models uploaded on the Hugging Face Hub come in different formats. We heavily re
models in the [`safetensors`](https://github.com/huggingface/safetensors) format (which is the default prioritized
by the transformers library), as developed specifically to prevent arbitrary code execution on your system.
To avoid loading models from unsafe formats (e.g. [pickle](https://docs.python.org/3/library/pickle.html)), you should use the `use_safetensors` parameter. If you do so and no .safetensors file is present, transformers will raise an error when loading the model.
@ -6,7 +6,7 @@ developers, researchers, students, professors, engineers, and anyone else to bui
In this list, we showcase incredibly impactful and novel projects that have pushed the field forward. We celebrate
100 of these projects as we reach the milestone of 100k stars as a community; but we're very open to pull requests
adding other projects to the list. If you believe a project should be here and it's not, then please, open a PR
to add it.
## [gpt4all](https://github.com/nomic-ai/gpt4all)
@ -49,7 +49,7 @@ Keywords: LLMs, Large Language Models, Agents, Chains
[LlamaIndex](https://github.com/run-llama/llama_index) is a project that provides a central interface to connect your LLM's with external data. It provides various kinds of indices and retrieval mechanisms to perform different LLM tasks and obtain knowledge-augmented results.
Keywords: LLMs, Large Language Models, Data Retrieval, Indices, Knowledge Augmentation
@ -257,7 +257,7 @@ Stable-Dreamfusion is a pytorch implementation of the text-to-3D model Dreamfusi
Keywords: Text-to-3D, Stable Diffusion
## [txtai](https://github.com/neuml/txtai)
[txtai](https://github.com/neuml/txtai) is an open-source platform for semantic search and workflows powered by language models. txtai builds embeddings databases, which are a union of vector indexes and relational databases enabling similarity search with SQL. Semantic workflows connect language models together into unified applications.
Keywords: Semantic search, LLM
@ -309,8 +309,8 @@ Keywords: OCR, LaTeX, Math formula
OpenCLIP is an open source implementation of OpenAI's CLIP.
The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift.
The starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset.
Specifically, a ResNet-50 model trained with this codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet.
@ -596,7 +596,7 @@ Keywords: Data-Centric AI, Data Quality, Noisy Labels, Outlier Detection, Active
## [BentoML](https://github.com/bentoml/BentoML)
[BentoML](https://github.com/bentoml) is the unified framework for building, shipping, and scaling production-ready AI applications incorporating traditional ML, pre-trained AI models, Generative and Large Language Models.
All Hugging Face models and pipelines can be seamlessly integrated into BentoML applications, enabling the running of models on the most suitable hardware and independent scaling based on usage.
Keywords: BentoML, Framework, Deployment, AI Applications
@ -606,4 +606,3 @@ Keywords: BentoML, Framework, Deployment, AI Applications
[LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) offers a user-friendly fine-tuning framework that incorporates PEFT. The repository includes training(fine-tuning) and inference examples for LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, and other LLMs. A ChatGLM version is also available in [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning).
For uploading results, you need a HuggingFace token with write permissions to the target dataset. You can provide the token in several ways (in order of precedence):
1. Command line: `--token hf_your_token_here`
3. Environment variable: `HF_TOKEN`
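The precedence rule can be sketched as a small resolver. This is an illustration only: the function name `resolve_token` is hypothetical, and because the second source in the list is not shown here, the sketch covers just the command-line flag and the `HF_TOKEN` environment variable.

```python
import argparse
import os
from typing import Optional


def resolve_token(cli_token: Optional[str], env: Optional[dict] = None) -> Optional[str]:
    """Return the first available token, command line first, HF_TOKEN last."""
    env = os.environ if env is None else env
    if cli_token:
        return cli_token
    # The second source from the list above is not shown, so it is skipped here.
    return env.get("HF_TOKEN")


parser = argparse.ArgumentParser()
parser.add_argument("--token", default=None)
args = parser.parse_args(["--token", "hf_your_token_here"])
print(resolve_token(args.token, env={"HF_TOKEN": "hf_from_env"}))
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.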
### Running Specific Benchmarks
```bash
# Include only specific benchmarks
python run_benchmarks.py --include llama
# Exclude specific benchmarks
python run_benchmarks.py --exclude old_benchmark
```
## Output Format
Results are saved as JSON files with the following structure:
```json
{
"model_name": "llama_2_7b",
"benchmark_scenarios": [
{
"scenario_name": "eager_variant",
"metadata": {
"timestamp": "2025-01-XX...",
"commit_id": "abc123...",
"hardware_info": {
"gpu_name": "NVIDIA A100",
"gpu_memory_total": 40960,
"cpu_count": 64
},
"config": {
"variant": "eager",
"warmup_iterations": 3,
"measurement_iterations": 5
}
},
"measurements": {
"latency": {
"mean": 2.45,
"median": 2.43,
"std": 0.12,
"min": 2.31,
"max": 2.67,
"p95": 2.61,
"p99": 2.65
},
"time_to_first_token": {
"mean": 0.15,
"std": 0.02
},
"tokens_per_second": {
"mean": 87.3,
"unit": "tokens/sec"
}
},
"gpu_metrics": {
"gpu_utilization_mean": 85.2,
"gpu_memory_used_mean": 12450
}
}
]
}
```
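As a hedged example of consuming this structure, the sketch below pulls the mean latency and throughput out of each scenario. The `summarize` helper is hypothetical, and the embedded sample is a trimmed copy of the structure above rather than real benchmark output.

```python
import json

# Trimmed sample that follows the structure documented above.
sample = """
{
  "model_name": "llama_2_7b",
  "benchmark_scenarios": [
    {
      "scenario_name": "eager_variant",
      "measurements": {
        "latency": {"mean": 2.45, "p95": 2.61},
        "tokens_per_second": {"mean": 87.3, "unit": "tokens/sec"}
      }
    }
  ]
}
"""


def summarize(results: dict) -> list[tuple]:
    """Collect (scenario name, mean latency, mean tokens/sec) per scenario."""
    rows = []
    for scenario in results["benchmark_scenarios"]:
        measurements = scenario["measurements"]
        rows.append(
            (
                scenario["scenario_name"],
                measurements["latency"]["mean"],
                measurements["tokens_per_second"]["mean"],
            )
        )
    return rows


for name, latency, tps in summarize(json.loads(sample)):
    print(f"{name}: latency mean {latency}, {tps} tokens/sec")
```

For a real run, replace `json.loads(sample)` with `json.load(f)` on the saved results file.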
### Debug Mode
```bash
python run_benchmarks.py --log-level DEBUG
```
## Contributing
To add new benchmarks:
1. Create a new file in `benches/`
2. Implement the `ModelBenchmark` interface
3. Add a runner function (`run_<benchmark_name>` or `run_benchmark`)
"The French Revolution was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799.",
"Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse.",
"It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.",
"Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614.",
"The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June.",
"The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.",
"The next three years were dominated by a struggle for political control.",
"King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792.",
"As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.",
"After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical Jacobins led by Maximilien Robespierre.",
"About 16,000 people were sentenced by the Revolutionary Tribunal and executed in the Reign of Terror, which ended in July 1794 with the Thermidorian Reaction.",
"Weakened by external threats and internal opposition, the Committee of Public Safety was replaced in November 1795 by the Directory.",
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
])  # fmt: skip
def compact_json_numeric_arrays(data: dict):
    # Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
@ -20,22 +20,21 @@ To generate the documentation, you first have to build it. Several packages are
you can install them with the following command, at the root of the code repository:
```bash
pip install -e ".[docs]"
pip install -e ".[dev]"
```
> [!NOTE]
> This command might fail for some OS that are missing dependencies. Check step 4 in [Create a Pull Request](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request) to work around it.
Then you need to install our special tool that builds the documentation:
The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR. A bot will add a comment with a link to where the documentation with your changes lives.
> [!NOTE]
> The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` & restart `preview` command (`ctrl-c` to stop it & call `doc-builder preview ...` again).
## Adding a new element to the navigation bar
@ -164,6 +159,9 @@ These classes should be added using our Markdown syntax. Usually as follows:
[[autodoc]] XXXConfig
```
> [!IMPORTANT]
> Always add a blank line after `[[autodoc]]` to ensure it passes the CI/CD checks.
This will include every public method of the configuration that is documented. If for some reason you wish for a method
not to be displayed in the documentation, you can do so by specifying which methods should be in the docs:
1. Start with the `_toctree.yml` file that corresponds to your documentation chapter. This file is essential for rendering the table of contents on the website.
- If the `_toctree.yml` file doesn’t exist for your language, create one by copying the English version and removing unrelated sections.
- If the `_toctree.yml` file doesn't exist for your language, create one by copying the English version and removing unrelated sections.
- Ensure it is placed in the `docs/source/LANG-ID/` directory.
Here’s an example structure for the `_toctree.yml` file:
<figcaption class="mt-2 text-center text-sm text-gray-500">The image shows a diagram of the stages of a Swin model.</figcaption>
</div>
[`AutoBackbone`] lets you use pretrained models as backbones to get feature maps from different stages of the backbone. You should specify one of the following parameters in [`~PretrainedConfig.from_pretrained`]:
[`AutoBackbone`] lets you use pretrained models as backbones to get feature maps from different stages of the backbone. You should specify one of the following parameters in [`~PreTrainedConfig.from_pretrained`]:
* `out_indices` is the index of the layer you want to get the feature map from
* `out_features` is the name of the layer you want to get the feature map from
@ -115,8 +115,6 @@
## AutoModel
<frameworkcontent>
<pt>
The `AutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`AutoModelForSequenceClassification.from_pretrained`]:
```py
@ -143,25 +141,4 @@
Generally, we recommend using the `AutoTokenizer` class and the `AutoModelFor` class to load pretrained instances of models. This ensures you load the correct architecture every time. In the next tutorial, learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning.
</pt>
<tf>
Finally, the `TFAutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`TFAutoModelForSequenceClassification.from_pretrained`]:
Generally, we recommend using the `AutoTokenizer` class and the `TFAutoModelFor` class to load pretrained instances of models. This ensures you load the correct architecture every time. In the next tutorial, learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning.
By default, Hugging Face classes like [`TextGenerationPipeline`] or [`AutoModelForCausalLM`] load the model in "float32" precision. This means it needs 4 bytes (32 bits) per parameter, so an "8B" model with 8 billion parameters needs ~32GB of memory. However, this can be wasteful! Most modern language models are trained in "bfloat16" precision, which uses only 2 bytes per parameter. If your hardware supports it (Nvidia 30xx/Axxx or newer), you can load the model in "bfloat16" precision, using the "torch_dtype" argument as we did above.
By default, Hugging Face classes like [`TextGenerationPipeline`] or [`AutoModelForCausalLM`] load the model in "float32" precision. This means it needs 4 bytes (32 bits) per parameter, so an "8B" model with 8 billion parameters needs ~32GB of memory. However, this can be wasteful! Most modern language models are trained in "bfloat16" precision, which uses only 2 bytes per parameter. If your hardware supports it (Nvidia 30xx/Axxx or newer), you can load the model in "bfloat16" precision, using the "dtype" argument as we did above.
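The arithmetic behind these figures can be checked with a short sketch. The helper `weight_memory_gb` is hypothetical, it counts only the model weights (not activations or the key-value cache), and it uses 1 GB = 10^9 bytes as a simplifying assumption.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "bfloat16"):
    print(dtype, weight_memory_gb(8e9, dtype), "GB")
```

Halving the bytes per parameter halves the weight memory, which is why loading an 8B model in bfloat16 needs roughly 16GB instead of 32GB.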
It is possible to go even lower than 16 bits using "quantization", a method that compresses model weights lossily. This allows each parameter to be squeezed down to 8 bits, 4 bits, or even less. Note that, especially at 4 bits, the quality of the model's output may be negatively affected, but this is often a tradeoff worth making to fit a larger and more capable chat model in memory. Let's see how to apply this with the `bitsandbytes` library:
Once you are satisfied with your model configuration, you can save it with [`~PretrainedConfig.save_pretrained`]. Your configuration file is stored as a JSON file in the specified save directory:
Once you are satisfied with your model configuration, you can save it with [`~PreTrainedConfig.save_pretrained`]. Your configuration file is stored as a JSON file in the specified save directory:
The next step is to create a [model](main_classes/models). The model, sometimes also referred to as the architecture, defines what each layer does and which operations are performed. Attributes like `num_hidden_layers` from the configuration are used to define the architecture. All models share the base class [`PreTrainedModel`] and a few common methods such as resizing input embeddings and pruning self-attention heads. In addition, all models are subclasses of either [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) or [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html). This means models are compatible with every usage of their respective framework.
<frameworkcontent>
<pt>
Load your custom configuration attributes into the model:
This creates a model with random values instead of pretrained weights. This model will not be useful until it is trained. Training is a costly and time-consuming process. It is generally better to use a pretrained model to obtain better results faster, while using only a fraction of the resources required for training.
Create a pretrained model with [`~TFPreTrainedModel.from_pretrained`]:
When you load pretrained weights, the default model configuration is loaded automatically if the model is provided by 🤗 Transformers. However, you can still replace some or all of the default model configuration attributes with your own:
At this point, you have a base DistilBERT model which outputs the *hidden states*. The hidden states are passed as inputs to a model head to produce the final output. 🤗 Transformers provides a different model head for each task as long as the model supports the task (i.e., you can't use DistilBERT for a sequence-to-sequence task like translation).
<frameworkcontent>
<pt>
For example, [`DistilBertForSequenceClassification`] is a base DistilBERT model with a sequence classification head. The sequence classification head is a linear layer on top of the pooled outputs.
For example, [`TFDistilBertForSequenceClassification`] is a base DistilBERT model with a sequence classification head. The sequence classification head is a linear layer on top of the pooled outputs.
Reuse this checkpoint for another task by switching to a different model head. For a question answering task, you would use the [`TFDistilBertForQuestionAnswering`] model head. The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output.
In our example, we will take a couple of arguments of the ResNet class that we might want to tweak. Different configurations will then give us the different types of ResNets that are possible. We then store those arguments after checking their validity.
```python
from transformers import PretrainedConfig
from transformers import PreTrainedConfig
from typing import List
class ResnetConfig(PretrainedConfig):
class ResnetConfig(PreTrainedConfig):
    model_type = "resnet"
    def __init__(
@ -58,11 +58,11 @@ class ResnetConfig(PretrainedConfig):
```
The three important things to remember when writing your own configuration are the following:
- you have to inherit from `PretrainedConfig`,
- the `__init__` of your `PretrainedConfig` must accept any kwargs,
- you have to inherit from `PreTrainedConfig`,
- the `__init__` of your `PreTrainedConfig` must accept any kwargs,
- those kwargs need to be passed to the superclass `__init__`.
The inheritance makes sure you get all the functionality from the 🤗 Transformers library, while the second and third constraints come from the fact that a `PretrainedConfig` has more fields than the ones you are setting. When reloading a config with the `from_pretrained` method, those fields need to be accepted by your config and then sent to the superclass.
The inheritance makes sure you get all the functionality from the 🤗 Transformers library, while the second and third constraints come from the fact that a `PreTrainedConfig` has more fields than the ones you are setting. When reloading a config with the `from_pretrained` method, those fields need to be accepted by your config and then sent to the superclass.
Defining a `model_type` for your configuration (here `model_type="resnet"`) is not mandatory, unless you want to
register your model with the auto classes (see last section).
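The three rules can be illustrated without installing anything by using a stand-in base class. Everything here is a sketch under stated assumptions: `BaseConfigStandIn` is a hypothetical substitute for the library's config base class (it simply stores extra kwargs as attributes), and the `block_type` and `layers` parameters are illustrative choices, not the full argument list of a real config.

```python
class BaseConfigStandIn:
    """Hypothetical stand-in for the config base class: keeps extra kwargs as attributes."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


class ResnetConfigSketch(BaseConfigStandIn):  # rule 1: inherit from the base class
    model_type = "resnet"

    def __init__(self, block_type: str = "bottleneck", layers=(3, 4, 6, 3), **kwargs):  # rule 2: accept kwargs
        # Validate the arguments before storing them.
        if block_type not in ("basic", "bottleneck"):
            raise ValueError(f"`block_type` must be 'basic' or 'bottleneck', got {block_type}.")
        self.block_type = block_type
        self.layers = list(layers)
        super().__init__(**kwargs)  # rule 3: forward kwargs to the superclass


config = ResnetConfigSketch(block_type="basic", some_base_field=1)
print(config.block_type, config.layers, config.some_base_field)
```

Fields the subclass does not know about (here the made-up `some_base_field`) still reach the base class, which is the situation that arises when a saved config is reloaded.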
There are many [generation strategies](generation_strategies), and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them.
> Almost all models are trained in bfloat16 nowadays, so there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model.
If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"torch_dtype"`, for example [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., torch_dtype=...)` except when the original type is float32, in which case one can use either `float16` or `bfloat16` for inference.
If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"dtype"`, for example [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., dtype=...)` except when the original type is float32, in which case one can use either `float16` or `bfloat16` for inference.
Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.
@ -231,7 +231,7 @@ flush()
Let's see what peak GPU memory consumption 4-bit quantization gives. Quantizing the model to 4-bit can be done with the same API as before, this time by passing `load_in_4bit=True` instead of `load_in_8bit=True`.
```python
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_4bit=True, pad_token_id=0)
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", quantization_config=BitsAndBytesConfig(load_in_4bit=True), pad_token_id=0)
Our OctoCoder model now gets a significantly longer input prompt which includes a so-called *system prompt*. System prompts are used to steer the LLM into a better assistant that is tailored to the users' task.
In the following, we use a system prompt that will make OctoCoder a better coding assistant.
```python
system_prompt = """Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
That said, the assistant is practical really does its best, and doesn't let caution get too much in the way of being useful.
The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
-----
Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.
Answer: Sure. Here is a function that does that.
def alternating(list1, list2):
   results = []
   for i in range(len(list1)):
       results.append(list1[i])
       results.append(list2[i])
   return results
Question: Can you write some test cases for this function?
Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
Answer: Here is the modified function.
def alternating(list1, list2):
   results = []
   for i in range(min(len(list1), len(list2))):
       results.append(list1[i])
       results.append(list2[i])
   if len(list1) > len(list2):
       results.extend(list1[i+1:])
   else:
       results.extend(list2[i+1:])
   return results
-----
"""
```
For demonstration purposes, we repeat the system prompt ten times so that the input length is long enough to observe Flash Attention's memory savings.
We append the original text prompt "Question: Please write a function in Python that transforms bytes to Giga bytes."
```python
long_prompt = 10 * system_prompt + prompt
```
We instantiate our model again in bfloat16 precision.
```python
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto")
Let's now run the model just like before *without Flash Attention* and measure the peak GPU memory requirement and inference time.
```python
import time
start_time = time.time()
result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
print(f"Generated in {time.time() - start_time} seconds.")
result
```
**Output**:
```
Generated in 10.96854019165039 seconds.
Sure. Here is a function that does that.
def bytes_to_giga(bytes):
    return bytes / 1024 / 1024 / 1024
Answer: Sure. Here is a function that does that.
def
```
We're getting the same output as before, however this time the model repeats the answer multiple times until it is cut off at 60 tokens. This is not surprising, as we repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself.
**Note** that the system prompt should not be repeated ten times in real-world applications; once is enough!
As we can see, the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. Generation also takes longer, about 11 seconds in the run shown above.
For comparison, let's run the same function, but enable Flash Attention instead.
To do so, we convert the model to [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview) and by doing so enable PyTorch's [SDPA self-attention](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), which in turn is able to use Flash Attention.
```python
model.to_bettertransformer()
```
Now we run the exact same code snippet as before, and under the hood Transformers will make use of Flash Attention.
```py
start_time = time.time()
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
print(f"Generated in {time.time() - start_time} seconds.")
result
```
**Output**:
```
Generated in 3.0211617946624756 seconds.
Sure. Here is a function that does that.
def bytes_to_giga(bytes):
    return bytes / 1024 / 1024 / 1024
Answer: Sure. Here is a function that does that.
def
```
We're getting the exact same result as before, but can observe a very significant speed-up thanks to Flash Attention.
We are almost back to our original 29GB peak GPU memory from the beginning.
We can observe that we only use roughly 100MB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequence as done in the beginning.
```py
flush()
```
For more information on how to use Flash Attention, please have a look at [this doc page](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#flashattention-2).
## 3. Architectural Innovations
So far we have looked into improving computational and memory efficiency by:
@ -640,7 +472,7 @@ for _ in range(5):
next_token_id = torch.argmax(next_logits, dim=-1)
print("shape of input_ids", next_token_id.shape)
print("length of key-value cache", len(past_key_values[0][0])) # past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim]
print("length of key-value cache", past_key_values.get_seq_length()) # past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim]
Converting a checkpoint for another framework is easy. Make sure you have PyTorch and TensorFlow installed (see [here](installation) for installation instructions), and then find the specific model for your task in the other framework.
<frameworkcontent>
<pt>
Specify `from_tf=True` to convert a checkpoint from TensorFlow to PyTorch:
Sharing a model to the Hub is as simple as adding an extra parameter or callback. Remember from the [fine-tuning tutorial](training), the [`TrainingArguments`] class is where you specify hyperparameters and additional training options. One of these training options includes the ability to push a model directly to the Hub. Set `push_to_hub=True` in your [`TrainingArguments`]:
@ -127,29 +99,6 @@ pip install huggingface_hub
```py
>>> trainer.push_to_hub()
```
</pt>
<tf>
Share a model to the Hub with [`PushToHubCallback`]. In the [`PushToHubCallback`] function, add:
- An output directory for your model.
- A tokenizer.
- The `hub_model_id`, which is your Hub username and model name.
* Click on the **Edit model card** button in your model repository.
Take a look at the [DistilBert](https://huggingface.co/distilbert/distilbert-base-uncased) card for a good example of the type of information a model card should include. For more details about other options you can control in the `README.md` file, such as a model's carbon footprint or widget examples, refer to the documentation [here](https://huggingface.co/docs/hub/models-cards).
| [How to fine-tune a model on summarization](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb)| Shows how to preprocess the data and fine-tune a pretrained model on XSUM. | [](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)| [](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)|
| [How to train a language model from scratch](https://github.com/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb)| Highlights all the steps to effectively train a Transformer model on custom data | [](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb)| [](https://studiolab.sagemaker.aws/import/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb)|
| [How to generate text](https://github.com/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)| [](https://studiolab.sagemaker.aws/import/github/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb)|
| [How to generate text (with constraints)](https://github.com/huggingface/blog/blob/main/notebooks/53_constrained_beam_search.ipynb)| How to guide language generation with user-provided constraints | [](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/53_constrained_beam_search.ipynb)| [](https://studiolab.sagemaker.aws/import/github/huggingface/blog/blob/main/notebooks/53_constrained_beam_search.ipynb)|
| [Reformer](https://github.com/huggingface/blog/blob/main/notebooks/03_reformer.ipynb)| How Reformer pushes the limits of language modeling | [](https://colab.research.google.com/github/patrickvonplaten/blog/blob/main/notebooks/03_reformer.ipynb)| [](https://studiolab.sagemaker.aws/import/github/patrickvonplaten/blog/blob/main/notebooks/03_reformer.ipynb)|
If the model is too large for a single GPU and you are using PyTorch, you can set `torch_dtype='float16'` to enable FP16 inference. This usually does not cause a significant drop in performance, but make sure to evaluate it on your models!
If the model is too large for a single GPU and you are using PyTorch, you can set `dtype='float16'` to enable FP16 inference. This usually does not cause a significant drop in performance, but make sure to evaluate it on your models!
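To see why FP16 helps a model fit on one GPU, the memory taken by the weights alone can be estimated with a little arithmetic. The sketch below is only an illustration: the helper name `weight_memory_gib` and the parameter count are assumptions, and it ignores activations, caches, and framework overhead.

```py
# Bytes needed to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Estimate the memory taken by the model weights alone, in GiB.

    Hypothetical helper for illustration; real usage is higher because of
    activations, caches, and framework overhead.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumed example: a model with 7 billion parameters.
num_params = 7_000_000_000
print(f"float32: {weight_memory_gib(num_params, 'float32'):.1f} GiB")
print(f"float16: {weight_memory_gib(num_params, 'float16'):.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is often the difference between fitting on a single device or not.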
Alternatively, you can set `device_map="auto"` to automatically determine how the model weights are loaded and stored. Using the `device_map` argument requires the 🤗 [Accelerate](https://huggingface.co/docs/accelerate) library:
Use [`AutoModelForSequenceClassification`] and [`AutoTokenizer`] to load the pretrained model and its associated tokenizer (more on `AutoClass` in the next section):
```py
@ -132,18 +120,6 @@ label: NEGATIVE, with score: 0.5309
Use [`TFAutoModelForSequenceClassification`] and [`AutoTokenizer`] to load the pretrained model and its associated tokenizer (more on `TFAutoClass` in the next section):
Specify the model and tokenizer in the [`pipeline`]. Now you can apply the `classifier` to French text:
@ -192,8 +168,6 @@ label: NEGATIVE, with score: 0.5309
The tokenizer can also accept a list of inputs, and it pads and truncates the text to return a batch of uniform length:
<frameworkcontent>
<pt>
```py
>>> pt_batch = tokenizer(
@ -204,20 +178,6 @@ label: NEGATIVE, with score: 0.5309
...     return_tensors="pt",
... )
```
</pt>
<tf>
```py
>>> tf_batch = tokenizer(
...     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
...     padding=True,
...     truncation=True,
...     max_length=512,
...     return_tensors="tf",
... )
```
</tf>
</frameworkcontent>
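What `padding=True` and `truncation=True` do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the function name `pad_and_truncate` and the padding id `0` are assumptions, and the inputs are already-tokenized id lists.

```py
from typing import Dict, List


def pad_and_truncate(batch: List[List[int]], max_length: int, pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Truncate each sequence to max_length, then pad to the longest one left.

    Hypothetical helper mirroring the idea of padding=True and truncation=True.
    """
    truncated = [ids[:max_length] for ids in batch]
    target = max((len(ids) for ids in truncated), default=0)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate([[1, 2, 3, 4, 5], [6, 7]], max_length=4)
print(encoded["input_ids"])       # [[1, 2, 3, 4], [6, 7, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```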
<Tip>
@ -227,8 +187,6 @@ label: NEGATIVE, with score: 0.5309
### AutoModel
<frameworkcontent>
<pt>
🤗 Transformers provides a simple and unified way to load pretrained models. This means you can load an [`AutoModel`] just as you would load an [`AutoTokenizer`]. The only difference is selecting the correct [`AutoModel`] for the task. For text (or sequence) classification, you should load [`AutoModelForSequenceClassification`]:
```py
@ -264,39 +222,6 @@ label: NEGATIVE, with score: 0.5309
🤗 Transformers provides a simple and unified way to load pretrained instances. This means you can load a [`TFAutoModel`] just as you would load an [`AutoTokenizer`]. The only difference is selecting the correct [`TFAutoModel`] for the task. For text (or sequence) classification, you should load [`TFAutoModelForSequenceClassification`]:
One particularly useful 🤗 Transformers feature is the ability to save a model and reload it as either a PyTorch or TensorFlow model. The `from_pt` or `from_tf` parameter can convert the model from one framework to the other:
- The example script downloads and preprocesses a dataset from the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library.
- The script then fine-tunes a model on that dataset with Keras, using an architecture that supports summarization.
- The following example shows how to fine-tune [T5-small](https://huggingface.co/google-t5/t5-small) on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset.
- The T5 model requires an additional `source_prefix` argument because of how it was trained. This prompt lets T5 know that this is a summarization task.
## Run a script on a Tensor Processing Unit (TPU)
<frameworkcontent>
<pt>
Tensor Processing Units (TPUs) are specifically designed to accelerate performance. PyTorch supports TPUs with the [XLA](https://www.tensorflow.org/xla) deep learning compiler (see [here](https://github.com/pytorch/xla/blob/master/README.md) for more details). To use a TPU, launch the `xla_spawn.py` script and use the `num_cores` argument to set the number of TPU cores you want to use.
Tensor Processing Units (TPUs) are specifically designed to accelerate performance. TensorFlow scripts use a [`TPUStrategy`](https://www.tensorflow.org/guide/distributed_training#tpustrategy) for training on TPUs. To use a TPU, pass the name of the TPU resource to the `tpu` argument.
Another helpful option to enable is resuming training from a previous checkpoint. This ensures you can pick up where you left off without starting over if your training is interrupted. There are two methods to resume training from a checkpoint.
The first method uses the `output_dir previous_output_dir` argument to resume training from the latest checkpoint stored in `output_dir`. In this case, you should remove `overwrite_output_dir`:
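To make "the latest checkpoint stored in `output_dir`" concrete, here is a minimal sketch of how such a lookup could work. It assumes checkpoints are saved as subdirectories named `checkpoint-<step>`; the helper `find_latest_checkpoint` is hypothetical and is not the function the example scripts call.

```py
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: one subdirectory per save, e.g. "checkpoint-500".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare steps numerically so checkpoint-1000 beats checkpoint-90.
            if step > best_step:
                best_step, best_path = step, child
    return best_path
```

Comparing the step numbers as integers matters: a plain string sort would rank `checkpoint-90` above `checkpoint-1000`.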
Now create a batch of examples using [`DataCollatorForLanguageModeling`]. It is more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
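The saving from dynamic padding can be illustrated by counting padding tokens under both strategies. The sketch below is a toy calculation under stated assumptions: `padding_cost` is a hypothetical helper, the sequence lengths are made up, and every length is assumed to be at most `max_length`.

```py
from typing import List, Tuple


def padding_cost(lengths: List[int], batch_size: int, max_length: int) -> Tuple[int, int]:
    """Count padding tokens for dynamic padding versus padding to max_length.

    Returns (dynamic, static). Assumes every length is <= max_length.
    """
    dynamic = 0
    static = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        # Dynamic padding: pad only up to the longest sequence in this batch.
        dynamic += sum(longest - n for n in batch)
        # Static padding: pad every sequence up to the global maximum.
        static += sum(max_length - n for n in batch)
    return dynamic, static


dynamic, static = padding_cost([3, 5, 2, 8], batch_size=2, max_length=10)
print(dynamic, static)  # 8 22
```

In this toy case dynamic padding adds 8 padding tokens where padding to the maximum length adds 22.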
<frameworkcontent>
<pt>
Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:
Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]:
```py
>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
```py
>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!
```
This can be done by specifying where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]:
Finally, you're ready to start training your model! Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) with your training and validation datasets, the number of epochs, and your callback to train the model:
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
</tf>
</frameworkcontent>
<Tip>
@ -365,8 +280,6 @@ Perplexity: 49.61
[{'generated_text':"Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
```
</pt>
<tf>
Tokenize the text and return the `input_ids` as TensorFlow tensors:
Use the [`~transformers.generation_tf_utils.TFGenerationMixin.generate`] method to create the summary. For more details about the different text generation strategies and the parameters for controlling generation, see the [Text generation strategies](../generation_strategies) page.
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
Now create a batch of examples using [`DataCollatorForLanguageModeling`]. It is more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
<frameworkcontent>
<pt>
Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:
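The effect of `mlm_probability` can be sketched in plain Python. This is a deliberately simplified illustration and not the collator's implementation: the function `mask_tokens` and the token ids are assumptions, every selected token is replaced by the mask id, and the label `-100` is the value conventionally ignored by the loss.

```py
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # label value conventionally ignored when computing the loss


def mask_tokens(
    input_ids: List[int],
    mask_token_id: int,
    mlm_probability: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Mask each token independently with probability mlm_probability.

    Hypothetical, simplified helper: selected tokens are always replaced by
    the mask id, and only those positions keep a real label.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            masked.append(mask_token_id)
            labels.append(token_id)  # the model must predict the original token
        else:
            masked.append(token_id)
            labels.append(IGNORE_INDEX)  # position does not contribute to the loss
    return masked, labels


# Because the draw is random, each pass over the data masks different tokens.
masked, labels = mask_tokens([11, 12, 13, 14, 15], mask_token_id=99)
print(masked, labels)
```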
Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]:
```py
>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
```py
>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!
```
This can be done by specifying where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]:
Finally, you're ready to start training your model! Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) with your training and validation datasets, the number of epochs, and your callback to fine-tune the model:
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that all Transformers models have a task-appropriate loss function by default, so you don't need to specify one unless you want to:
```py
>>> model.compile(optimizer=optimizer)  # No loss argument!
```
The last two things to set up before you start training are computing the accuracy of the predictions and providing a way to push your model to the Hub. Both can be done with [Keras callbacks](../main_classes/keras_callbacks).
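For reference, an accuracy-style `compute_metrics` function can be sketched without any dependencies. This is an assumption-laden illustration: it takes a `(logits, labels)` pair of plain Python lists, whereas the callback in practice works with arrays, and the guide's own function may be built differently.

```py
from typing import Dict, List, Sequence, Tuple


def compute_metrics(eval_pred: Tuple[List[Sequence[float]], List[int]]) -> Dict[str, float]:
    """Compute accuracy from per-class scores and integer labels.

    Hypothetical sketch operating on plain lists rather than arrays.
    """
    logits, labels = eval_pred
    # The predicted class is the index of the highest score in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    if len(labels) == 0:
        return {"accuracy": 0.0}
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


result = compute_metrics(([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0]))
print(result)  # two of the three predictions match the labels
```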
Pass your `compute_metrics` function to [`~transformers.KerasMetricCallback`]:
Finally, you're ready to start training your model! Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model:
Now create a batch of examples using [`DefaultDataCollator`]. Unlike other data collators in 🤗 Transformers, the [`DefaultDataCollator`] does not apply any additional preprocessing such as padding.
Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]:
```py
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_squad["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_squad["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
```py
>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)
```
The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]:
Finally, you're ready to start training your model! Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) with your training and validation datasets, the number of epochs, and your callback to fine-tune the model:
'176 billion parameters and can generate text in 46 languages natural languages and 13'
```
</tf>
</frameworkcontent>
Some files were not shown because too many files have changed in this diff